Categories
News

Rowboat vs Claude Cowork: Local Open-Source AI Coworker With a Knowledge Graph

Rowboat (rowboatlabs/rowboat, Apache-2.0) is positioning itself as a free, local-first alternative to Anthropic’s Claude Cowork: an open-source desktop coworker that turns Gmail, calendar, and meeting notes into an Obsidian-style knowledge graph, then acts on that context with your choice of Ollama, LM Studio, or hosted LLMs—without locking you to Claude subscriptions or cloud-only memory.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    S[Gmail Calendar Fireflies] --> G[Markdown knowledge graph]
    G --> A[Rowboat agent]
    A --> M[MCP tools Exa Slack GitHub]
    A --> L[Local or hosted LLM]
    L --> O[Briefs decks PDFs voice]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class S,O agent
    class G,A,M,L hook
Gmail and meetings to Markdown graph local agent and BYO LLM

What Claude Cowork is (baseline)

Claude Cowork is Anthropic’s agentic layer inside the Claude desktop app (macOS and Windows): you grant access to a folder (and optional connectors), describe an outcome, and Claude plans multi-step work across local files, connectors, and the browser. It shares DNA with Claude Code but targets non-technical knowledge work—sorting downloads, synthesising reports, spreadsheet cleanup, scheduled tasks on paid plans.

  • Models: Anthropic Claude only (e.g. Opus-class on paid tiers).
  • Cost: Bundled with paid Claude plans—not on the free chat tier.
  • Connectors: 38+ native integrations (Gmail, Slack, Notion, GitHub, etc.) per Anthropic’s connector docs.
  • Strength: polished UX, computer/browser control, enterprise features (with noted gaps on Cowork in audit/compliance APIs).
Paid Anthropic coworker versus open source local knowledge graph app

What Rowboat adds

Rowboat’s thesis is long-lived work memory, not cold retrieval each session. It ingests email and meetings, extracts people/projects/decisions/commitments, and writes them as linked Markdown you can open in Obsidian. An on-machine agent then drafts, plans, generates PDF decks, or produces voice briefs—grounded in that graph.

CapabilityRowboat
Knowledge storeLocal Obsidian-compatible vault (backlinks, editable)
InputsGmail, Google Calendar, Rowboat notes, Fireflies; Composio library
LLMOllama, LM Studio, or bring-your-own API keys
ToolsMCP servers (~/.rowboat/config/mcp.json), optional Composio
VoiceOptional Deepgram (input) and ElevenLabs (output)
Web researchOptional Exa key; live notes via @rowboat on a note
Background workScheduled tasks; HN notes: shell on background tasks still limited—MCP/built-ins OK
LicenceApache-2.0 (~14k+ GitHub stars)

Side-by-side: Cowork vs Rowboat

DimensionClaude CoworkRowboat
PricePaid Claude subscriptionFree app; you pay only for optional APIs/models
Model choiceClaude onlyLocal or any hosted provider
Memory modelProjects / Anthropic-managed contextExplicit local knowledge graph (Markdown)
Privacy postureCloud inference + connectorsGraph on disk; LLM can stay on localhost via Ollama
ExtensibilityAnthropic connectors + MCP ecosystemMCP + Composio; plain config files under ~/.rowboat/config/
Best for“Do this folder task end-to-end” with top-tier Claude“Remember my work and prep/act with compounding context”

Install and configure Rowboat

Official installers for Mac, Windows, and Linux: rowboatlabs.com/downloads or GitHub releases. Optional setup files (all JSON with {"apiKey": "..."}):

  • google-setup.md — Gmail, Calendar, Drive (graph building is read-only on inbox today)
  • ~/.rowboat/config/deepgram.json — voice input
  • ~/.rowboat/config/elevenlabs.json — voice output / briefs
  • ~/.rowboat/config/exa-search.json — Exa research search
  • ~/.rowboat/config/composio.json — Composio tool library
  • ~/.rowboat/config/mcp.json — custom MCP servers

Example prompts from the README: “Prep me for my meeting with Alex”, “Build me a deck about our next quarter roadmap” (PDF from graph context), or live notes that track a person/project across sources.

Another “local Cowork” path (different project)

If you specifically want the Claude Desktop Cowork UI routing to local models, community projects such as local-ollama-claude-desktop-server proxy Cowork sessions to Ollama instead of Anthropic’s cloud. That is orthogonal to Rowboat: Rowboat is its own app and memory model, not a patch on Claude Desktop.

When to pick which

Choose Claude Cowork if…Choose Rowboat if…
You already pay for Claude Max/Pro and want Anthropic polishYou want $0 software and optional local inference
Tasks are file-folder automation with browser controlYou need durable, inspectable work memory across email/meetings
Claude-only is acceptableYou must swap models (Ollama today, another host tomorrow)
Enterprise connector catalogue matters mostMarkdown vault + MCP fits your ops/security model

Operational summary

QuestionAnswer
Is Rowboat really “100% local”?Graph and app are local-first; connectors (Google, Exa, hosted LLM) are optional and configurable
Default LLM?You configure Ollama, LM Studio, or an API provider
Star count?~14k+ on GitHub (growing; social posts cited ~14.3k)
DemoYouTube walkthrough linked from README
CommunityDiscord · Show HN discussion May 2026

Rowboat does not clone every Cowork feature on day one—background shell automation and inbox write-actions are still evolving—but it attacks the harder problem for many knowledge workers: context that compounds without renting a closed memory silo. For teams comparing agent desktops, treat Cowork as the polished proprietary agent and Rowboat as the inspectable, model-agnostic graph-first coworker you can run on your own machine.

Categories
News

SkillOpt Explained: Train Agent SKILL.md Files With Validation Gates, Not Hope

Microsoft Research’s SkillOpt (arXiv:2605.23904) reframes agent skills: instead of hand-writing SKILL.md files and hoping they generalise, you train the skill document as external state while the target model stays frozen—complete with rollout batches, bounded textual edits, held-out validation gates, and a deployable best_skill.md artefact.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    S[Initial skill] --> R[Rollouts on train split]
    R --> O[Optimizer reflects on trajectories]
    O --> E[Bounded add/delete/replace edits]
    E --> V{Held-out selection improves?}
    V -->|yes| A[Accept to best_skill.md]
    V -->|no| B[Rejected-edit buffer]
    B --> O
    A --> D[Deploy: zero extra inference calls]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class S,A,D agent
    class R,O,E hook
    class V,B decision
Frozen model rollouts bounded edits and held-out validation for best_skill.md

Why skills need an optimizer, not another rewrite loop

Agent skills package procedures—tool policies, output formats, failure handling—for Cursor, Claude Code, Codex, and similar harnesses. Most teams still author them manually, generate them once with an LLM, or let agents “self-improve” with weak acceptance criteria. SkillOpt argues that none of that behaves like reproducible training: you need the same discipline as weight-space optimisation, but over a single auditable markdown document.

The loop is harness-agnostic via adapters: a frozen target model runs tasks with the current skill; a separate optimizer model proposes structured edits from trajectory reflection; only edits that strictly improve a held-out selection score are kept. Rejected edits land in a buffer so the optimizer does not repeat harmful changes. An epoch-wise slow/meta update writes durable lessons into a protected field that fast edits cannot overwrite.

Validation gate bounded edits compact skills harness transfer verifiers

Six lessons for builders (paper + field notes)

LessonWhat SkillOpt showsPractical takeaway
1. Validation gate > volumeBest runs accept only 1–4 edits end-to-end; ties rejectedIf your self-editing agent accepts most proposals, you are shipping noise
2. Bounded editsTextual learning rate lr=4 edits/step beats unbounded rewrite (ablation collapses without budget)Cap diff size for any LLM-as-author loop (docs, prompts, skills)
3. CompactnessDeployed skills often 300–2,000 tokens after trainingLength is not quality; optimise for signal density
4. Harness transferCodex-trained spreadsheet skill → Claude Code: +59.7 pts on SpreadsheetBenchProcedural knowledge can outlive the runtime that produced it
5. Frozen model + trained contextGPT-5.4-nano + optimised skill ≈ strong frontier behaviour on procedural benchmarksDomain adaptation without fine-tuning weights
6. Verification bottleneckEvery gate uses auto-graders (works on benchmarks)Open-ended writing/design/strategy still needs human or better verifiers

Headline results (GPT-5.5, direct chat)

Across six benchmarks, seven target models, and three harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells versus human skills, one-shot LLM skills, Trace2Skill, TextGrad, GEPA, and EvoSkill.

BenchmarkNo skillSkillOptGain
SearchQA77.787.3+9.6
SpreadsheetBench41.880.7+38.9
OfficeQA33.172.1+39.0
DocVQA78.891.2+12.4
LiveMathematicianBench37.666.9+29.3
ALFWorld83.695.5+11.9

Average lift over no skill: +23.5 points (direct chat), +24.8 (Codex harness), +19.1 (Claude Code harness) on GPT-5.5. Deployment adds zero extra inference-time optimizer calls—you ship the final markdown skill only.

Method mechanics (the “gradient descent for SKILL.md” analogy)

  • Rollout batch size — evidence noise vs throughput (paper sweeps 8–full epoch).
  • Reflection minibatches — separate failure/success groups so edits capture patterns, not one-off anecdotes.
  • Patch vs rewrite — local add/delete/replace vs full rewrite; fast edits cannot clobber protected slow-update sections.
  • Schedules — constant, linear, cosine, or autonomous edit budgets (cosine default: large early steps, smaller consolidation later).
  • Rejected-edit buffer — failed proposals become negative feedback within the epoch.

Skill routing: description vs body

Independent of Microsoft’s paper, multi-skill harnesses expose two surfaces: the description (what the router sees before activation) and the body (what the agent sees after). They can disagree silently—rewriting descriptions moved individual skills by 23–25 percentage points in recent cross-model SDK tests while corpus averages barely moved. Optimise and evaluate per skill, not only aggregate accuracy.

Protected slow state vs fast state

SkillOpt’s protected section invariant mirrors a fast/slow memory split: durable voice or tone guides should not be overwritten by high-churn logs. Ablations show removing that mechanism cost roughly 22 points on SpreadsheetBench—worth adopting in any self-editing skill repo that mixes stable principles with ephemeral task memory.

How this fits the wider skills ecosystem

SkillsBench shows curated skills help on average (+16.2 pp) but effects vary by domain—and self-generated skills often do not help without proper optimisation. SkillOpt is the optimizer-shaped complement: move from “store procedures in markdown” to “measure and train markdown with validation gates.” Related evolution work (Trace2Skill, EvoSkill, GEPA, TextGrad) targets prompts or looser revision; SkillOpt targets one compact, exportable skill file per domain.

Resources

Operational summary

QuestionAnswer
What is being trained?A single natural-language skill document (external state)
What stays frozen?Target model weights and deployment harness
Acceptance rule?Strict improvement on held-out selection split; ties fail
Typical edit budget?~4 edits/step (textual learning rate); unbounded rewrite hurts
Best artifact?best_skill.md, ~300–2k tokens, auditable text
Main open problem?Verifiers for non-benchmark, open-ended work

If your agent stack already uses skills, SkillOpt is the research-backed pattern for turning them from static prompts into measured adaptations: propose small edits, validate on held-out tasks, reject aggressively, and export one portable file. The harness matters less over time; the high-signal skill matters more.

Validation gate bounded edits compact skills harness transfer verifiers

Research supplement

Web search was unavailable in this environment. No external sources could be verified at time of writing. The findings cited in this post are drawn directly from the article's description of arXiv:2605.23904. Readers should consult the paper directly at the arXiv link and the official project page at microsoft.github.io/SkillOpt (linked in the article) to verify benchmark figures, ablation results, and harness-specific numbers before citing them.

---
Categories
News

Grok Build CLI: xAI Terminal Coding Agent with Plan Mode, Subagents, and Headless CI

xAI’s Grok Build is a terminal-native coding agent and CLI—announced 25 May 2026 as an early beta for SuperGrok and X Premium Plus subscribers. It plans multi-step engineering work, edits across files, runs shell commands, and ships with the same grok-build-0.1 model exposed on the xAI API, positioning xAI directly against Claude Code, Codex CLI, and other agentic dev tools.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    R[Repo + AGENTS.md] --> G[grok CLI]
    G --> P{Plan mode?}
    P -->|yes| A[Approve plan]
    P -->|no| T[Tool calls]
    A --> T
    T --> S[Subagents parallel]
    S --> D[Diffs in repo]
    G --> H[Headless grok -p]
    G --> API[grok-build-0.1 API]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class R,D agent
    class G,T,S,H,API hook
    class P decision
Plan mode subagents and diffs with AGENTS.md and MCP discovery

What Grok Build is

Per the Introducing Grok Build announcement and Build documentation, Grok Build is not a chat sidebar—it is an agent loop in your shell:

  • Interactive TUI: fullscreen, mouse-aware terminal UI launched with grok inside a project.
  • Headless mode: grok -p "…" for scripts, CI, and bots—with plain, JSON, or streaming-json output.
  • ACP: Agent Client Protocol support so other apps can host the same agent.
  • API parity: grok-build-0.1 on the Responses API for custom IDE or orchestration stacks.
Interactive TUI headless grok -p and grok-build-0.1 API

Install and authenticate

# macOS / Linux
curl -fsSL https://x.ai/cli/install.sh | bash

# Windows PowerShell
irm https://x.ai/cli/install.ps1 | iex

cd your-project
grok   # first run opens browser sign-in

# Headless servers
export XAI_API_KEY="xai-..."
grok

Access is gated to SuperGrok and X Premium Plus during early beta. The landing page at x.ai/cli mirrors the same install line.

Plan, review, approve

The announcement emphasises plan mode for risky work: Grok drafts a structured approach, you approve, comment on steps, or rewrite the plan before execution. Official TUI behaviour (modes and commands):

  • Plan mode blocks write tools except the session plan file until you are ready.
  • /plan shows the working plan in the TUI; Shift+Tab cycles session modes.
  • After approval, changes surface as clean diffs—the same review pattern teams expect from other coding agents.

Works with your existing agent stack

xAI states that AGENTS.md, plugins, hooks, skills, and MCP servers work out of the box—important for teams that already invested in Cursor/Codex-style conventions. Run grok inspect in a repo to see discovered config sources, instructions, skills, plugins, hooks, and MCP endpoints before you prompt.

Extension surfaceCLI entry
Hooks/hooks (extensions modal)
Plugins/plugins
Skills/skills plus user-invocable /skill-name
MCP/mcps

The product page also highlights /skillify to capture a session as a reusable skill, and integrations such as Linear, Sentry, and Grafana via MCP—typical production-ops wiring for agentic dev workflows.

Parallel subagents and git worktrees

For larger tasks, Grok Build delegates to specialised subagents in parallel—research, implementation, and review can run concurrently. The announcement adds deep git worktree integration: subagents can run in isolated worktrees so parallel edits do not stomp the main branch. That mirrors how power users already parallelise Claude Code or custom orchestrators, but ships as a first-class xAI story.

Headless mode and permissions

grok -p "Explain this codebase"
grok -p "Add unit tests for auth module" --output-format streaming-json

# Skip interactive approvals (CI only—use with care)
grok --always-approve

Default permission behaviour is ask (prompt per tool call). Set global defaults in ~/.grok/config.toml:

[ui]
permission_mode = "always-approve"   # or "ask"

Project-scoped overrides live under .grok/config.toml where supported; user-level permission mode belongs in the home config per docs.

grok-build-0.1 on the API

The CLI model slug aligns with API early access. Example from Getting Started:

curl https://api.x.ai/v1/responses \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-build-0.1",
    "input": "Refactor this function to handle null inputs."
  }'

xAI’s May 2026 model retirement guide routes retired grok-code-fast-1 traffic to grok-build-0.1 and recommends explicit migration for coding workloads—signalling that Build is the dedicated coding line, not a generic Grok 4.3 wrapper.

Useful TUI commands (quick reference)

CommandPurpose
/contextContext window usage
/compactShrink conversation history
/rewindRollback conversation state
/fork, /sessionsBranch or resume sessions
/usageToken and credit usage
/feedbackSend beta feedback to xAI
/btwSide question without interrupting main task
/memory, /dreamPersistent memory search and consolidation

How it compares in the market

DimensionGrok Build angle
DistributionSingle curl install; ties into X/SuperGrok subscriptions
ExtensibilityAGENTS.md + MCP + skills parity with emerging agent standards
AutomationFirst-class -p headless + ACP + API model slug
ParallelismSubagents + worktree isolation advertised upfront
MaturityEarly beta—/feedback is the intended quality loop

Operational summary

QuestionAnswer
Who can install today?SuperGrok and X Premium Plus (early beta)
Default interactive command?grok in project root
CI / bot entrypoint?grok -p "task" with optional streaming-json
Coding API model?grok-build-0.1 (replaces grok-code-fast-1 path)
Safe default for prod automation?permission_mode = "ask" unless you trust the sandbox

Grok Build is xAI’s bet that coding agents belong in the terminal and the API, not only in the Grok chat app. If you already standardise on AGENTS.md and MCP, the migration cost is mostly subscription and habit—run grok inspect, try plan mode on a multi-file refactor, and only then point headless -p jobs at CI once diffs and permissions look right.

Research supplement

Web search was unavailable during generation. No external sources could be verified beyond the article text. The following context is drawn from the author's knowledge of the competitive landscape as of May 2026 and should be confirmed against primary sources before publication.

  • SWE-bench Verified is the standard third-party benchmark for agentic coding performance (pass rate on real GitHub issues). No grok-build-0.1 results were available at time of writing; monitoring swebench.com for new submissions is recommended.
  • AGENTS.md convention was popularized by OpenAI's Codex CLI documentation and has been adopted by multiple CLI agent tools. A primary reference for the spec can be found in OpenAI's Codex CLI repository on GitHub.
  • Agent Client Protocol (ACP) is referenced in the article but not linked. It originates from the BeeAI / IBM Research ecosystem; verifying xAI's specific implementation against the published spec is advised before citing it as standards-compliant.
---
Categories
News

LongCat-Video-Avatar 1.5: Open-Source Talking Heads from One Photo and Audio

LongCat-Video-Avatar 1.5 is Meituan’s MIT-licensed stack for audio-driven talking avatars: feed one portrait (or skip the image entirely), add speech audio and an optional text prompt, and get a lip-synced clip with stable identity—now tuned for 8-step distilled inference and Whisper-Large-v3 lip dynamics instead of the older Wav2Vec2 path.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    A[Audio input] --> W[Whisper-Large-v3 encoder]
    T[Text prompt] --> D[LongCat-Video DiT]
    W --> D
    I[Reference image optional] --> D
    D --> V[Avatar video 480P or 720P]
    V --> C[Video continuation optional]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class A,I agent
    class W,D hook
    class V,C decision
Audio Whisper encoder DiT and lip-synced video output

What changed in version 1.5

Version 1.0 already targeted commercial-style avatars on top of the LongCat-Video foundation. Version 1.5 is an empirical push toward production serving:

  • Whisper-Large-v3 replaces Wav2Vec2 for audio conditioning—smoother mouth shapes and stronger multilingual speech handling.
  • DMD2 step distillation cuts diffusion to 8 NFE (--use_distill, required for v1.5).
  • INT8 DiT (--use_int8) lowers VRAM for self-hosted runs.
  • Long-form stability: identity, full-body motion, and lip sync held across extended generations and object-handling scenes.
  • Broader domains: realistic humans, anime, animals, multi-speaker dialogue, and dual-audio layouts.
AT2V ATI2V continuation and dual-audio modes

Native task modes

ModeInputsUse case
AT2V (audio-text-to-video)Audio + descriptive promptGenerate a speaker without a reference photo
ATI2V (audio-text-image-to-video)Single image + audio + promptClassic talking-head / dubbing from one portrait
Video continuationPrior segment + --num_segmentsExtend clips beyond the first generation window
Dual-audioTwo streams (para or add)Two avatars in one scene (parallel mix vs turn-taking)

Dual-audio semantics (from the official README): para merges two equal-length clips by summing waveforms; add concatenates unequal clips with silence padding (person1 first, then person2).

Why lip sync breaks on other avatar models

Audio-driven avatars usually fail in predictable ways: lips drift after a few seconds, faces morph mid-clip, or identity collapses on longer speech. LongCat’s team positions 1.5 around audio CFG (classifier-free guidance on the audio branch—README recommends 3–5 for best sync), richer text prompts (appearance, action, scene), and continuation controls (--ref_img_index, --mask_frame_range) to reduce repeated gestures without introducing heavy artefacts.

Evaluation design (human + expert)

The model card describes a bespoke benchmark for audio-driven digital humans:

  • 508 image–audio source pairs
  • 6 scenarios: news broadcasting, knowledge education, daily life, entertainment, singing, commercial promotion
  • 2 languages (Chinese and English) and 2 visual styles (realistic and animated)
  • Subjective track: 770 crowd evaluators, 13,240 human-likeness scores (1–5)
  • Objective track: 10 experts scoring physical rationality, audio-visual harmony, temporal stability, identity consistency

The project page also publishes side-by-side comparisons with commercial avatar stacks (HeyGen, Kling Avatar 2.0, OmniHuman-1.5) on lip-sync, singing, animation, and performance clips—useful for qualitative judgement, not a substitute for your own domain tests.

Deployment reality check

TopicDetail
LicenceMIT on released weights (Meituan trademarks/patents excluded)
Weight sizeLarge multi-component bundle (base video model, avatar LoRAs, Whisper, schedulers—plan ~75 GB class storage)
Reference inferencetorchrun --nproc_per_node=2 with context parallel size 2
v1.5 flags--model_type avatar-v1.5 --use_distill; add --use_int8 to save VRAM
Resolution480P or 720P via --resolution
StackPython 3.10, PyTorch 2.6 + CUDA 12.4, FlashAttention-2 (optional FA-3 / xFormers)

Quick start (ATI2V, single person)

git clone --single-branch --branch main https://github.com/meituan-longcat/LongCat-Video
cd LongCat-Video
# conda env + pip install per README (torch, flash-attn, requirements_avatar.txt)

huggingface-cli download meituan-longcat/LongCat-Video --local-dir ./weights/LongCat-Video
huggingface-cli download meituan-longcat/LongCat-Video-Avatar-1.5 --local-dir ./weights/LongCat-Video-Avatar-1.5

# Audio + image to video (edit assets/avatar/single_example_1.json)
torchrun --nproc_per_node=2 run_demo_avatar_single_audio_to_video.py \
  --context_parallel_size=2 \
  --checkpoint_dir=./weights/LongCat-Video-Avatar-1.5 \
  --stage_1=ai2v \
  --input_json=assets/avatar/single_example_1.json \
  --use_distill --model_type avatar-v1.5 --use_int8

Where builders use it

VerticalFitWatch-outs
Localisation / dubbingRe-voice existing faces with new audio tracksConsent, disclosure, and platform policies
E-commerce & educationPresenter clips from still product or instructor photosPrompt quality drives realism
Broadcast-style newsAT2V when no reference portrait existsExpert review for factual content
Multi-host showsDual-audio + multi-person scriptsHeavier GPU footprint than SaaS APIs

Operational summary

QuestionAnswer
Minimum viable inputs?Audio + text (AT2V) or audio + image + text (ATI2V)
Why is 1.5 faster?DMD2 distillation to 8 denoising steps
Best lip-sync knob?Audio CFG roughly 3–5; use avatar-v1.5 + Whisper encoder
Open source?GitHub + Hugging Face weights
vs per-minute SaaS avatars?You own inference cost on your GPUs—trade simplicity for control

For teams already running generative video on a VPS or internal GPU cluster, LongCat-Video-Avatar 1.5 is the rare open-weight release that targets serving economics (8-step + INT8) and sync quality (Whisper audio encoder) together—not just leaderboard demos. Prototype on a two-GPU node, measure sync on your own audio, then decide whether self-hosting beats per-minute commercial APIs for your volume.

Research supplement

The audio encoder upgrade at the center of version 1.5 — replacing Wav2Vec2 with Whisper-Large-v3 — is grounded in a meaningful architectural difference. Wav2Vec2 was pre-trained primarily on English speech for downstream ASR fine-tuning, whereas Whisper-Large-v3 is a multilingual encoder trained on 680,000 hours of supervised audio across 99 languages. For an audio-driven avatar model, the encoder's learned representations directly shape the lip-motion conditioning signal, so the switch is not cosmetic: multilingual and accented speech that Wav2Vec2 mapped poorly gets richer phoneme-level features from Whisper. The official Whisper-Large-v3 model card on Hugging Face documents language coverage, word-error rates, and architectural details relevant to anyone evaluating whether the encoder is appropriate for their target speech domain.

---

References

Categories
News

TRELLIS.2-4B Explained: Single-Image 3D with O-Voxel, PBR GLB, and 3-Second Inference

TRELLIS.2-4B is Microsoft’s open-weight image-to-3D stack: a 4-billion-parameter flow-matching model that emits O-Voxel structured latents, converts to a textured mesh in milliseconds on CUDA, and exports PBR-ready GLB for Blender, Unity, and Unreal—about three seconds end-to-end at 512³ on an NVIDIA H100.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph LR
    I[Input image] --> F[Flow-matching DiT 4B]
    F --> V[Sparse 3D VAE latent]
    V --> O[O-Voxel volume PBR]
    O --> M[Mesh extract]
    M --> G[GLB UV bake]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class I,G agent
    class F,V,O,M hook
Image through flow matching O-Voxel to PBR GLB export

What O-Voxel changes

Most 3D generators still lean on iso-surface fields (SDFs, occupancy grids, Flexicubes-style extractors). Those pipelines struggle with open surfaces, thin parts, and non-manifold geometry—and they often decouple materials from shape. O-Voxel (in the o-voxel library) is a sparse, native 3D representation using a flexible dual grid: geometry and PBR volumetric attributes (base colour, metallic, roughness, opacity for translucency) live in the same structure, with bidirectional mesh conversion that avoids slow SDF flood-fills or iterative optimisation.

  • Field-free: no marching-cubes bottleneck on arbitrary topology.
  • Production path: o_voxel.postprocess.to_glb cleans meshes, optional remesh, UV unwrap, texture bake.
  • Compact storage: custom .vxz sparse encoding (Z-order / Hilbert curves).
512 1024 and 1536 voxel grids with H100 timing breakdown

Speed and resolution ladder

Reported inference on NVIDIA H100 (from the TRELLIS.2 README and model card):

Voxel gridTotal timeShape + materials
512³~3 s2 s + 1 s
1024³~17 s10 s + 7 s
1536³~60 s35 s + 25 s

The social-demo claim of “3 seconds” matches the 512³ preset—not the highest 1536³ quality tier. Mesh extraction from O-Voxel to a renderable surface is advertised as under 100 ms on CUDA for the conversion step; full GLB export adds remeshing, decimation (targets up to 1M faces), and texture sizes up to 4096 px when you crank quality.

Model stack (4B parameters)

ComponentRole
TRELLIS.2-4BFlow-matching transformer; single-image conditioning
Sparse 3D VAE16× spatial downsampling into compact latents
O-VoxelStructured latent holding shape + PBR volume
OutputsPreview MP4 (PBR + HDRI) and sample.glb

Licence: MIT for weights and code. Paper: arXiv:2512.14692 · Project: microsoft.github.io/trellis.2

Quick start (local GPU)

git clone -b main https://github.com/microsoft/TRELLIS.2.git --recursive
# Follow repo setup: CUDA 12.4 recommended, conda env, trellis2 + o_voxel install

from PIL import Image
import torch
from trellis2.pipelines import Trellis2ImageTo3DPipeline
import o_voxel

pipeline = Trellis2ImageTo3DPipeline.from_pretrained("microsoft/TRELLIS.2-4B")
pipeline.cuda()

image = Image.open("your_photo.png")
mesh = pipeline.run(image)[0]

glb = o_voxel.postprocess.to_glb(
    vertices=mesh.vertices,
    faces=mesh.faces,
    attr_volume=mesh.attrs,
    coords=mesh.coords,
    attr_layout=mesh.layout,
    voxel_size=mesh.voxel_size,
    aabb=[[-0.5, -0.5, -0.5], [0.5, 0.5, 0.5]],
    decimation_target=1000000,
    texture_size=4096,
    remesh=True,
    verbose=True,
)
glb.export("asset.glb", extension_webp=True)

Where it fits in a 3D pipeline

WorkflowFitCaveat
Game / real-timeFast concept meshes + PBR GLB importMay need retopo and LOD pass
DCC (Blender)GLB with textures; tweak opacity in viewerAlpha in textures not always auto-enabled
AR / ecommerceRapid asset variants from product photosStyle not RLHF-aligned—prompt by image choice
3D printingStarting mesh onlyRaw output may have small holes—use repo hole-fill scripts for watertight solids

Known limitations (official)

  • Geometric holes: occasional discontinuities; post-process for strict watertight use.
  • No preference alignment: outputs reflect training distribution, not a curated “product render” aesthetic.
  • Hardware: practical runs expect a strong CUDA GPU; timings are quoted for H100-class hardware.

Operational summary

QuestionAnswer
What is new?Field-free O-Voxel latents + 4B flow-matching image-to-3D at interactive speeds
Default “3 s” claim?512³ grid on H100 (~2 s geometry + ~1 s materials)
Export format?GLB with PBR textures; optional WebP texture extension
Open source?GitHub + Hugging Face weights, MIT

For teams prototyping 3D content pipelines, TRELLIS.2-4B is less “another NeRF demo” and more a mesh-first generative primitive: sparse voxels in, game-engine-ready GLB out, with topology and materials handled in one representation rather than chained converters.

Research supplement

Web search was unavailable during generation. The following primary sources are cited directly from the article and should be consulted for primary data verification:

  • arXiv preprint 2512.14692 — the TRELLIS.2 paper; contains architecture ablations and benchmark comparisons not reproduced in the article.
  • Microsoft TRELLIS.2 GitHub repository (https://github.com/microsoft/TRELLIS.2) — README, model card, and o_voxel library installation notes; authoritative source for hardware requirements and licence terms.
  • Hugging Face model pagemicrosoft/TRELLIS.2-4B; verify exact licence text on the model card before commercial deployment, as card terms sometimes diverge from repo code licence.

References

Categories
News

Codex Meta-Prompt: Turn Repeated Sessions Into Skills, Subagents, and Automations

OpenAI Codex can turn repeated engineering work into the smallest useful reusable asset—a skill, a custom subagent, or an automation—when you run a structured meta-prompt that mines recent sessions and local memory instead of guessing from one thread.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    S[Recent Codex sessions] --> E[Collect evidence]
    M[Codex Memories ~/.codex/memories] --> E
    C[Chronicle optional] --> E
    X[Existing skills agents automations] --> E
    E --> G{Gates: 2+ repeats stable I/O material benefit}
    G -->|pass| L[Shortlist high-confidence]
    G -->|fail| D[Discard or defer]
    L --> R{Reuse or extend?}
    R -->|yes| U[Update existing asset]
    R -->|no| P[Create smallest package]
    P --> SK[Skill]
    P --> SA[Subagent]
    P --> AU[Automation]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class S,M,C agent
    class X,U,SK,SA,AU hook
    class G,R,L,P decision
Sessions memories chronicle and gates leading to skill subagent or automation

What the meta-prompt is for

The pattern is deliberately conservative: observe repetition, apply hard gates, prefer reuse over new files, and only then materialise a skill, subagent definition, or automation. Typical candidates from real workflows include CI fix loops, PR review checklists, changelog generation, dependency bumps, test triage, and debugging playbooks—but only when the evidence shows the same inputs, tools, and outputs more than once.

Team-wide rules that must always apply belong in AGENTS.md or checked-in docs. Codex Memories are a helpful local recall layer for preferences and recurring patterns, not a substitute for mandatory policy.

Choose smallest package reuse existing assets subagents need explicit spawn

Evidence order (strongest first)

SourceWhat to extractNotes
Recent Codex sessionsTask summaries, tools invoked, failure modes, final artefactsPrimary signal for “what you actually did”
Codex MemoriesDurable preferences, stacks, pitfalls, workflow habitsStored under ~/.codex/memories/ after memories = true
Chronicle (optional)Tools and UI context you did not restate in chatmacOS research preview; see risks below
Existing skills / agents / automationsNames, triggers, overlap with candidate workReuse or extend—do not duplicate

Creation gates (all required)

  • Frequency: the same class of work appears at least twice in eligible evidence (sessions or consolidated memories).
  • Stable I/O: inputs and outputs are predictable enough to script or document (files, commands, PR URLs, test targets).
  • Material benefit: packaging saves meaningful time, reduces errors, or removes context re-explaining—not a one-off curiosity.
  • No duplicate: an existing skill or agent already covers ≥80% of the workflow—extend it instead of creating a parallel asset.

After gates, produce a shortlist (title, evidence count, proposed package type, one-line benefit). Create only high-confidence items; defer the rest.

Skill vs subagent vs automation

PackageBest whenAvoid when
SkillRepeatable procedure with clear steps (lint, release notes, scaffold)Needs parallel exploration across huge artefacts
Custom subagentRead-heavy parallel work (security scan + test gap + style) with distilled summaries back to main threadWrite-heavy parallel edits (merge conflicts)
AutomationHook- or schedule-driven workflow (CI comment bot, nightly audit)Ad-hoc debugging with unstable inputs

Codex Memories (enable and control)

Memories are off by default and are not available in the EEA, UK, or Switzerland at launch. Enable in Codex settings or in ~/.codex/config.toml:

[features]
memories = true

Codex turns useful context from eligible prior threads into local files under ~/.codex/memories/ (summaries, durable entries, recent inputs). Generation runs in the background after idle time, skips active sessions, redacts secrets, and can pause when rate-limit remaining drops below your threshold. Per-thread control uses /memories in the app or TUI without changing global settings.

SettingRole
memories.generate_memoriesWhether new threads feed memory generation
memories.use_memoriesWhether existing memories inject into sessions
memories.disable_on_external_contextSkip memory gen when MCP / web search / tool search was used
memories.min_rate_limit_remaining_percentFloor before background extraction runs

Chronicle (optional screen context)

Chronicle augments memories with recent screen context so Codex can infer which file, dashboard, or doc you were looking at—then prefer reading that source directly when possible. It is an opt-in research preview for ChatGPT Pro on macOS (not EU/UK/CH), requires Screen Recording and Accessibility permissions, and stores generated memories under ~/.codex/memories_extensions/chronicle/.

  • Rate limits: background sandboxed agents consume quota quickly.
  • Prompt injection: malicious on-screen instructions can influence Codex—pause before untrusted sites.
  • Privacy: pause before sensitive content; screen captures are ephemeral (temp under $TMPDIR/chronicle/screen_recording/).

Subagents (manual, parallel)

Subagent workflows move noisy exploration off the main thread: parallel agents return summaries instead of dumping logs into the parent session (reducing context pollution and rot). Codex does not spawn subagents automatically—you must ask explicitly (“spawn two agents…”, “delegate in parallel”). Token cost is higher than a single agent because each subagent runs its own model and tools.

Review this branch with parallel subagents. Spawn one subagent for security risks,
one for test gaps, and one for maintainability. Wait for all three, then summarise
the findings by category with file references.

For model choice: gpt-5.5 for demanding reasoning; gpt-5.4-mini for fast read-heavy workers; pin model and model_reasoning_effort in agent files when you need consistency.

Example meta-prompt (paste into Codex)

Analyse my recent Codex work and package repeated patterns as the smallest useful
skill, custom subagent, or automation.

Evidence (in order):
1) Recent Codex sessions and task summaries
2) Codex Memories under ~/.codex/memories/ (if enabled)
3) Chronicle-derived memories (if enabled)
4) Existing skills, subagents, and automations — reuse or extend, never duplicate

For each candidate:
- Count occurrences (need ≥2)
- Describe stable inputs/outputs and tools used
- State material benefit in one line
- Propose package type: skill | subagent | automation

Apply gates. Output a shortlist table, then create ONLY high-confidence items.
For subagents, remind me they require explicit parallel invocation.
Do not create low-confidence or one-off items.

Operational summary

ConcernPractical takeaway
Policy vs memoryMandatory rules → AGENTS.md; memories → local habits
False positives≥2 occurrences + stable I/O + benefit gate
Asset sprawlInventory existing skills before creating new ones
Subagent costUse for parallel read-heavy triage; ask explicitly
ChronicleHigh recall, higher injection and quota cost—pause when needed

Run the meta-prompt after a week of real project work with memories enabled; review the shortlist before letting Codex write new skill or agent files. The payoff is compounding: less repeated context in every thread, and specialised workers only where parallelism actually wins.

Research supplement

Web search was unavailable in this environment; no external sources could be verified. The sections below are left empty rather than citing unconfirmed URLs.

Categories
News

Adaptive Chunking for RAG: Per-Document Splitters Without Labelled QA

Most production RAG pipelines still run one chunker on every PDF—and wonder why retrieval misses answers that are clearly in the corpus. Adaptive Chunking (Ekimetrics, LREC 2026, MIT) treats splitting as a per-document decision: several strategies compete, five intrinsic metrics score them without labelled Q&A, and the winner is indexed—no retriever swap required.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    P[PDF or parsed doc] --> C1[Recursive 1100]
    P --> C2[Recursive 600]
    P --> C3[Page split]
    P --> C4[LLM regex split]
    C1 --> M[Five intrinsic metrics]
    C2 --> M
    C3 --> M
    C4 --> M
    M --> W[Best chunker per file]
    W --> E[Embed and index]
    E --> R[RAG query]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class P,R agent
    class C1,C2,C3,C4,E hook
    class M,W decision
PDF split strategies scored by intrinsic metrics before embedding

Why one global chunker hits a ceiling

Document types punish different splitters. Legal PDFs break on naive page splits (clauses span pages). Technical reports break on blind recursive token windows (sections and tables fragment). Sustainability and narrative filings can fail under semantic splits when size compliance collapses. You often only notice after weeks of retriever tuning—when the leak was at ingestion.

Adaptive Chunking’s thesis: measure chunk quality before downstream RAG eval, pick the best strategy per file, and only then embed.

Legal technical and sustainability PDFs need different splitters

Five label-free intrinsic metrics

MetricWhat it checks
Size Compliance (SC)Chunks within target token bounds
Intrachunk Cohesion (ICC)Sentences inside a chunk align with the chunk embedding
Contextual Coherence (DCC)Each chunk matches its surrounding context window
Block Integrity (BI)Paragraphs, tables, lists stay intact
Filtered Missing Reference Error (RC)Coreference chains (entity–pronoun) not split across chunk borders

Implementations live in adaptive_chunking/metrics.py; you can register custom scorers. Coreference scoring uses optional maverick-coref (non-commercial licence in the [coref] extra).

Default chunkers that compete

StrategyBehaviour
Recursive (1100 tokens)Split-then-merge with structural separators
Recursive (600 tokens)Finer granularity variant
Page splittingPage breaks + post-processing for size limits
LLM regexModel proposes document-specific regex boundaries
Custom callableAny splitter returning a chunk list

Measured gains on the CLAIR corpus (33 PDFs)

Evaluation spans legal, technical, and sustainability domains (~1.18M tokens). Reported downstream RAG results (Table 5, Wilcoxon p < 0.05 on retrieval completeness vs LangChain recursive baseline):

MetricAdaptive ChunkingLangChain recursivePage splitting
Retrieval Completeness67.758.159.1
Answer Correctness78.070.173.3
Answered queries65 / 9949 / 9949 / 99

Intrinsic means across domains hit 91.07% for the adaptive pick vs 88.62% for fixed recursive (Table 3). The arXiv abstract also reports answer correctness rising to ~72% from ~62–64% in an alternate experimental setup—same direction: better chunks, same model and prompts.

PDF ingestion backends

  • Docling — default open-source parser
  • PyMuPDF — lightweight local parse
  • Azure Document Intelligence — cloud layout OCR (optional)
  • Excel — supported via parsing extras

Quick start

git clone https://github.com/ekimetrics/adaptive-chunking.git
cd adaptive-chunking
pip install -e ".[parsing]"
python -m spacy download en_core_web_sm

from adaptive_chunking import chunk_files

chunks = chunk_files("path/to/pdfs/", chunk_size=600, chunk_overlap=50)
for c in chunks:
    print(c["doc_name"], c["chunk_index"], c["chunk_len"])

Paper reproduction ships 33 pre-parsed CLAIR JSON files under data/clair/ so you can rerun Tables 1–3 without re-parsing. Full RAG replication (--steps rag) needs OpenAI + GPU budget.

Context engineering, not prompt hacking

Context engineering in 2026 is largely what tokens enter the window—retrieved chunks dominate. Adaptive Chunking optimises the oldest step in that stack: segmentation. Pair it with hybrid BM25 + dense retrieval, cross-encoder reranking, and contextual embeddings (Anthropic’s contextual retrieval pattern) for a full ingestion-to-answer pipeline—but fix the silent splitter ceiling first if PDFs are heterogeneous.

Trade-offs

BenefitCost
No labelled QA to compare chunkersExtra compute at index time (multiple splits + metrics)
Modular metrics and splittersLLM-regex path needs OPENAI_API_KEY
Stronger retrieval completenessRAG eval step is API-heavy to reproduce Table 5

Summary

TakeawayDetail
ProblemOne chunker for all PDFs caps RAG quality
SolutionPer-document competition + 5 intrinsic scores
Projectekimetrics/adaptive-chunking — MIT, LREC 2026
Headline result65/99 vs 49/99 answered queries; retrieval completeness 67.7 vs 58.1
When to useMixed legal/technical/report PDF corpora before retriever sweeps

Research supplement

Categories
News

Understand Anything: Codebase Knowledge Graphs for Claude Code and AI Agents

Understand Anything is an MIT-licensed Claude Code plugin (and cross-platform installer) that runs a multi-agent analysis pipeline over your repository, writes a portable knowledge-graph.json, and serves an interactive dashboard your coding agent can query—turning “read the whole repo” into structured context engineering instead of one giant grep.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    R[Repository files] --> S[project-scanner]
    S --> F[file-analyzer parallel batches]
    F --> A[architecture-analyzer]
    A --> G[knowledge-graph.json]
    G --> D[understand-dashboard UI]
    G --> C[Agent slash commands]
    C --> Q[Claude Code Codex Antigravity]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class R,G agent
    class S,F,A hook
    class D,C,Q hook
Multi-agent scan to knowledge graph and agent dashboard

What problem it solves

Large codebases punish both humans and agents: flat file search returns snippets without call chains, layers, or business meaning. Understand Anything’s pitch—“graphs that teach, not graphs that impress”—means the output is meant for onboarding and agent grounding: nodes for files, functions, classes, and dependencies; optional domain view mapping code to business flows; guided tours ordered by dependency; and diff impact before you merge.

Multi-agent pipeline under /understand

AgentRole
project-scannerDiscover files; detect languages and frameworks (26+ file types including infra/docs)
file-analyzerExtract symbols, imports, edges; build graph nodes (parallel batches, up to 5 concurrent)
architecture-analyzerLabel layers (API, service, data, UI, utility, …)
tour-builderGenerate guided walkthroughs
graph-reviewerValidate completeness and referential integrity
domain-analyzerBusiness domains, flows, steps (/understand-domain)
article-analyzerWiki/knowledge-base entities and claims (/understand-knowledge)

Output lands in .understand-anything/knowledge-graph.json. The project recommends committing that JSON (with .gitattributes) so teammates skip re-running the full pipeline—useful for onboarding and PR review. Incremental runs re-analyse only changed files.

Comparison of agent context with and without codebase graph

Commands agents and humans share

CommandPurpose
/understandRun the scan and build the graph
/understand-dashboardOpen interactive graph UI (pan, zoom, search, layers)
/understand-chatNatural-language Q&A over the graph (“How does payment work?”)
/understand-diffImpact analysis for current changes
/understand-explainDeep-dive a file or symbol
/understand-onboardGenerate onboarding guide
/understand-domainBusiness-domain graph view
/understand-knowledgeKarpathy-style LLM wiki → knowledge graph

Platform support

Claude Code (native):

/plugin marketplace add Lum1104/Understand-Anything
/plugin install understand-anything
/understand
/understand-dashboard

Codex, Antigravity, Gemini CLI, OpenCode, Cursor, Copilot, Cline, KIMI, and others: one-line installer clones to ~/.understand-anything/repo and symlinks platform config—pass codex, antigravity, gemini, vscode, etc. Cursor and VS Code Copilot can auto-discover via bundled plugin manifests when the repo is present.

# macOS / Linux example
curl -fsSL https://raw.githubusercontent.com/Lum1104/Understand-Anything/main/install.sh | bash -s antigravity

Context engineering for coding agents

“Context engineering” here means giving the agent a stable, queryable map instead of rediscovering structure every session:

  • Structural graph — who calls whom, layers, dependency paths between components
  • Semantic search — fuzzy and meaning-based lookup across nodes
  • Domain overlay — auth flows, payment pipelines, user lifecycle as first-class views
  • Diff impact — ripple effects before commit
  • Portable artefact — JSON graph usable offline in the dashboard without re-calling the LLM for every pan/zoom

That complements agent harness features (Claude Code tools, MCP servers, compaction)—you still need permissions and tool discipline, but the agent spends fewer turns on orientation.

How it compares to other graph approaches

ApproachStrengthTrade-off
Understand AnythingPlugin + dashboard + domain tours; multi-platform installLLM cost for initial /understand scan
MCP graph servers (e.g. CodeGraphContext)Live graph DB, Cypher/call-chain queries in the IDEOps for DB backend; less bundled UI storytelling
Prompt-injected maps (e.g. codebase-graph)Always-on compressed map in system promptNo interactive domain layer; schema-focused

Summary

TakeawayDetail
ProductLum1104/Understand-Anything — MIT, ~21k+ GitHub stars
Core loopMulti-agent scan → knowledge-graph.json → dashboard + slash commands
Agent fitClaude Code, Codex, Antigravity, Cursor, Copilot, Gemini CLI, …
Best forOnboarding, architecture tours, diff impact, domain-aware exploration
Live tryunderstand-anything.com/demo

Research supplement

Web search was unavailable in this environment; no externally sourced claims could be verified during this session. The following points are flagged for editorial fact-checking before publication:

  • GitHub star count — The article cites "~21k+ GitHub stars" for Lum1104/Understand-Anything. This should be confirmed directly on the repository page, as star counts move quickly and the "+" notation may reflect rounding.
  • Platform installer compatibility — The claim that the installer supports Codex, Antigravity, Gemini CLI, OpenCode, Cursor, Copilot, Cline, and KIMI should be spot-checked against the repository's install.sh and README, as multi-platform support tables can lag behind actual implementation.
  • LLM cost on large repos — No token-count or dollar-cost benchmarks are given in the article. A useful addition would be a real-world scan cost example (e.g., tokens consumed for a 100k-line Python repo).
  • understand-anything.com/demo — The article links to a live demo. Editors should verify the demo is accessible and representative before the article is widely circulated.
Categories
News

Claude Code Architecture Explained: Six Harness Layers Beyond the LLM

Claude Code is not “a CLI that calls Claude.” It is an agentic harness: a Node.js runtime that wraps the Claude model with permissions, memory, tools, compaction, MCP integrations, subagents, and lifecycle hooks. The model reasons; the harness mediates every action. That split is why the product can feel magical while remaining debuggable once you map the layers.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    U[Your prompt] --> I[Input layer]
    I --> K[Knowledge layer]
    K --> L[Agent loop]
    L --> M[Claude model API]
    M --> E[Execution layer tools]
    E --> L
    L --> N[Integration MCP plugins]
    L --> O[Observability hooks]
    L --> A[Multi-agent subagents teams]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class U,M agent
    class I,K,L agent
    class E,N,O,A hook
Diagram of six Claude Code harness layers around central agent loop

Why the harness matters more than the model alone

Claude Code and Claude are separate software: the CLI runs locally (or in a managed surface), while inference runs on Anthropic’s Messages API. Each turn, the harness assembles system instructions, tool schemas, conversation history, CLAUDE.md, skills metadata, and permission state—then streams tool calls back into the loop until Claude stops requesting tools. Community architecture diagrams often show six layers around a central loop; the table below maps those layers to what Anthropic documents today.

LayerWhat it doesOfficial building blocks
1. InputSession boundary, trust, approval policySessions, permission modes, permission rules, project trust, layered .claude/settings.json
2. KnowledgePersistent instructions and context survivalCLAUDE.md, auto memory, Skills (SKILL.md), compaction, path-scoped .claude/rules/
3. ExecutionTool dispatch and the agentic loopBuilt-in tools (Read, Edit, Write, Bash, Grep, Glob, …), streaming turns, parallel read-only tools, prompt caching on stable prefixes
4. IntegrationExternal systems and packaged extensionsMCP servers, plugins, optional channels (event-driven surfaces)
5. Multi-agentDelegated work without blowing the main contextSubagents (Agent tool), agent teams (experimental), git worktree isolation
6. ObservabilityDeterministic control and audit pointsHooks (PreToolUse, PostToolUse, Stop, SubagentStart, PreCompact, …), checkpoints, session JSONL under ~/.claude/projects/
Circular diagram: gather context, take action, verify results

The central agent loop (the “dumb” part on purpose)

Anthropic describes a deliberately simple cycle: gather context → take action → verify results, repeated until the task completes. The Agent SDK exposes the same loop programmatically: each turn is one model response that may include tool calls; the harness executes tools, appends results, and continues until a text-only finish or a budget/turn limit.

PhaseTypical toolsHarness role
Gather contextRead, Grep, Glob, web fetch/searchInject CLAUDE.md, defer heavy MCP schemas via tool search
Take actionEdit, Write, Bash, MCP actionsPermission checks, sandbox, checkpoints before edits
VerifyRe-run tests, read linter output, ask userStop hooks, max_turns / max_budget_usd in SDK

Layer 1 — Input: sessions and permission gating

Before any model call, Claude Code establishes where it runs (project directory, worktree, cloud sandbox) and what it may do. Permission modes (default, acceptEdits, plan, auto, dontAsk, bypassPermissions) set the baseline; fine-grained rules like Bash(npm test) or Read(./src/**) match specific tool invocations. Project trust gates whether project-local hooks and MCP configs execute—important for supply-chain safety on unfamiliar repos.

Layer 2 — Knowledge: memory outside the weights

Harness intelligence lives largely here. CLAUDE.md reloads after compaction so persistent conventions survive long sessions. Auto memory lets Claude write project notes under ~/.claude/projects/. Skills package procedural expertise in SKILL.md files (Agent Skills open standard). When the context window fills, compaction summarizes older turns—older tool outputs cleared first—while CLAUDE.md and memory files stay authoritative if you put rules there rather than only in chat.

Layer 3 — Execution: tools, streaming, cost control

Tools are the agentic difference: each tool_use block becomes a real side effect (file read, patch, shell command). Read-only tools may run in parallel; mutating tools run sequentially to avoid races. Stable system prompts and tool definitions benefit from prompt caching (~90% cheaper cache reads on repeated prefixes per Anthropic’s caching docs—often cited as “10% cost” for cache hits in harness discussions). Undo paths use checkpoints (per-prompt file snapshots), not a separate “revert” tool in the public tool reference.

Layer 4 — Integration: MCP and plugins

MCP registers external capabilities (databases, browsers, ticketing systems) as named tools—typically prefixed mcp__server__action. MCP Tool Search loads schemas on demand so idle servers do not dominate context. Plugins bundle skills, hooks, subagents, and MCP config for repeatable team rollouts. This layer is how the same harness lands in finance, life sciences, or internal platforms without forking the core CLI.

Layer 5 — Multi-agent: subagents, teams, worktrees

Subagents run in isolated context windows and return summaries—so a research pass does not dump thousands of tokens into the parent thread. Built-in Explore and Plan subagents ship for codebase search and design-only passes. Agent teams (experimental; env flag required) coordinate multiple independent sessions with a shared task list—stronger separation than subagents, which report only upward. Worktree isolation (-w) keeps parallel agents on separate branches under .claude/worktrees/ to prevent file clashes.

Layer 6 — Observability: hooks and lifecycle events

Hooks fire at fixed lifecycle points—unlike skills, they are deterministic, not model-chosen. Use PreToolUse to block destructive commands, PostToolUse to format or log, PreCompact to archive transcripts before summarization, and SubagentStop to aggregate parallel results. Handler types include shell commands, HTTP webhooks, MCP tools, prompt judges, and experimental agent-based verifiers. Hooks run in your process and do not consume model context.

Agent SDK: same loop, programmatic control

The Claude Agent SDK exports the production loop for CI, services, and custom UIs: configure allowed_tools, max_turns, max_budget_usd, effort, and setting_sources to load project CLAUDE.md/skills/hooks. Result messages expose subtype (success, error_max_turns, error_max_budget_usd, …) plus per-session cost—making budget-aware agents an engineering task, not a hope.

# Minimal SDK pattern (Python)
from claude_agent_sdk import query, ClaudeAgentOptions

async for message in query(
    prompt="Fix failing auth tests",
    options=ClaudeAgentOptions(
        allowed_tools=["Read", "Edit", "Bash", "Grep", "Glob"],
        setting_sources=["project"],
        max_turns=30,
    ),
):
    ...  # handle AssistantMessage, ResultMessage

Practical takeaways for builders

  • Invest in the harness — tool descriptions, permissions, CLAUDE.md, and hooks often beat prompt tweaking.
  • Default to workflows when steps are known; use full agent loops only when exploration is required (same guidance as Anthropic’s effective agents post).
  • Think inside the context window — if you would be lost with only the last screenshot and tool output, the model will be too.
  • Measure turns and dollars — set SDK budgets early for production agents.

Summary

PointDetail
What Claude Code isAgentic harness around Claude models, not a monolithic “coding LLM”
Six-layer viewInput, knowledge, execution, integration, multi-agent, observability—loop in the middle
Core loopContext → action → verify; tools chain until done
ExtensionsSkills (expertise), MCP (connectivity), hooks (policy), subagents (isolation)
Where to read moreGlossary, tools reference, agent loop

Research supplement

The following sources from Anthropic's official documentation corroborate and extend the article's six-layer framework. No URLs were retrieved via live search in this session; the references below point to Anthropic's publicly documented pages that builders should consult to verify figures (such as prompt-caching cost ratios) and to track the experimental status of agent teams.

  • Prompt caching: Anthropic's caching documentation describes cache-read pricing as a fraction of standard input-token cost. The article's "~90% cheaper" figure aligns with the published cache-read multiplier, but readers should verify against the current Anthropic prompt caching docs before citing in cost models, as pricing tiers can change.
  • Claude Agent SDK: The SDK's ClaudeAgentOptions parameters (allowed_tools, max_turns, max_budget_usd, setting_sources) are documented in the Claude Code SDK reference. The ResultMessage subtypes (error_max_turns, error_max_budget_usd) are described there and are important for production error handling.
  • Hooks reference: The full list of lifecycle events (PreToolUse, PostToolUse, Stop, SubagentStop, PreCompact) and supported handler types (shell, HTTP webhook, MCP tool, prompt judge) is in the Claude Code hooks documentation.
  • Agent Skills open standard: SKILL.md packaging is described in the Claude Code skills documentation. The article's claim that this is an "open standard" is worth verifying — readers should check whether a formal spec has been published outside Anthropic's own docs.
  • MCP Tool Search and deferred schemas: The pattern of loading MCP tool schemas on demand (to avoid context saturation from idle servers) is described in Claude Code's MCP integration guidance at the MCP docs.
---

References

Categories
News

When Not to Build AI Agents: Anthropic’s Workflow-vs-Agent Playbook

Most production AI systems do not need another autonomous agent—they need a workflow with clear steps, tight tools, and measurable outcomes. That is the practical message from Anthropic’s agent-infrastructure team in the Building effective agents guide and the ~14-minute summit talk How We Build Effective Agents: start simple, add agency only when flexibility outweighs latency, cost, and error compounding.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    Q[New AI feature] --> T{Steps predictable?}
    T -->|Yes| W[Workflow fixed code path]
    T -->|No| A{Open-ended + trusted tools?}
    A -->|Yes| G[Agent model loop]
    A -->|No| W
    W --> N[Nodes: route chain parallel evaluate]
    G --> E[Environment + tools + prompt]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class Q,G,E agent
    class W,N hook
    class T,A decision
Ladder: single LLM call, workflow, agent

Task, workflow, or agent?

Anthropic separates agentic systems into two architectures. Workflows orchestrate LLM and tool calls through predefined code—you own the control flow. Agents let the model choose its next tool call from environment feedback—you own the goal and guardrails, not every branch.

PatternWho controls the pathBest when
Single augmented LLM callDeveloper prompt + retrieval/toolsOne-shot Q&A, classification, draft
Prompt chainingFixed sequence of callsDecomposable steps (outline → write → check)
RoutingClassifier picks specialist branchDistinct ticket types or model tiers
ParallelizationSectioning or voting across callsGuardrails, multi-aspect review
Orchestrator–workersCentral model delegates subtasksUnpredictable sub-steps (multi-file code)
Evaluator–optimizerGenerate/critique loopClear rubric and iterative refinement
AgentModel loop with toolsOpen-ended goals, verifiable feedback (tests, env state)

Why “don’t build agents for everything”

Agents trade predictability for flexibility. Each extra autonomous turn adds latency, token cost, and the chance that an early mistake propagates. Anthropic’s guidance: use the simplest pattern that passes evaluation—often a workflow or even one well-tooled LLM call—and reserve agents for cases where you cannot hardcode the path but can still verify progress (customer support with tool-backed actions, coding agents with tests, computer-use demos with screenshots as ground truth).

Workflow winsAgent wins
Repeatable business processUnknown step count up front
Auditability and fixed SLAsRich tool feedback each turn
Lower cost per requestSWE-bench-style multi-file edits
“Autonomy” is not the goal—reliability isSandboxed testing + stop conditions (max iterations)

The three-component agent model

When an agent is justified, Anthropic collapses design to three decisions—everything else (caching trajectories, parallel tool calls, UI progress) is optimisation after behaviour works:

  • Environment — the world the agent sees (repo, browser, ticket queue).
  • Tools — documented, poka-yoke interfaces; spend more time here than on clever prompts (SWE-bench work reportedly optimised tools before prompts).
  • System prompt — goals, constraints, and stop rules.

Multiple shipped agents (coding, computer use, search) share the same backbone; only environment, toolset, and prompt change.

Environment, tools, system prompt into model loop

Think like your agent

The summit talk’s most actionable habit: reason inside the agent’s context window (~10–20k tokens per step), not your human priors. A computer-use agent may only receive a static screenshot and a terse task description—then “click” while inference runs, equivalent to using the machine blind for several seconds. If that feels fragile, the fix is richer observations (resolution, suggested actions, guardrails), not more orchestration layers.

Practical checks: paste the system prompt into the model and ask what is ambiguous; feed trajectories and ask why a step failed; run a full task while limiting yourself to the agent’s observations. Anthropic also stresses agent–computer interfaces: tool formats should match what models have seen (absolute paths, minimal JSON escaping overhead, examples in tool definitions).

Frameworks, MCP, and Skills (adjacent layers)

The same organisation recommends starting with direct LLM APIs when possible—frameworks hide prompts and encourage over-engineering. For connectivity, Model Context Protocol (MCP) standardises third-party tools. Agent Skills (portable SKILL.md packages) address a different gap: domain expertise on top of a general agent—procedural knowledge loaded on demand rather than spinning up a new bespoke agent per vertical. Skills complement agents; they do not replace the workflow-first discipline.

Open problems the team is watching

  • Budget-aware agents — enforce caps on time, money, or tokens (workflows already offer this; agents lag).
  • Self-evolving tools — models that refine tool descriptions/ergonomics from usage.
  • Multi-agent coordination — moving beyond rigid user/assistant turns toward asynchronous agent-to-agent roles without blowing up context.

Summary

TakeawayAction
Label the buildDeclare task vs workflow vs agent before procurement
Default simpleComposable workflow patterns from the Anthropic cookbook
Agent minimalismEnvironment + tools + prompt first; optimise later
Debug empatheticallyInspect trajectories as the model sees them
Production barSandbox tests, stop conditions, human checkpoints—not novelty autonomy

Research supplement

Primary source: The article synthesizes Anthropic's publicly available Building effective agents guide (December 2024), which formally defines the six workflow patterns and the agent architecture discussed above. The guide explicitly states: "When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed."

Model Context Protocol: MCP, cited in the article as a connectivity standard, was announced by Anthropic in November 2024. The official announcement describes it as an open standard enabling LLMs to connect to external data sources and tools through a unified interface, with early adopters including Block and Apollo at launch.

SWE-bench benchmark: The article references SWE-bench-style multi-file editing as a canonical agent use case. The SWE-bench leaderboard tracks coding-agent performance on resolving real GitHub issues; it is the primary benchmark referenced when Anthropic discusses tool optimization for coding agents.

---

References