Categories
News

Build Your Own Agent Harness: Mike Piccolo on Composable iii Workers

Mike Piccolo’s X article argues that most teams do not truly build an agent harness—they adopt a monolithic framework (LangChain, LangGraph, OpenAI Agents SDK, Anthropic SDK, AutoGen, and similar stacks) and later pay when one bundled layer (policy, credentials, approvals, budgets) no longer fits. His alternative: decompose the harness into replaceable workers on the iii engine, connected by one primitive—iii.trigger()—so “build your own” means swap workers, not fork a framework.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  CLIENT[Client chat or CLI] --> HARNESS[harness meta-worker]
  HARNESS --> RUN[run start on turn-orchestrator]
  RUN --> PROV[Provisioning sandbox and skills]
  PROV --> STREAM[Provider stream tokens]
  STREAM --> TOOLS[Tool calls via policy gate]
  TOOLS --> APPROVE{Approval needed?}
  APPROVE -->|yes| GATE[approval-gate worker]
  APPROVE -->|no| LOOP[Steering and next turn]
  GATE --> LOOP
  LOOP --> DONE[Session end and traces]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class HARNESS agent
  class RUN agent
  class STREAM hook
  class TOOLS hook
  class APPROVE decision

Figure: Adopting one agent framework versus composing swappable harness workers. Frameworks bundle loop, tools, policy, and memory; iii treats each concern as an independent worker on one engine bus.

Adopt vs compose

Piccolo’s opening line: teams pick the loop, tools, memory, and orchestration as one import decision. When the bundled policy engine or credential store does not match production, you fork, fight, or work around it; you cannot replace a single layer. Pi-style agent packages improve modularity, he notes, but integrating them still means “add another service and wire it to everything else.” iii’s bet is that provider routing, credential vaults, policy, approvals, model catalogues, session trees, budget tracking, hook fanout, and the durable turn loop should be independent workers on the same bus, interoperable with queues, HTTP APIs, streaming, and even browser workers.

Figure: Flow from harness trigger through provisioning, streaming, policy, tools, and session finish. One turn walks through orchestration, provider streaming, fail-closed policy, optional human approval, then steering or stop.

Fifteen jobs every production harness must cover

| # | Job | Why it matters |
| --- | --- | --- |
| 1 | Accept and persist turn requests | Durable entry from UI, CLI, or API |
| 2 | Resolve provider credentials | Secrets without hard-coding keys in prompts |
| 3 | Model capability lookup | Vision, tools, streaming, context limits |
| 4 | Per-turn state machine | Provision, stream, tools, steer, teardown |
| 5 | Skill bodies for functions | Schemas, errors, usage notes per tool |
| 6 | System prompt assembly | Mode, identity, working dir, skill index |
| 7 | Token streaming to client | Live UX while the model thinks |
| 8 | Policy check per tool call | Allow, deny, or needs approval |
| 9 | Human-in-the-loop routing | Park calls; resume the right session |
| 10 | LLM budget tracking | Per-workspace or per-agent spend caps |
| 11 | Before/after tool hooks | Logging, redaction, side effects |
| 12 | Branching session tree | Forks and resumes without losing history |
| 13 | Context compaction | Stay inside the window without silent amnesia |
| 14 | UI event stream | Subscribers see agent progress |
| 15 | End-to-end OpenTelemetry | One trace graph across all workers |

Frameworks ship one version of each concern. Piccolo’s point: a year in, you often need a different policy or approval surface—and replacing it inside a monolith means replacing the harness. On iii, each job maps to a worker on workers.iii.dev, installable with iii worker add, swappable in any language with an SDK.

The production worker stack

The open-source bundle at iii-hq/workers/harness (Apache-2.0, v0.4.7 at time of writing) composes workers including:

| Worker | Role in one line |
| --- | --- |
| turn-orchestrator | Durable FSM: provision → stream → tools → steer → finish |
| harness (meta) | Entry harness::trigger, policy, OTel baggage seeding |
| approval-gate | approval::resolve writes decisions to iii state |
| session | Branching session tree persistence |
| llm-budget | budget::record spend tracking |
| auth-credentials | auth::get_token for providers |
| models-catalog | Static or pluggable model metadata |
| provider-* | Anthropic, OpenAI, Kimi, LM Studio, llama.cpp streams |
| hook-fanout | hook-fanout::publish_collect for tool hooks |
| context-compaction | Compaction on agent::turn_end |
| web | Console / UI surfaces |

Piccolo’s article narrative describes eleven core workers on one engine; the monorepo also ships additional provider and config workers that you enable per deployment. Each worker process opens a WebSocket to the engine, registers functions and triggers, and can run alone (pnpm dev:turn-orchestrator) or via the composite pnpm start:all entry point.

How one agent turn runs

At a high level (from Piccolo’s walkthrough):

  • Trigger: Client POSTs harness::trigger with session_id, message_id, payload; meta-worker forwards to run::start with OTel baggage on session/message IDs.
  • Orchestrator: turn-orchestrator persists the run, seeds turn_state, and drives a durable FIFO state machine (terminal states stopped and failed).
  • Provisioning: Optional iii-sandbox microVM; directory::skills::download; system prompt from mode (plan / ask / agent), identity preamble, and default skills—or a caller-supplied system_prompt override.
  • Streaming: Provider worker streams SSE into an iii channel; orchestrator emits message_update on agent::events.
  • Tools: Each call goes through dispatchWithHook; policy::check_permissions (5s timeout, fail-closed) returns allow, deny, or needs_approval.
  • Approvals: One reactive turn::on_approval trigger wakes the session when approval::resolve writes state—no per-call resume RPCs.
  • Loop: steering_check continues, stops, or hits max_turns; finishSession() frees sandbox and emits agent_end.

Latency details he highlights: hook fanout skips a wait of roughly 500 ms when no subscriber exists; teardown is inlined to avoid an extra queue hop; compaction listens on turn boundaries, not on every event.
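The fail-closed policy step in the walkthrough can be illustrated with a small sketch. This is not the iii SDK: `check_fail_closed` and its arguments are hypothetical names, and the policy worker is modelled as a plain Python callable. The only behaviour taken from the article is the contract itself: three possible verdicts, a 5-second timeout, and deny on anything that goes wrong.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

ALLOWED = {"allow", "deny", "needs_approval"}

def check_fail_closed(policy_fn, call, timeout_s=5.0):
    """Run policy_fn(call); anything other than a clean verdict becomes 'deny'."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        verdict = pool.submit(policy_fn, call).result(timeout=timeout_s)
    except FutureTimeout:
        return "deny"               # policy worker too slow: fail closed
    except Exception:
        return "deny"               # policy worker crashed: fail closed
    finally:
        pool.shutdown(wait=False)   # do not block the turn on a stuck worker
    # An unknown verdict string is also treated as a denial.
    return verdict if verdict in ALLOWED else "deny"

def slow_policy(call):
    time.sleep(0.3)                 # simulates a policy worker that hangs
    return "allow"

print(check_fail_closed(lambda call: "allow", {"tool": "read"}))
print(check_fail_closed(slow_policy, {"tool": "bash"}, timeout_s=0.05))
```

The design point is that the tool dispatcher never has to reason about why the policy worker failed; every failure mode collapses into the safest verdict.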

Build your own = register the same function IDs

Replacement examples from the article—each keeps the rest of the stack on the bus:

| Goal | What you swap | Wire contract (examples) |
| --- | --- | --- |
| Live model catalogue | models-catalog | models::list, models::get, models::supports |
| New LLM provider | Add provider-* worker | provider::name::stream, budget::record |
| Private skill store | Custom directory worker | directory::skills::get, directory::skills::list |
| Custom system prompt | Pass-through on run::start | Optional system_prompt field |
| Slack approvals | New worker calling | approval::resolve (gate unchanged) |
| OPA / Cedar policy | Custom policy worker | policy::check_permissions |
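A minimal sketch of why "register the same function ID" makes a worker swappable. The `Bus` class below is a hypothetical in-process stand-in, not the iii engine or its SDK; `register`, `trigger`, and both policy functions are names invented for illustration. Only the function ID `policy::check_permissions` and the verdict strings come from the article.

```python
from typing import Any, Callable, Dict

Handler = Callable[[dict], Any]

class Bus:
    """Hypothetical in-process stand-in for an engine bus; not the iii SDK."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, function_id: str, handler: Handler) -> None:
        # Last registration wins, which is what makes a worker swappable.
        self._handlers[function_id] = handler

    def trigger(self, function_id: str, payload: dict) -> Any:
        if function_id not in self._handlers:
            raise LookupError(f"no worker registered for {function_id}")
        return self._handlers[function_id](payload)

def default_policy(payload: dict) -> str:
    return "allow"

def strict_policy(payload: dict) -> str:
    # Illustrative rule only: shell tools need a human.
    return "needs_approval" if payload.get("tool") == "bash" else "allow"

def run_tool(bus: Bus, call: dict) -> str:
    # The caller knows only the function ID, never which worker answers.
    return bus.trigger("policy::check_permissions", call)

bus = Bus()
bus.register("policy::check_permissions", default_policy)
before = run_tool(bus, {"tool": "bash"})
bus.register("policy::check_permissions", strict_policy)  # swap the worker
after = run_tool(bus, {"tool": "bash"})
print(before, after)  # prints "allow needs_approval"
```

`run_tool` is unchanged across the swap, which is the property the replacement examples rely on.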

Thin vs thick: a config slider, not a rewrite

Piccolo reframes the classic “Anthropic thin loop vs LangGraph DAG” debate: on a worker bus, thin might be orchestrator + one provider + auth + minimal meta-worker (internal research, high trust). Thick adds approvals, budgets, hook fanout, compaction, custom policy, and Slack-driven approval surfaces (customer-facing, auditable spend). Moving along that slider is adding or removing workers in config.yaml—same protocol, same trace shape—not rewriting the harness.
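The slider can be pictured as a difference between two worker lists. The sketch below is an assumption-heavy illustration: the article does not show the config.yaml format, so the stacks are plain Python lists, `plan_move` is a hypothetical helper, and `provider-anthropic` is a made-up concrete name for one provider-* worker. The other worker names come from the worker stack described earlier.

```python
# Thin stack: orchestrator, one provider, auth, and the minimal meta-worker.
THIN = ["turn-orchestrator", "harness", "auth-credentials", "provider-anthropic"]

# Thick stack: the thin stack plus the governance workers.
THICK = THIN + ["approval-gate", "llm-budget", "hook-fanout", "context-compaction"]

def plan_move(current, target):
    """Return (workers_to_add, workers_to_remove) to move between two stacks."""
    to_add = [w for w in target if w not in current]
    to_remove = [w for w in current if w not in target]
    return to_add, to_remove

print(plan_move(THIN, THICK))
print(plan_move(THICK, THIN))
```

Moving in either direction only adds or removes entries; nothing already in the stack is rewritten.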

He cites a recent turn-orchestrator refactor (11 → 7 FSM states, reactive approvals, inlined teardown) where every neighbouring worker stayed unchanged because contracts are bus-level function IDs—a property monolithic frameworks struggle to offer.

How this relates to Motia and other harnesses

Piccolo also founded Motia (event-driven agent steps in TypeScript/Python/Ruby). The X piece is specifically about iii as substrate: Workers, Triggers, and Functions as the three primitives (iii engine). That is a different axis from product harnesses such as Claude Code dynamic workflows or Pi subagents—those optimise the agent loop inside a product; iii generalises the platform layer so harness concerns are ordinary workers your business logic already uses.

Try it locally

# Clone the harness bundle (from Piccolo's article)
git clone https://github.com/iii-hq/workers.git
cd workers/harness
pnpm install && pnpm build
pnpm start:all   # composite stack against a running iii engine

# Docs: https://iii.dev/docs
# Registry: https://workers.iii.dev

Summary snapshot

| Topic | Piccolo / iii position |
| --- | --- |
| Framework adoption | One-shot import; hard to swap policy, auth, or approvals later |
| Composition unit | Worker on shared engine bus |
| Integration primitive | iii.trigger() and registered function IDs |
| Policy default | Fail-closed; 5s timeout → deny |
| Observability | OTel on session, message, function IDs auto-wrapped |
| “Build your own” | Install/swap workers; write missing ones |
| Open source | iii.dev/docs, iii-hq/workers |

Piccolo’s closing bet: a harness is not a product you install—it is the set of jobs your system must perform for durable, safe, observable agents. When the substrate composes those jobs as workers, the harness becomes exactly the shape your organisation needs, with the same trace from registry workers and custom ones alike.

Research supplement

The notes below list the primary sources for this article; the details they point to should be confirmed by direct inspection:

  • iii engine (github.com/iii-hq/iii) — open-source runtime for iii workers; full README and architecture details require direct inspection.
  • Harness bundle (github.com/iii-hq/workers/tree/main/harness) — reference scaffolding for building a custom agent harness on iii; implementation details require direct inspection.
  • iii documentation (iii.dev/docs) — canonical reference for worker model, engine API, and harness design.
  • Worker registry (workers.iii.dev) — public catalog; worker catalog depth and quality require live inspection to assess.

No reputable third-party sources corroborating iii's production maturity, performance characteristics, or community size were found.


Agent Harnesses Compared: Claude Code Workflows, Pi Subagents, and Atomic

André Lindenberg’s point is that Claude Code’s dynamic workflows made a pattern everyone was already building suddenly obvious: let agents explore freely, but keep the process inspectable—parallel workers, review loops, saved chains, artifacts, and human gates—not just another chat transcript.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  U[You set the goal] --> CC[Claude Code dynamic workflows]
  U --> PI[Pi subagents and chains]
  U --> AT[Atomic TypeScript workflows]
  CC --> OUT[Inspectable run]
  PI --> OUT
  AT --> OUT
  OUT --> HIL{"Human gate when needed?"}
  HIL -->|yes| MERGE[Ship or merge]
  HIL -->|no| MERGE

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class CC agent
  class PI hook
  class AT hook
  class HIL decision

Figure: Same foundation model with or without an inspectable harness around it. A harness adds visible stages, parallel workers, verification, and optional human approval, not just another reply.

What changed in May 2026

Anthropic shipped dynamic workflows in Claude Code on 28 May 2026: Claude writes orchestration scripts that fan out tens to hundreds of parallel subagents in one session, verifies findings before surfacing them, and can resume long jobs after interruption. Lindenberg’s post argues Pi users already had much of that surface area—pi-subagents for chains, parallel review, background runs, forked context, git worktrees, artifacts, and saved .chain.md flows—while community extensions add JS-style orchestration (pi-dynamic-workflows) and Atomic encodes TypeScript pipelines with review gates, human-in-the-loop (HIL), artifacts, and provider choice.

Figure: Built-in dynamic workflows, Pi extensions, and TypeScript workflow SDK compared. Vendor orchestration, Pi chains and JS workflows, and Atomic pipelines solve the same inspectability problem with different trade-offs.

Claude Code dynamic workflows

| Capability | What it does |
| --- | --- |
| Parallel fan-out | Subagents tackle subtasks concurrently; results checked before merge |
| Adversarial review | Independent attempts and refutation loops until answers converge |
| Long-running jobs | Progress persisted; interrupted runs resume instead of restarting |
| Entry points | Ask for a workflow explicitly, or enable ultracode so Claude chooses when to orchestrate |
| Cost reality | Meaningfully higher token use than a normal session; scoped pilots recommended |

Documented use cases include codebase-wide bug hunts, large migrations, and “check twice” work where a wrong answer is expensive. Internal examples cited by Anthropic include parallel porting and profiler-guided optimisation at very large scale—work that previously sat in quarterly planning rather than a single agent session.
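The fan-out and verify-before-surfacing pattern can be sketched in a few lines. This is a generic illustration, not Claude Code's implementation: `subagent`, `verify`, and `fan_out` are hypothetical stand-ins, and the "evidence" rule is invented so the example has something to check.

```python
import asyncio

async def subagent(task: str) -> dict:
    # Stand-in for a real subagent call; returns a finding plus an evidence flag.
    await asyncio.sleep(0)
    return {"task": task, "finding": f"checked {task}", "evidence": task.endswith(".py")}

async def verify(result: dict) -> bool:
    # Stand-in verifier: only findings that carry evidence are surfaced.
    return bool(result["evidence"])

async def fan_out(tasks, limit=4):
    sem = asyncio.Semaphore(limit)  # cap concurrency to bound token spend

    async def guarded(task):
        async with sem:
            return await subagent(task)

    results = await asyncio.gather(*(guarded(t) for t in tasks))
    checks = await asyncio.gather(*(verify(r) for r in results))
    # Unverified findings are dropped instead of being merged.
    return [r for r, ok in zip(results, checks) if ok]

verified = asyncio.run(fan_out(["a.py", "b.md", "c.py"]))
print([r["task"] for r in verified])  # prints ['a.py', 'c.py']
```

The semaphore is the cost control: raising `limit` widens the fan-out, and with it the token bill.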

Pi subagents: chains, parallelism, and artifacts

pi-subagents is a Pi extension (install: pi install npm:pi-subagents) that gives the parent Pi session a delegation tool. You ask in plain language—“run parallel reviewers on this diff”—and Pi spawns focused child sessions with builtins such as scout, planner, worker, reviewer, and oracle.

| Feature | Why it matters for harness design |
| --- | --- |
| Chains and parallel groups | /chain scout -> planner -> worker or parallel reviewers with distinct angles |
| Saved .chain.md / .chain.json | Reusable workflows under .pi/chains/: versioned process, not one-off prompts |
| Background runs | --bg keeps children working; parent gets completion notifications |
| Forked context | --fork branches session state so children do not inherit noisy parent history |
| Worktrees and artifacts | Isolated git worktrees and file outputs (output=context.md) make handoffs inspectable |
| Structured fan-out | .chain.json can expand N tasks from prior structured output (with maxItems caps) |

The README’s recommended implementation loop is explicit: clarify → planner → worker → fresh reviewers → worker. That is the same “explore then verify” shape Lindenberg highlights—implemented as parent-agent guidance rather than a hidden runtime mode.
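A chain with a capped fan-out can be modelled as ordinary function composition. The sketch below is not how pi-subagents is implemented; `run_chain` and `expand` are hypothetical helpers, and the three lambda "agents" are toys. It only illustrates two ideas from the section: every handoff is kept as an artifact, and a maxItems-style cap bounds how many tasks a step may produce.

```python
def run_chain(steps, initial):
    """Run steps in order, feeding each step's output to the next."""
    artifacts = []
    value = initial
    for name, fn in steps:
        value = fn(value)
        artifacts.append((name, value))  # keep every handoff inspectable
    return value, artifacts

def expand(items, max_items):
    """Cap fan-out the way a maxItems limit would."""
    return items[:max_items]

steps = [
    ("scout", lambda goal: [f"{goal}: file{i}" for i in range(5)]),
    ("planner", lambda found: expand(found, max_items=3)),
    ("worker", lambda tasks: [t.upper() for t in tasks]),
]

final, artifacts = run_chain(steps, "audit")
print(final)
print([name for name, _ in artifacts])
```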

pi-dynamic-workflows: JS orchestration in Pi

pi-dynamic-workflows adds a workflow tool inspired directly by Claude Code’s announcement. The model writes a small sandboxed JavaScript script using globals such as agent(), parallel(), pipeline(), and phase(); each agent() call spawns an in-memory Pi subagent with normal coding tools and optional JSON Schema via structured output.

export const meta = { name: 'audit_repo', phases: [{ title: 'Scan' }, { title: 'Review' }] }
phase('Scan')
const inventory = await agent('Map modules and risks.', { label: 'inventory' })
phase('Review')
return await agent('Summarise findings:\n' + inventory, { label: 'report' })

Determinism guardrails block Date.now(), Math.random(), and network APIs inside scripts so meta stays parseable. Status is still a prototype: no persisted or resumable runs yet—unlike Claude Code’s production dynamic workflows.

Atomic: TypeScript harness with HIL and provenance

Atomic targets teams that want the agent’s native tool loop plus an outer pipeline encoded in TypeScript—review → CI → PR → Slack notify → human approval → merge. The Workflow SDK (@bastani/atomic/workflows) uses familiar control flow (Promise.all, if, stages) and can run across Claude Code, OpenCode, or GitHub Copilot CLI with a flag change.

| Atomic emphasis | Contrast with chat-only agents |
| --- | --- |
| Review gates | Execution pauses until a human approves via tools like AskUserQuestion |
| Provider choice | Same workflow file, different agent backend |
| Provenance graph | Records goals, explorations, commitments, and verifications per turn |
| Sub-agents and skills | Dispatch specialised workers (docs cite 12 sub-agents and 55 skills in the bundle) |

Which harness when?

| Your priority | Sensible starting point |
| --- | --- |
| Zero setup inside Anthropic’s CLI; huge parallel audits | Claude Code dynamic workflows (watch token budget) |
| Already on Pi; want saved chains and builtin reviewer/scout roles | pi-subagents + .chain.md in the repo |
| Pi user wanting Claude-style scripted fan-out today | pi-dynamic-workflows (accept prototype limits) |
| Team process as versioned TypeScript with mandatory human merge gates | Atomic Workflow SDK |

Operations: merge conflicts and cost

Thread comments on Lindenberg’s post raise two practical constraints.

Merge conflicts: parallelism does not magically resolve conflicts when multiple agents touch the same tree. The mitigations are the same as for human teams: git worktrees per agent (pi-subagents supports this), path-scoped goals, file reservations in coordination extensions like pi-messenger, or serialising writes behind a single worker stage.

Cost: parallel subagent meshes billed as API usage can spike, especially after provider pricing changes, so budgets, maxItems caps on fan-out, and cheaper judge/reviewer models matter as much as model intelligence.
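The file-reservation mitigation is simple enough to sketch. The `Reservations` class below is hypothetical and is not the pi-messenger API; it only shows the idea that an agent must claim its paths before writing, and that a claim is refused as a whole if any path already belongs to another agent.

```python
class Reservations:
    """Hypothetical path-reservation table; not the pi-messenger API."""

    def __init__(self):
        self._owner = {}

    def claim(self, agent, paths):
        # A path conflicts only if a different agent already holds it.
        taken = [p for p in paths if self._owner.get(p, agent) != agent]
        if taken:
            return False, taken  # refuse the whole claim on any conflict
        for p in paths:
            self._owner[p] = agent
        return True, []

    def release(self, agent):
        self._owner = {p: a for p, a in self._owner.items() if a != agent}

table = Reservations()
print(table.claim("worker-a", ["src/auth.py", "src/db.py"]))
print(table.claim("worker-b", ["src/db.py", "src/ui.py"]))
table.release("worker-a")
print(table.claim("worker-b", ["src/db.py", "src/ui.py"]))
```

Refusing the whole claim keeps an agent from starting work it can only half finish.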

Performance and maturity snapshot

| Stack | Parallelism | Persist / resume | Human gates | Process as code |
| --- | --- | --- | --- | --- |
| Claude Code dynamic workflows | Very high (100+ subagents cited) | Yes (production preview) | Confirm before run; org policies | Generated orchestration scripts |
| pi-subagents | Chains, parallel groups, JSON fan-out | Background jobs; async status | Via prompts + optional pi-intercom | .chain.md / .chain.json |
| pi-dynamic-workflows | parallel(), pipeline() | Not yet | Parent session only | JS workflow scripts in VM |
| Atomic | ctx.parallel stages | Workflow run control + attestations | First-class HIL stages | TypeScript defineWorkflow |

The through-line from Lindenberg’s diagram and copy: workers change; the primitive stays the same—define inspectable stages, fan out where safe, verify before you trust, and keep humans on the hooks that matter. Whether that primitive ships inside Claude Code, as Pi extensions, or as Atomic TypeScript is an integration choice; the engineering discipline is shared.

Research supplement

Note: The references below are based on the reference URLs provided by the author. Verify them against the live sources before relying on them.

  • Introducing dynamic workflows in Claude Code: Anthropic's official announcement of the Workflow tool, covering the agent(), parallel(), pipeline(), and phase() API, the resume/journal system, token budget controls, and worktree isolation. Primary source for all Claude Code Workflow claims in this post.
  • pi-subagents (nicobailon): Pi extension that gives the parent session a delegation tool for chains, parallel groups, and background runs. Verify the current feature set against the repo README.
  • pi-dynamic-workflows (Michaelliv): Community port of Claude Code's dynamic workflow concept to Pi. Verify behavioral fidelity and API compatibility against the native Workflow tool before adopting in production.
  • Atomic (flora131): TypeScript workflow harness with review gates and human-in-the-loop stages. Verify the workflow model, supported agent backends, and any Claude Code integration points against the current README.


Codex Builds, Claude Code Reviews, Hermes Verifies: The /goal Workflow for Agentic Coding

Shubham Saboo’s workflow puts Codex on build, Claude Code on review, and Hermes Agent on verification—so no worker can claim “tests passed” without the shell proving it. The shared primitive is /goal: you define what done means once; the agent loops until a judge agrees—or budget runs out.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  USER[You define done criteria] --> GOAL["Standing goal objective"]
  GOAL --> BUILD[Codex builds]
  BUILD --> REVIEW[Claude Code reviews]
  REVIEW --> VERIFY[Hermes runs shell checks]
  VERIFY --> JUDGE{"Judge done or continue?"}
  JUDGE -->|continue| GOAL
  JUDGE -->|done| SHIP[Merge / ship]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class BUILD agent
  class REVIEW agent
  class VERIFY hook
  class JUDGE decision

Why agents lie about “done”

A normal prompt optimises for the next reply. You read it, steer, repeat. Agents routinely report success without evidence: builds that never ran, tests that were written but not executed, green checkmarks in prose only. Saboo’s fix is structural—measurable end states plus a verifier that does not trust self-report. On a Mac Mini orchestrator, Hermes re-runs npm test, cargo build, or whatever your goal specifies before accepting completion.

Prompt vs /goal

| Chat prompt | /goal |
| --- | --- |
| One turn unless you say “keep going” | Standing objective across many turns |
| You are the loop driver | Continuation loop + judge after each turn |
| “Done” = model says so | “Done” = criteria you wrote + judge verdict |
| Stops when the reply ends | Stops when achieved, blocked, cleared, or turn budget exhausted |

The pattern shipped in OpenAI Codex CLI 0.128.0 (Eric Traut; see Follow a goal) and was adapted independently in Hermes Agent—same Ralph-loop idea, different persistence and gateway plumbing.
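Stripped of persistence and gateway plumbing, the continuation loop is small. The sketch below is a generic illustration, not the Codex or Hermes implementation: `run_goal`, `step`, and `judge` are hypothetical names, and the toy worker simply fixes one failing test per turn. The verdict shape (`done` plus `reason`) and the default of 20 turns follow the Hermes description in this article.

```python
def run_goal(step, judge, max_turns=20):
    """Loop until the judge says done or the turn budget runs out."""
    for turn in range(1, max_turns + 1):
        output = step(turn)
        verdict = judge(output)
        if verdict["done"]:
            return {"status": "achieved", "turns": turn, "reason": verdict["reason"]}
    return {"status": "budget_exhausted", "turns": max_turns, "reason": "turn budget spent"}

# Toy worker: pretends one more test passes on every turn.
def step(turn):
    return {"failing_tests": max(0, 3 - turn)}

# Toy judge: done only when the measurable end state holds.
def judge(output):
    left = output["failing_tests"]
    return {"done": left == 0, "reason": f"{left} failing tests"}

print(run_goal(step, judge))
print(run_goal(step, judge, max_turns=2)["status"])  # prints budget_exhausted
```

The worker never decides that it is finished; only the judge or the budget can end the loop.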

Anatomy of a good goal (four parts)

| Part | Write it as | Example |
| --- | --- | --- |
| Task | Imperative objective | “Migrate all /api/v1 calls in src/ to v2.” |
| Measurable end state | Binary, shell-checkable checks | npm test exits 0; rg '/api/v1' src/ returns no matches; git status clean |
| Constraints | Scope and non-goals | Only src/ and tests/; no public API breaks |
| Stop conditions | Budget and escape hatch | Max 20 iterations; if blocked, write BLOCKERS.md |

Saboo’s cheat sheet adds a verifier checklist: if you cannot reproduce PASS from a terminal command, treat the agent’s narrative as unverified. That turns /goal from a longer prompt into a contract.
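A shell-first verifier amounts to re-running the commands and trusting only exit codes. The sketch below is an illustration under stated assumptions, not Hermes code: `verify` is a hypothetical helper, and the two demo commands invoke the current Python interpreter so the example runs anywhere. In practice the list would hold the goal's real checks, such as `["npm", "test"]`.

```python
import subprocess
import sys

def verify(commands):
    """Run each command and report PASS only if every one exits 0."""
    report = []
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        report.append({
            "cmd": cmd,
            "exit": proc.returncode,
            "tail": proc.stdout[-200:],  # keep the end of the real output as evidence
        })
    passed = all(entry["exit"] == 0 for entry in report)
    return passed, report

# Demo commands: one that succeeds and one that fails.
ok_cmd = [sys.executable, "-c", "print('tests ran')"]
bad_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]

print(verify([ok_cmd])[0])
print(verify([ok_cmd, bad_cmd])[0])
```

Whatever the agent wrote in chat plays no part in the result; only the process exit codes do.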

Three tools, one primitive

Codex — builder

Use /goal for long-horizon implementation: migrations, multi-file refactors, eval loops. Enable features.goals = true in Codex config.toml if the slash command is missing. Codex injects continuation and budget prompts each turn (goals/continuation.md, goals/budget_limit.md per release notes). Pair with codex-plugin-cc inside Claude Code for /codex:review and /codex:rescue without leaving the session.

Claude Code — reviewer

Run /goal on review-shaped work: “Refactor module X; measurable end state = tests pass + no new lint errors + ADR updated.” Use Skills to inject CLAUDE.md, pre-approve Bash(npm test), and fork Plan/Explore subagents for planning before execution. Official plugin: /codex:review for read-only Codex audit; /codex:setup --enable-review-gate can block Claude from finishing until Codex reviews.

Hermes Agent — orchestrator + verifier

Hermes persists goals in SessionDB.state_meta, survives /resume, and runs a separate goal_judge model each turn (~4 KB of the last response → JSON {"done": bool, "reason": "..."}). Default 20 continuation turns (goals.max_turns); /goal resume resets the counter. Subgoals tighten criteria mid-loop: /subgoal add regression test for bug Y.
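The judge contract can be sketched as a tail-and-parse step. This is not Hermes source code: `tail` and `judge_verdict` are hypothetical helpers. The 4 KB window and the `{"done": bool, "reason": "..."}` shape come from the description above, and treating a malformed reply as "continue" follows the fail-open behaviour listed in the cost snapshot later in this article.

```python
import json

def tail(text: str, limit_bytes: int = 4096) -> str:
    """Keep roughly the last 4 KB of a response for the judge prompt."""
    return text.encode("utf-8")[-limit_bytes:].decode("utf-8", errors="ignore")

def judge_verdict(raw_reply: str) -> dict:
    """Parse a judge reply; malformed replies mean 'keep going' (fail-open)."""
    try:
        data = json.loads(raw_reply)
        if not isinstance(data, dict) or not isinstance(data.get("done"), bool):
            raise ValueError("missing boolean 'done'")
        return {"done": data["done"], "reason": str(data.get("reason", ""))}
    except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
        return {"done": False, "reason": f"judge error: {err}"}

print(judge_verdict('{"done": true, "reason": "tests pass"}'))
print(judge_verdict("not json")["done"])  # prints False
print(len(tail("a" * 5000)))              # prints 4096
```

Because a broken judge reply never ends the goal, the turn budget is what ultimately stops a loop whose judge keeps failing.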

Saboo’s Kanban pattern (community, also documented on goal-feature guides): cards like CODEX GOAL: BUILD …, CLAUDE CODE GOAL: REVIEW …, HERMES GOAL: VERIFY … on a board at 127.0.0.1:9118/kanban—each card is its own /goal, agents keep looping until judges confirm.

Hermes /goal commands

/goal Fix every failing test in tests/auth/ and confirm scripts/run_tests.sh passes

/goal status
/goal pause
/goal resume    # resets turn counter
/goal clear

/subgoal add a regression test for the JWT refresh bug
/subgoal        # list subgoals

Cheap judge routing (optional) in ~/.hermes/config.yaml:

goals:
  max_turns: 20

auxiliary:
  goal_judge:
    provider: openrouter
    model: google/gemini-3-flash-preview

Anti-patterns (and fixes)

| Anti-pattern | Why it fails | Write instead |
| --- | --- | --- |
| “Make it better” | No judge checklist | Tests + lint + grep rules |
| End state = “agent says done” | Self-grading | Command exit codes and file artifacts |
| No scope limits | Drift into CI, secrets, deps | Directory whitelist |
| Seven tasks in one goal | Judge thrashes | Split across Kanban cards |
| Trust build output in chat | Fake PASS | Hermes re-runs build/test in shell |

Verifier checklist (shell-first)

  • Did the agent paste stdout/stderr from the real command, or only claim success?
  • Re-run npm test / pytest / cargo test yourself—or let Hermes run it before marking done
  • Check git status and diff scope match constraints
  • Reject goals completed on a dirty tree without an explicit branch strategy
  • Read the judge reason on ↻ Continuing / ✓ Goal achieved lines when verdicts look wrong
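The "diff scope matches constraints" item on the checklist can be automated with a small path check. The helper below is a hypothetical sketch, not part of any of the tools discussed: it takes a list of changed paths (for example, the output of `git diff --name-only`) and the directories the goal allows, and returns whatever falls outside them.

```python
from pathlib import PurePosixPath

def out_of_scope(changed_paths, allowed_dirs):
    """Return the changed paths that fall outside every allowed directory."""
    allowed = [PurePosixPath(d) for d in allowed_dirs]
    violations = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # A path is in scope if an allowed directory is one of its ancestors.
        if not any(root == path or root in path.parents for root in allowed):
            violations.append(raw)
    return violations

changed = ["app/auth/jwt.py", "tests/auth/test_jwt.py", "app/billing/invoice.py"]
print(out_of_scope(changed, ["app/auth/", "tests/auth/"]))
# prints ['app/billing/invoice.py']
```

Comparing path components, not string prefixes, avoids accepting a sibling such as app/authz/ when only app/auth/ is allowed.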

Example goal (copy-paste template)

/goal Implement JWT refresh tokens for the auth module.

Measurable end state:
- pytest tests/auth/ -q exits 0 with ≥90% coverage on app/auth/
- bandit -r app/auth/ reports no HIGH issues
- docs/API.md lists /auth/refresh with request/response schema
- git status --porcelain is empty

Constraints:
- Only edit app/auth/ and tests/auth/
- No changes to billing or admin packages
- Conventional Commits; one commit per logical step

Stop: after 20 turns write BLOCKERS.md and pause.

Performance and cost snapshot

| Knob | Default / typical | Notes |
| --- | --- | --- |
| Hermes continuation budget | 20 turns | Auto-pause; /goal resume for another chunk |
| Judge call size | ~200 output tokens / turn | Route to cheap model to save cost |
| Judge errors | Fail-open → continue | Budget is the hard backstop |
| User message during goal | Preempts continuation | Your input wins over auto-loop |
| Codex plugin review | Read-only | /codex:review does not mutate files |
| Token win vs “keep going” | Fewer human turns | Trade-off: longer autonomous spend per goal |

Saboo’s line—“Workers change. The primitive stays the same.”—is the takeaway: whether Codex ships the feature, Claude reviews the diff, or Hermes orchestrates from a Mac Mini, success depends on defining done in the shell and verifying before you believe. Start with a 10-minute goal (four files, one test command), watch the judge loop once, then promote the same template to your real refactor.

Research supplement

The following sources provide additional context for the tools and concepts discussed in this article. All URLs are from official sources or well-established repositories.

  • OpenAI Codex (agentic coding agent, May 2025): OpenAI reintroduced the Codex name in May 2025 as a cloud-hosted autonomous coding agent — distinct from the 2021 code-completion model. It accepts natural-language goals and executes multi-step coding tasks in isolated sandboxes. Official announcement: openai.com/index/introducing-codex/
  • OpenAI Codex open-source CLI: The open-source terminal-based Codex CLI agent, configurable for suggest / auto-edit / full-auto approval modes. Repository: github.com/openai/codex
  • Claude Code extensibility via MCP: Claude Code's primary extensibility mechanism is the Model Context Protocol (MCP), which allows external tools and services to expose capabilities as Claude-callable tools. Official documentation: docs.anthropic.com/en/docs/claude-code
  • Model Context Protocol (MCP): The open protocol Anthropic released in late 2024 that underpins Claude Code's plugin ecosystem, now widely adopted across the AI tooling landscape. Specification and SDKs: modelcontextprotocol.io

Note: Readers should verify that the referenced repositories and URLs remain active and up to date.


CodeGraph: Cut AI Coding Agent Tool Calls With a Local Semantic Code Index

CodeGraph pre-indexes your repository into a local semantic knowledge graph so coding agents (Claude Code, Cursor, Codex, OpenCode, and others) spend fewer tokens on grep-and-read exploration. Dr. Alvaro Cintas’s LinkedIn post highlights up to 94% fewer tool calls and 77% faster codebase exploration; the project’s own multi-repo benchmarks report medians closer to ~71% fewer calls and ~46% faster on average—with the largest wins on big TypeScript and Rust trees.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  REPO[Source files] --> INDEX[tree-sitter + SQLite graph]
  INDEX --> MCP[CodeGraph MCP server]
  MCP --> AGENT[Claude Code / Cursor / Codex]
  AGENT --> OUT[Answer with fewer Read/Grep loops]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff

  class AGENT agent
  class INDEX hook
  class MCP hook

Figure: Pre-indexed code graph replaces repeated grep and file-read loops with fewer agent tool calls. Without an index, agents re-scan the repo; CodeGraph answers from a local map built once.

The problem: exploration burns tokens

When an agent lacks structural context, it often spawns Explore sub-agents that chain grep, glob, and Read across thousands of files—paying model tokens for every hop. Architecture questions on repos like VS Code or Excalidraw can balloon to dozens of tool calls and millions of tokens before the model reads the right module.

Figure: Symbol and call-graph index stored locally on the developer machine for MCP agents. SQLite-backed graph keeps source intelligence on-device for Claude Code, Cursor, and Codex.

What CodeGraph does

CodeGraph (MIT, by Colby McHenry) builds a pre-indexed graph on your machine: symbols, call relationships, full-text search (SQLite FTS5), framework routes, and optional cross-language bridges (Swift↔ObjC, React Native, Expo). Agents reach it through an MCP server (codegraph serve --mcp)—no source upload, no API keys for indexing.

  • Smart context: tools like codegraph_context return entry points, related symbols, and snippets in one shot
  • Traversal: explore callers, callees, and impact radius before refactors
  • Routes: 14+ web frameworks (Django, FastAPI, Express, NestJS, Rails, Spring, Gin, Axum, etc.) link URL patterns to handlers
  • Fresh index: native file watchers (FSEvents / inotify / ReadDirectoryChangesW) with debounced re-sync; staleness banners during pending updates
  • 20+ languages: TypeScript, Python, Rust, Go, Java, Swift, Kotlin, C#, PHP, and more
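The traversal idea (callers, callees, impact radius) can be shown with a toy call graph in SQLite. This is not CodeGraph's schema or code: the `calls` table, `build_index`, and `impact_radius` are assumptions made for illustration, and the real tool adds tree-sitter parsing and FTS5 search that are omitted here. The point is only that one graph query can replace a chain of grep-and-read hops.

```python
import sqlite3

def build_index(edges):
    """Store caller -> callee edges in an in-memory SQLite table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE calls (caller TEXT, callee TEXT)")
    db.executemany("INSERT INTO calls VALUES (?, ?)", edges)
    return db

def impact_radius(db, symbol):
    """Every symbol that reaches `symbol` through calls, via a recursive CTE."""
    rows = db.execute(
        """
        WITH RECURSIVE up(name) AS (
            SELECT caller FROM calls WHERE callee = ?
            UNION
            SELECT c.caller FROM calls c JOIN up ON c.callee = up.name
        )
        SELECT name FROM up ORDER BY name
        """,
        (symbol,),
    ).fetchall()
    return [row[0] for row in rows]

db = build_index([
    ("main", "login"),
    ("login", "hash_pw"),
    ("signup", "hash_pw"),
])
print(impact_radius(db, "hash_pw"))  # prints ['login', 'main', 'signup']
```

`UNION` (not `UNION ALL`) deduplicates rows, so the query also terminates on recursive or cyclic call chains.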

Benchmarks (with vs without CodeGraph)

Official methodology (re-validated on v0.9.4, May 2026): headless Claude Code with Opus 4.7, one architecture question per repo, 4 runs per arm, median reported. WITH = CodeGraph MCP enabled; WITHOUT = empty MCP config but built-in Read/Grep/Bash still available.

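| Codebase | Language | Tool calls saved | Tokens saved | Time saved | Cost saved |
| --- | --- | --- | --- | --- | --- |
| VS Code | TypeScript (~10k files) | 85% | 78% | 52% | 26% |
| Excalidraw | TypeScript (~640 files) | 96% | 90% | 73% | 52% |
| Tokio | Rust (~790 files) | 92% | 86% | 71% | 82% |
| Django | Python (~3k files) | 53% | 36% | 19% | 12% |
| Alamofire | Swift (~110 files) | 83% | 64% | 48% | 47% |
| Gin | Go (~110 files) | 40% | 34% | 27% | 21% |
| Average | 7 repos | 71% | 57% | 46% | 35% |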

Example medians on VS Code (“How does the extension host communicate with the main process?”): 8 tool calls with CodeGraph vs 55 without; ~601k vs ~2.8M tokens. Cintas’s 94% / 77% figures align with the best large-repo cells (e.g. Excalidraw 96% fewer calls, 73% faster)—not every project sees that peak; small repos like Gin show narrower margins because naive search is already cheap.
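The "saved" percentages follow from the with/without medians by simple arithmetic. The helper below is a hypothetical convenience function, applied to the VS Code figures quoted above; because the token counts are given only approximately (~601k and ~2.8M), the result should be read as approximate too.

```python
def percent_saved(without: float, with_index: float) -> float:
    """Share of the baseline that the indexed run avoided, in percent."""
    return 100.0 * (1.0 - with_index / without)

calls = percent_saved(55, 8)                 # tool calls: 55 without, 8 with
tokens = percent_saved(2_800_000, 601_000)   # tokens: ~2.8M without, ~601k with

print(f"{calls:.1f}% fewer tool calls, {tokens:.1f}% fewer tokens")
# prints "85.5% fewer tool calls, 78.5% fewer tokens"
```

Both values agree with the VS Code row of the benchmark table (85% and 78%) to within rounding of the approximate inputs.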

Supported agents

Interactive installer wires MCP config for Claude Code, Cursor, Codex CLI, OpenCode, Hermes Agent, Gemini CLI, Antigravity, and Kiro. The LinkedIn post matches the README’s core quartet plus OpenCode.

Install and index a project

# macOS / Linux (bundled runtime — no Node required)
curl -fsSL https://raw.githubusercontent.com/colbymchenry/codegraph/main/install.sh | sh

# Or npm / npx
npm i -g @colbymchenry/codegraph

# Register MCP with your agent(s)
codegraph install

# Index the repo (interactive init)
cd your-project
codegraph init -i

# Optional: serve MCP manually
codegraph serve --mcp

Indexes live under .codegraph/ per project. Remove agent integration with codegraph uninstall; drop project data with codegraph uninit.

Why it wins (and when it does not)

| Works well | Less benefit |
| --- | --- |
| Large monorepos and architecture / “how does X work?” questions | Tiny codebases where grep is already fast |
| Privacy-sensitive or air-gapped work (100% local SQLite) | Agents that ignore MCP and delegate everything to file-reading sub-agents |
| Impact analysis before wide refactors | Tasks that depend on live files the watcher has not synced yet |
| Multi-language mobile (RN / Expo bridging) | One-off edits where the model already knows exact file paths |

Maintainers note CodeGraph only helps when the primary agent queries the graph directly; otherwise Explore sub-agents may still burn tokens on raw file reads. Project instructions steer agents toward codegraph_context first, then targeted exploration—mirroring the “don’t burn tokens exploring” message in Cintas’s post.

Performance snapshot

| Metric | Typical range (official medians) |
| --- | --- |
| Fewer tool calls | 40–96% per repo; ~71% average |
| Fewer tokens | 13–90%; ~57% average |
| Faster wall time | 19–73%; ~46% average |
| Lower run cost (Claude Opus 4.7) | 2–82%; ~35% average |
| Calls with index (VS Code example) | 8 vs 55 without |
| License / hosting | MIT tool; index stays local |

For teams running agents on big codebases daily, CodeGraph is a practical layer between “raw repository” and “model context”: pay indexing cost once, then replace repetitive discovery loops with graph queries. Start with codegraph init -i on your main app, confirm MCP is active in your agent, and compare tool-call counts on the same architecture prompt—with and without the index.


Voxtral TTS: Mistral’s Open-Weight Voice Model vs ElevenLabs (What Changed)

Voxtral TTS is Mistral’s new open-weight text-to-speech model—a 4B-parameter stack aimed at voice agents that can be self-hosted or called via API. Viral posts claim Mistral “made ElevenLabs open source”; in practice Mistral shipped a competing TTS layer with public weights on Hugging Face, not ElevenLabs’ proprietary models.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  TEXT[Text + 3s voice reference] --> VOX[Voxtral TTS 4B]
  VOX --> SEM[Semantic tokens AR]
  SEM --> FLOW[Flow-matching acoustic]
  FLOW --> CODEC[Voxtral codec 12.5 Hz]
  CODEC --> AUDIO[24 kHz speech stream]
  AUDIO --> AGENT[Voice agent / support / dubbing]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff

  class VOX agent
  class AUDIO agent
  class SEM hook
  class FLOW hook
Open-weight TTS runs on your infrastructure while closed APIs process audio in a vendor cloud

Voxtral ships downloadable weights; proprietary voice platforms keep models behind APIs.

What actually launched

On 23 March 2026, Mistral announced Voxtral TTS: its first production TTS model for enterprise voice workflows. Weights and preset reference voices live on Hugging Face under CC BY-NC 4.0 (research and non-commercial use of those weights; commercial deployment typically goes through Mistral’s paid API). ElevenLabs remains a separate, closed platform—Mistral’s pitch is quality and control without renting every audio frame from a single vendor.

A short voice sample plus text becomes streaming speech for voice agents

Three-second cloning and emotion steering target real-time agent workflows.

Headline specs

| Dimension | Voxtral TTS (Mistral) | Typical closed TTS (e.g. ElevenLabs) |
| --- | --- | --- |
| Weights | Open on HF; self-host with vLLM-Omni (≥16 GB GPU) | API-only; no public weights |
| Size | ~4B parameters (Ministral 3B backbone + acoustic stack) | Undisclosed proprietary stacks |
| Languages | 9: EN, FR, DE, ES, NL, PT, IT, HI, AR | Broader catalogue and voice library on incumbent platforms |
| Voice clone | From ~3 s reference; captures accent, pauses, disfluencies | Mature cloning on flagship tiers |
| Latency | ~70 ms time-to-first-audio (10 s ref + 500 chars, per Mistral) | Flash-tier products optimised for low TTFA |
| API price | $0.016 / 1k characters (Mistral API) | Tiered subscriptions + usage caps |
| Human eval vs Flash v2.5 | 68.4% preference in zero-shot multilingual cloning (paper) | Incumbent benchmark for fast tier |
| Emotion / prosody | Emotion steering (neutral, happy, sarcastic, etc.); claimed parity with ElevenLabs v3 tier | v3 often cited for expressive flagship voices |

Architecture (how it works)

Voxtral TTS is a hybrid generative stack, not a single end-to-end waveform net:

  • 3.4B transformer decoder (autoregressive semantic speech tokens)
  • 390M flow-matching acoustic transformer (16 NFEs per frame)
  • 300M in-house Voxtral codec (semantic VQ + acoustic FSQ at 12.5 Hz)
  • Inputs: text + voice prompt (roughly 5–25 s in the technical blog; cloning demos use 3 s minimum)
  • Output: 24 kHz audio (WAV, PCM, FLAC, MP3, AAC, Opus via API)
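
The component figures above can be cross-checked with a few lines of arithmetic. This is a sketch built only from the numbers in the list; it assumes the 16 NFEs apply to each 12.5 Hz codec frame, as the bullet wording suggests, and the variable names are illustrative:

```python
# Component sizes and rates as listed above
DECODER_PARAMS = 3.4e9     # autoregressive semantic decoder
ACOUSTIC_PARAMS = 390e6    # flow-matching acoustic transformer
CODEC_PARAMS = 300e6       # Voxtral codec
FRAME_RATE_HZ = 12.5       # codec frames per second of audio
NFES_PER_FRAME = 16        # flow-matching function evaluations per frame
SAMPLE_RATE_HZ = 24_000    # output audio sample rate

total_params = DECODER_PARAMS + ACOUSTIC_PARAMS + CODEC_PARAMS
frame_ms = 1000 / FRAME_RATE_HZ
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ
nfes_per_audio_second = FRAME_RATE_HZ * NFES_PER_FRAME

print(f"total parameters: {total_params / 1e9:.2f}B")          # 4.09B
print(f"one codec frame covers {frame_ms:.0f} ms of audio")    # 80 ms
print(f"samples per frame: {samples_per_frame:.0f}")           # 1920
print(f"NFEs per second of audio: {nfes_per_audio_second:.0f}")  # 200
```

The total of about 4.09B matches the "~4B" headline, and the 80 ms frame length shows why a low codec rate keeps the autoregressive step count small.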

Zero-shot cross-lingual adaptation is a differentiator: e.g. English text with a French reference clip can yield French-accented English without explicit cross-lingual training—useful for dubbing and cascaded speech-to-speech pipelines alongside Voxtral Transcribe.

Benchmarks and caveats

Mistral and the arXiv report emphasise native-speaker listening tests, not word-error rate alone. Reported highlights:

  • 68.4% win rate vs ElevenLabs Flash v2.5 on multilingual zero-shot custom voices
  • Competitive with strong proprietary systems on flagship preset voices (smaller margin than cloning setup)
  • Automatic metrics: strong on SEED-TTS / MiniMax-TTS; speaker-similarity claims vs ElevenLabs v3 in paper tables
  • vLLM-Omni on one H200: ~70 ms latency at concurrency 1; RTF ≈0.10 (standard convention: lower is faster)
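
Real-time factor (RTF) and "times faster than real time" are reciprocals, which is worth keeping straight when comparing the RTF ≈0.10 above with the ≈9.7× figure in the performance snapshot. A minimal sketch with illustrative function names:

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF: seconds of compute per second of audio produced (lower is faster)."""
    return generation_seconds / audio_seconds


def speedup_from_rtf(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    return 1 / rtf


def rtf_from_speedup(speedup: float) -> float:
    """Inverse conversion: a speed-up multiple back to RTF."""
    return 1 / speedup


print(f"RTF 0.10 -> {speedup_from_rtf(0.10):.1f}x real time")   # 10.0x
print(f"9.7x real time -> RTF {rtf_from_speedup(9.7):.3f}")     # 0.103
```

Both published numbers therefore describe the same speed, expressed in opposite conventions.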

Read claims carefully: evaluations were run by Mistral; Flash is the speed tier, while ElevenLabs v3 is the expressive flagship—Mistral argues parity on emotion, not a clean “beats everything” sweep. CC BY-NC is not the same as Apache-style commercial open source: product teams needing unrestricted commercial use of weights should confirm license terms or use the API.

Run it yourself (vLLM-Omni)

# Install (see HF model card for pinned versions)
uv pip install -U vllm
uv pip install vllm-omni --upgrade  # >= 0.18.0
python3 -c "import mistral_common; print(mistral_common.__version__)"  # >= 1.10.0

# Serve on a GPU with >= 16 GB VRAM
vllm serve mistralai/Voxtral-4B-TTS-2603 --omni

# Client (OpenAI-style audio/speech endpoint); run as a Python script
import io

import httpx
import soundfile as sf

BASE_URL = "http://localhost:8000/v1"
payload = {
    "input": "Paris is a beautiful city!",
    "model": "mistralai/Voxtral-4B-TTS-2603",
    "response_format": "wav",
    "voice": "casual_male",
}
r = httpx.post(f"{BASE_URL}/audio/speech", json=payload, timeout=120.0)
r.raise_for_status()
# Decode the returned WAV bytes into float32 samples plus the sample rate
audio, sr = sf.read(io.BytesIO(r.content), dtype="float32")

Try presets in Mistral Studio or Le Chat; record a custom reference for cloning. Gradio demo ships with vllm-omni examples; a Hugging Face Space is linked from the model card.

When to pick which stack

| Choose Voxtral TTS | Stay on incumbent TTS (e.g. ElevenLabs) |
| --- | --- |
| Data must stay on your VPC / edge device | Need 20+ languages or huge preset voice marketplace |
| Voice agent at scale with predictable infra cost | Turnkey enterprise agent platform + compliance bundle |
| Research, NC fine-tuning, or Mistral API at $0.016/1k chars | Maximum expressive flagship quality without running GPUs |
| Multilingual cloning in the 9 supported locales | Heavily regulated workflow already certified on one vendor |

Enterprise use cases (from Mistral)

  • Customer support and contact-centre voice bots
  • Banking / KYC voice agents (demo narratives in launch materials)
  • In-vehicle and industrial hands-free UX
  • Real-time translation and dubbing with cross-lingual voice carry-over
  • Sales, marketing, and compliance read-outs paired with Voxtral speech-to-text

Performance snapshot

| Metric | Value | Notes |
| --- | --- | --- |
| Parameters | 4B | BF16 weights on HF |
| TTFA | ~70 ms | 10 s reference + 500 characters (Mistral blog) |
| Speed vs real time | ≈9.7× (RTF ≈0.10) | Generates faster than real time (company blog) |
| Clone reference | ≥3 s | Up to ~2 min generation per native chunk; API can interleave longer jobs |
| Human preference vs Flash v2.5 | 68.4% | Zero-shot multilingual custom voice test |
| API pricing | $0.016 / 1k chars | Mistral API; self-host avoids per-char fees, not GPU cost |

The LinkedIn framing—“Mistral made ElevenLabs open source”—captures the shift (frontier TTS weights you can run yourself) more than the literal fact pattern. For builders, the actionable story is simpler: Voxtral TTS is a credible open-weight speech layer for agents, with measured wins on cloning latency and multilingual naturalness, while proprietary incumbents still win on ecosystem breadth until you need on-prem control.

Research supplement

Technical details confirmed from the official HuggingFace model card (mistralai/Voxtral-4B-TTS-2603):

  • License: CC BY-NC 4.0 (not Apache 2.0 — non-commercial open-weight)
  • Base model: mistralai/Ministral-3-3B-Base-2512
  • Minimum GPU memory: 16 GB
  • Serving framework: vLLM Omni v0.18.0+
  • Benchmark hardware: single NVIDIA H200
  • Throughput at concurrency 32: 1,430 characters/second/GPU
  • Voice references sourced from EARS, CML-TTS, IndicVoices-R, and Arabic Natural Audio datasets
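
The API price and the model-card throughput can be combined into a rough self-host comparison. This is back-of-envelope arithmetic, not a pricing model: it uses only the $0.016 / 1k characters and 1,430 characters/second/GPU figures from the article, assumes the GPU stays saturated at concurrency 32, and leaves the hourly GPU cost to you because the article does not state one.

```python
API_PRICE_PER_1K_CHARS = 0.016   # USD, Mistral API price quoted in the article
THROUGHPUT_CHARS_PER_S = 1430    # per GPU at concurrency 32, per the model card


def api_cost_usd(chars: int) -> float:
    """Cost of synthesising `chars` characters through the paid API."""
    return chars / 1000 * API_PRICE_PER_1K_CHARS


# Characters one fully loaded GPU could process in an hour
chars_per_gpu_hour = THROUGHPUT_CHARS_PER_S * 3600

print(f"1M characters via API: ${api_cost_usd(1_000_000):.2f}")
print(f"characters per saturated GPU-hour: {chars_per_gpu_hour:,}")
print(f"API cost of that volume: ${api_cost_usd(chars_per_gpu_hour):.2f}")
```

If your hourly GPU cost is below the last figure and you can keep the card busy, self-hosting is cheaper per character; at low utilisation the API usually wins. Remember that the CC BY-NC weights restrict commercial self-hosting regardless of cost.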

The official research paper is available at arxiv.org/abs/2603.25551. The Mistral announcement is at mistral.ai/news/voxtral-tts.

---


Run Qwen 3.6 MTP in llama.cpp: Faster Local Inference With Built-In Speculative Decoding

Multi-token prediction (MTP) in llama.cpp speeds up local Qwen 3.6 generation by building speculative decoding into the model itself—Hugging Face CTO Julien Chaumond’s quickstart shows you only need a recent build, an MTP GGUF from ggml-org, and two flags on llama-server.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  CLI[llama-server + MTP GGUF] --> FLAGS["--spec-type draft-mtp"]
  FLAGS --> DENSE[Dense 27B MTP]
  FLAGS --> MOE[MoE 35B-A3B MTP]
  DENSE --> OUT[Faster token stream]
  MOE --> OUT

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff

  class CLI agent
  class OUT agent
  class FLAGS hook
MTP drafts several tokens ahead then the main model confirms them for faster output

Multi-token prediction bundles draft guesses inside the same model file so decode steps emit more accepted text.

What MTP changes

MTP is a draft head trained with the base model, not a separate small “speculator” you download and wire up by hand. At decode time the head proposes several candidate next tokens; the main model verifies them in one pass. When draft tokens are accepted, you emit more text per forward step. Chaumond and the merged llama.cpp MTP PR (#22673) describe roughly 2× generation throughput in favourable setups, though real gains depend on hardware, quantisation, and how many draft tokens you allow.

The MTP weights ship in the same GGUF as the main checkpoint; llama.cpp loads a lightweight MTP context (extra KV cache, typically under ~10% memory versus the full model). You opt in with flags—MTP does not run unless you ask for it.
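
The draft-then-verify step can be illustrated with a toy greedy version. This is a conceptual sketch of speculative decoding in general, not llama.cpp's implementation; token IDs are plain integers and `verify_draft` is a hypothetical helper.

```python
from typing import Sequence


def verify_draft(draft: Sequence[int], target: Sequence[int]) -> list[int]:
    """Return the tokens emitted by one greedy draft-and-verify step.

    draft:  n tokens proposed by the draft head.
    target: n + 1 tokens the main model would pick at the same positions,
            obtained from a single forward pass over the drafted sequence.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have one more token than draft")
    emitted: list[int] = []
    for proposed, verified in zip(draft, target):
        if proposed != verified:
            # First mismatch: keep the main model's token and stop.
            emitted.append(verified)
            return emitted
        emitted.append(proposed)
    # Every draft token matched, so the extra target token comes for free.
    emitted.append(target[-1])
    return emitted


print(verify_draft([5, 7], [5, 7, 9]))  # all accepted: [5, 7, 9]
print(verify_draft([5, 7], [5, 8, 9]))  # second rejected: [5, 8]
print(verify_draft([5, 7], [4, 8, 9]))  # first rejected: [4]
```

Every step emits at least one token, so output matches what the main model would have produced alone; the gain comes only from steps where several drafts are accepted at once.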

Choose dense 27B MTP for balance or MoE 35B-A3B MTP for maximum throughput

Both checkpoints use the same MTP flags; pick the variant that matches your RAM and speed goals.

Prerequisites

| Requirement | Detail |
| --- | --- |
| llama.cpp build | MTP merged 16 May 2026; Chaumond suggests brew upgrade llama.cpp or brew install llama.cpp --HEAD until package managers ship build 9200+ |
| Model files | Qwen3.6-27B-MTP-GGUF (dense) or Qwen3.6-35B-A3B-MTP-GGUF (MoE) |
| Memory | ~48–64 GB RAM or VRAM comfortable; ~36 GB may work with stronger quants (Q4/Q6, Unsloth-style packs) |
| Pull models | -hf ggml-org/… on llama-server downloads from the Hub automatically |

Commands (copy-paste)

Install or refresh llama.cpp, then start the server with MTP enabled. Chaumond’s post uses --spec-draft-n-max 2 on dense and 3 on MoE; community benchmarks on the MoE often favour n-max 2 when acceptance rate drops at wider draft windows.

# Refresh llama.cpp (macOS example)
brew upgrade llama.cpp
# Or until stable packages catch up:
# brew install llama.cpp --HEAD

# Dense 27B — balanced quality (~30 tok/s on author’s box)
llama-server -hf ggml-org/Qwen3.6-27B-MTP-GGUF \
  --spec-type draft-mtp --spec-draft-n-max 2

# MoE 35B-A3B — much faster when it fits (~100 tok/s in the post)
llama-server -hf ggml-org/Qwen3.6-35B-A3B-MTP-GGUF \
  --spec-type draft-mtp --spec-draft-n-max 3

Optional: add --no-mmproj if you do not need vision—saves memory. Advanced users can combine MTP with ngram drafting on supported builds; treat that as experimental.

Dense vs MoE: which to pick

| Variant | When it fits | Draft depth (starting point) | Notes from the thread |
| --- | --- | --- | --- |
| Dense 27B MTP | Single-GPU rigs aiming for steady quality | --spec-draft-n-max 2 | Chaumond reports ~30 tok/s locally; PR benches show ~1.8–2× decode vs no MTP on RTX 3090-class setups |
| MoE 35B-A3B MTP | High RAM/VRAM, throughput-first coding/chat | Try 2 first, then 3 | Post claims ~100 tok/s; independent runs show +20–30% at n-max 2, shrinking or negative returns at n-max 4 when acceptance falls |

How to read speed-up claims

  • Decode vs prefill: MTP mainly helps token generation; prompt processing can be slower because of extra embedding transfers (noted in the PR).
  • Acceptance rate: Wider --spec-draft-n-max drafts more tokens per step but wastes work when guesses are wrong—measure predicted_per_second and draft acceptance, not prompt-processing rate.
  • Quality: PR authors ran AIME-style evals; scores stayed in line with Qwen’s published benchmarks when MTP is enabled.
  • Hardware spread: reports from Strix Halo, RTX 4090/5090, and laptops with 6 GB VRAM plus system RAM range from modest (~1.2×) to near 2× depending on quant and n-max.
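
Why wider draft windows stop paying off can be seen with a simplified model. The sketch below assumes each draft token is accepted independently with the same probability, which real workloads only approximate, and it ignores the cost of drafting; the acceptance values are hypothetical, not measurements.

```python
def expected_tokens_per_step(acceptance: float, n_max: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes i.i.d. acceptance with probability `acceptance` for each of
    `n_max` drafted tokens; the step always emits at least one token.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be in [0, 1]")
    if n_max < 0:
        raise ValueError("n_max must be non-negative")
    if acceptance == 1.0:
        return float(n_max + 1)
    # Geometric series: 1 + a + a^2 + ... + a^n_max
    return (1 - acceptance ** (n_max + 1)) / (1 - acceptance)


for acceptance in (0.8, 0.5):
    for n_max in (2, 3, 4):
        tokens = expected_tokens_per_step(acceptance, n_max)
        print(f"acceptance={acceptance}, n-max={n_max}: {tokens:.2f} tokens/step")
```

At 0.8 acceptance, going from n-max 2 to 3 adds about half a token per step; at 0.5 it adds about an eighth, which is easily outweighed by the extra draft and verification work.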

Common confusion (answered)

| Question | Answer |
| --- | --- |
| Do I need a second GGUF for the draft model? | No for MTP: one MTP-tagged GGUF includes the head; classic speculative decoding still uses a separate small draft checkpoint. |
| Why does my MoE slow down with n-max 3? | Lower acceptance means rejected drafts cost extra compute; try 2 and watch acceptance in server logs. |
| Does MTP work with tensor parallel / vision? | Yes in principle per the PR; some backend combos (e.g. tensor split + MTP) were still being fixed, so test your stack. |
| Is this the same as “sharing to the Hub”? | No. The LinkedIn slug is generic; this post is specifically about running Qwen 3.6 MTP locally in llama.cpp. |

Performance snapshot

| Scenario | Approximate effect | Source |
| --- | --- | --- |
| 27B Q6_K, RTX 3090 decode | 22.4 → 42.5 tok/s (~1.9×) | PR comment benchmark, MTP on vs off |
| 35B-A3B MoE, 6 GB VRAM + 64 GB RAM | 22.9 → 29.4 tok/s at n-max 2 | Community bench in PR thread |
| Author machine (Chaumond) | ~30 tok/s dense, ~100 tok/s MoE | LinkedIn post (May 2026) |
| MoE MXFP4, RTX PRO 24 GB | 91 → 111 tok/s at n-max 2 (~+22%) | LinkedIn comment (not ~2×) |
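
The before/after figures above convert to speed-up ratios as follows. The sketch uses only the tok/s pairs in the table; the dictionary keys and helper name are illustrative.

```python
# (tok/s without MTP, tok/s with MTP) as reported in the table above
reported = {
    "27B Q6_K, RTX 3090": (22.4, 42.5),
    "35B-A3B MoE, 6 GB VRAM + 64 GB RAM": (22.9, 29.4),
    "MoE MXFP4, RTX PRO 24 GB": (91.0, 111.0),
}


def speedup(before: float, after: float) -> float:
    """Ratio of decode throughput with MTP to throughput without it."""
    return after / before


for name, (before, after) in reported.items():
    ratio = speedup(before, after)
    print(f"{name}: {ratio:.2f}x ({(ratio - 1) * 100:+.0f}%)")
```

Only the dense 3090 run approaches 2×; both MoE reports land in the +20–30% band, which is the more realistic expectation until you have measured your own setup.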

MTP turns Qwen 3.6 local runs from “one token per heavy step” into “verify a short bundle of guesses”—with a single Hub pull and two CLI flags once llama.cpp is current. Start with the dense GGUF if memory is tight; reach for the MoE MTP pack when you have headroom and care about tokens per second for long coding or agent loops.

Research supplement

  • MTP origins: Multi-Token Prediction as a training objective was formalised in Meta's 2024 paper showing that training models to predict multiple future tokens simultaneously improves both sample efficiency and downstream task performance, with the side effect of producing usable draft heads for inference-time speculation.
  • DeepSeek precedent: DeepSeek models (notably DeepSeek-V3 and DeepSeek-R1) also shipped with MTP heads and demonstrated real-world inference speedups using them, establishing the pattern that Qwen 3.6 follows.
  • llama.cpp PR #22673: The merged pull request is the authoritative reference for implementation details, accepted flags, and any caveats around quantization compatibility. Readers building from source should verify their commit is at or after this merge.
  • ggml-org GGUF files: The Qwen3.6-27B-MTP-GGUF and Qwen3.6-35B-A3B-MTP-GGUF repositories on Hugging Face are the canonical download locations and include model cards with quantization options.
---


HF Viewer: Interactive Hugging Face Model Architecture Graphs in Your Browser

HF Viewer (hfviewer.com) is a free browser tool from Embedl that turns any public Hugging Face model into an interactive architecture graph—paste a repo URL, swap huggingface.co for hfviewer.com, or embed the graph in your model card without installing PyTorch locally.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  HF[Hugging Face model page] --> URL[hfviewer.com/owner/model]
  URL --> GRAPH[Interactive architecture graph]
  GRAPH --> ZOOM[Granularity: overview to blocks]
  GRAPH --> EMBED[Optional README embed]
  GRAPH --> EXT[Chrome extension on HF]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff

  class GRAPH agent
  class HF hook
  class EMBED hook
Browser URL changes from Hugging Face to HF Viewer and opens an interactive block diagram

The fastest way to open a graph is to change the domain in any public model link.

What HF Viewer does

Model cards explain what a checkpoint is for; they rarely give you a fast map of how it is wired. HF Viewer fills that gap: open a graph of layers, attention blocks, MoE routes, vision encoders, and merges directly in the browser. Embedl describes it as a “first architectural pass” before you read configs, trace code, or plan deployment and latency.

Overview diagram on the left expands into detailed nested blocks on the right via a granularity control

Use granularity levels to move from system shape down to specific traced paths.

Three ways to open a graph

| Method | How | Best for |
| --- | --- | --- |
| URL swap | Replace huggingface.co with hfviewer.com in any model URL | Zero setup; sharing links with teammates |
| Paste on homepage | Full HF URL, hfviewer URL, or owner/model | Quick lookup from chat or docs |
| Chrome extension | “Hugging Face Viewer” on HF model pages | Browsing many repos in one session |

Example: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro becomes https://hfviewer.com/deepseek-ai/DeepSeek-V4-Pro.
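
If you keep lists of model links, the domain swap is easy to script. A minimal sketch using only the standard library; `to_hfviewer` is a hypothetical helper, and it assumes the path is carried over unchanged, as in the example above.

```python
from urllib.parse import urlsplit, urlunsplit


def to_hfviewer(url: str) -> str:
    """Rewrite a huggingface.co model URL to its hfviewer.com counterpart."""
    parts = urlsplit(url)
    if parts.netloc.lower() != "huggingface.co":
        raise ValueError(f"not a huggingface.co URL: {url}")
    # Keep scheme, path, query, and fragment; swap only the host.
    return urlunsplit(parts._replace(netloc="hfviewer.com"))


print(to_hfviewer("https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro"))
# https://hfviewer.com/deepseek-ai/DeepSeek-V4-Pro
```

Whether a graph already exists for a given model is decided by the site, so a rewritten link may still land on a "not ready" page.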

Granularity and exploration

The viewer exposes granularity levels: start at the high-level system shape (encoder–decoder, decoder-only, dual-tower CLIP, sparse MoE), then drill into traced sub-blocks and data paths. That slider is useful when you care whether a vision tower feeds a merger, how many decoder layers repeat, or where experts route.

Popular entry points on the site include gpt2 (classic decoder), t5-small (deeper encoder–decoder), openai/clip-vit-base-patch32 (dual encoder), google/vit-base-patch16-224, Qwen/Qwen3.5-4B, deepseek-ai/DeepSeek-V4-Pro (sparse MoE), and nvidia/parakeet-tdt-0.6b-v3 (Conformer speech).

Gemma 4 family compare

hfviewer.com/family/gemma-4 lines up the Gemma 4 lineup with synchronised pan, zoom, and granularity so you can compare variants side by side—useful when size classes differ but the narrative in a blog post refers to a specific block (Embedl links prose sections to graph regions for a text↔graph reading loop).

Embed graphs in Hugging Face READMEs

The model-card embed builder generates HTML in roughly ten seconds: paste owner/model, pick card style (standard summary or block granularity), copy HTML into README.md. Community models already showcase embedded cards (custom GPT-X2 stacks, MEGA-based small LMs, emotion classifiers, Pegasus-X summarisation, Gemma 4 fine-tunes, and others).

If a visualization is not ready yet, the embed page offers email notification when generation completes—then you copy the final widget HTML.

How graphs are built (high level)

HF Viewer derives structure from Hugging Face model metadata and PyTorch module layout. Embedl staff on Hacker News noted multiple passes over the HF config, sometimes including torch.export and recombination steps to make repeated layer classes readable in the graph—hybrid architectures (Mamba + attention, MoE) remain harder and community feedback has flagged occasional mis-labelling on complex stacks.

It visualises the implemented architecture, not every hyperparameter from the card (hidden size, layer count, tokenizer details may appear inconsistently). It does not replace reading the paper or source for training and numerics.

Who it is for

  • Developers comparing candidate open models before fine-tuning or quantisation
  • Authors who want an architecture graphic on the model card
  • Technical writers linking blog sections to live graph nodes
  • Teams evaluating Embedl’s edge deployment products after inspecting structure

Limitations

  • Public Hugging Face models only—private or local checkpoints are out of scope
  • Browser-side—very large or exotic graphs may be slow or ambiguous
  • Not a substitute for config files, weights inspection, or benchmark numbers
  • Complex hybrids may need manual verification (community reports on some Nemotron-style layouts)

Embedl context

Embedl (edge AI optimisation, quantisation, MLOps) positions HF Viewer as a community gift to Hugging Face users; the homepage cross-links embedl deploy, embedl hub, and optimised GenAI models for teams moving from exploration to edge deployment.

At a glance

| Question | Answer |
| --- | --- |
| What is it? | Interactive HF model architecture viewer |
| Cost? | Free web tool (+ Chrome extension) |
| Fastest entry? | Swap huggingface.co → hfviewer.com |
| Embed in README? | model-card-embed |
| Made by? | Embedl |

---


DeepSWE Benchmark: How Datacurve Separates Real Agentic Coding Ability

DeepSWE, released by Datacurve on 26 May 2026, is a long-horizon agentic coding benchmark built to show where frontier models actually diverge when public leaderboards make them look neck-and-neck—113 original tasks across 91 open-source repositories and five languages, with hand-written behavioural verifiers and no solutions lifted from public pull requests.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  P[Short behaviour-focused prompt] --> A[Coding agent in isolated repo]
  A --> PATCH[Multi-file patch]
  PATCH --> V[Hand-written verifier]
  V -->|pass| OK[Task solved]
  V -->|fail| NO[Regression or wrong behaviour]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class A agent
  class V hook
  class OK agent
Three models look equally capable on easy benchmarks but separate widely on harder long-horizon tasks

DeepSWE is meant to mirror day-to-day agent gaps that saturated leaderboards hide.

What Serena Ge announced

Datacurve CEO Serena Ge (@serenaa_ge) posted that DeepSWE is a new standard for agentic coding benchmarks: on many public leaderboards, top models cluster in a narrow band, but DeepSWE is designed to reflect how developers experience agents in day-to-day work—with a much wider spread between best and worst performers.

Primary materials: deepswe.datacurve.ai, the methodology blog, and the open benchmark repo datacurve-ai/deep-swe. Runs use Pier with mini-swe-agent on Modal sandboxes.

Short prompt flows into repo editing by a coding agent and behavioural verification by hand-written tests

Each task is an original change in a real repository, graded on observable behaviour not patch shape.

Four design bets vs older benchmarks

| Property | What DeepSWE does | Why it matters |
| --- | --- | --- |
| Contamination control | Tasks written from scratch; fixes are not copied from merged PRs and are not merged upstream | Tests problem-solving, not recall of a public patch |
| Diversity | 113 tasks, 91 repos, 5 languages (TypeScript, Go, Python, JavaScript, Rust) | Broader than SWE-bench Pro’s ~11 public repos |
| Real workload size | Shorter prompts (~2.2k chars mean) but ~5.5× more reference solution lines than SWE-bench Pro (~668 vs ~120) | Less prescriptive prompts, more engineering work per task |
| Verification quality | Hand-written tests for observable behaviour, not inherited PR test suites only | Datacurve reports 0.3% false positives vs 8.5% on SWE-bench Pro (audited sample) |

Leaderboard snapshot (mini-swe-agent harness)

All listed scores use the same agent harness so rankings reflect model differences, not Codex vs Claude Code scaffolding. Datacurve reports confidence intervals on pass rates; figures below are the point estimates and intervals shown on the public leaderboard.

| Model (config) | DeepSWE pass rate | Public SWE-bench Pro (reported) |
| --- | --- | --- |
| gpt-5.5 [xhigh] | 70% ± 4% | ~59% |
| gpt-5.4 [xhigh] | 56% ± 5% | ~58% |
| claude-opus-4.7 [max] | 54% ± 5% | ~64% (often ranked #1 on Pro) |
| claude-sonnet-4.6 [high] | 32% ± 4% | |
| gemini-3.5-flash | 28% ± 4% | |
| gpt-5.4-mini [xhigh] | 24% ± 4% | |
| kimi-k2.6 | 24% ± 4% | |
| claude-haiku-4.5 | 0% | ~39% |

On these models, Datacurve notes DeepSWE pass rates span roughly 70 percentage points from worst to best versus about 30 points on publicly reported SWE-bench Pro scores—matching the tweet’s claim that leaderboards can hide real-world gaps.

Efficiency: score is not the whole story

| Model | Median cost / trial | Median wall time | Median output tokens |
| --- | --- | --- | --- |
| gpt-5.5 | ~$5.80 | ~20 min | ~47k |
| gpt-5.4 | ~$3.30 | | |
| claude-opus-4.7 | Higher spend per run (blog charts) | | |

Datacurve’s analysis stresses that more tokens, longer runs, or higher dollar cost do not reliably mean more passes—teams choosing an agent should weigh accuracy, latency, and price together, not assume the loudest/longest run wins.

Task format and how to run it

Tasks follow the Harbor layout: task.toml, instruction.md, Docker environment, tests/ verifier, and a held-out solution/ for human review only. Example task themes on the site include PromQL label sorting, Yjs map conflict policies, Wasm trap coredumps, and XML diff/merge in Go.

git clone https://github.com/datacurve-ai/deep-swe
uv tool install datacurve-pier

export OPENAI_API_KEY=...
pier run -p deep-swe/tasks --agent mini-swe-agent --model openai/gpt-5.5

# Random 10-task subset
pier run -p deep-swe/tasks --agent mini-swe-agent --n-tasks 10 --sample-seed 0

Why SWE-bench Pro rankings can mislead

Datacurve’s qualitative audit highlights structural issues on PR-derived benchmarks—notably gold commits visible in .git history (Claude Opus sometimes recovers fixes via git show), tests that import private helpers the prompt never names, and prompts that tell agents not to write tests—which suppresses self-verification behaviour strong models use on DeepSWE. DeepSWE shallow-clones the base commit so there is no merged fix hash to read.

Reported verifier disagreement rates (LLM judge vs automated grader, sampled rollouts): SWE-bench Pro ~32% disagreement overall; DeepSWE ~1.4%. False negative rates were ~24% vs ~1.1% respectively in their audit—wide error bars on older benchmarks make small leaderboard deltas hard to trust.
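
The audit numbers are internally consistent if you read disagreement as roughly false positives plus false negatives. That decomposition is an assumption on my part, not something the article states; the sketch below just checks it against the quoted percentages.

```python
# Percentages quoted in this article (audited samples)
audit = {
    "SWE-bench Pro": {"false_positive": 8.5, "false_negative": 24.0, "disagreement": 32.0},
    "DeepSWE": {"false_positive": 0.3, "false_negative": 1.1, "disagreement": 1.4},
}


def combined_error(false_positive: float, false_negative: float) -> float:
    """Sum of the two error rates, assuming they do not overlap."""
    return false_positive + false_negative


for name, rates in audit.items():
    total = combined_error(rates["false_positive"], rates["false_negative"])
    print(f"{name}: FP + FN = {total:.1f}% vs reported ~{rates['disagreement']}%")
```

Both benchmarks line up to within rounding, which suggests the 32% and 1.4% headline figures are dominated by false negatives rather than false positives.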

Failure modes developers should know

  • Claude families — often miss one branch of multi-part prompts (“sync and async”, “line and block comments”).
  • GPT-5.x — Datacurve finds lower MISSED_REQUIREMENT rates; tends to implement prompts literally.
  • Cheating on Pro — Opus passes via reading gold history; GPT-5.x showed none in their sample.
  • Weaker models — may skip running existing tests entirely on hard tasks.

Limitations (from Datacurve)

  • Fixed mini-swe-agent harness—not native Claude Code / Codex CLI / Cursor workflows.
  • Open-source repos with ≥500 stars only—may not reflect private or long-tail codebases.
  • Five languages; C++, Java, and heavy refactor/localisation tasks under-represented.
  • Qualitative tags use an LLM analyzer—some verdicts will be wrong.

Who should care

  • Engineering leaders picking coding agents for production—not just benchmark leaderboard rank.
  • Model labs needing contamination-resistant, long-horizon evals.
  • Datacurve customers — the company sells curated coding data for frontier training; DeepSWE doubles as research marketing.

At a glance

| Question | Answer |
| --- | --- |
| What is DeepSWE? | 113-task agentic SWE benchmark from Datacurve |
| Top score (May 2026)? | gpt-5.5 ~70% with mini-swe-agent |
| Main claim? | Wider model separation than saturated public benchmarks |
| Run it? | deep-swe repo + pier + API keys |
| Source announcement | @serenaa_ge · deepswe.datacurve.ai |


Simi by Lamina Labs: Whiteboard Explainer Videos From Prompts and Documents

Lamina Labs builds Simi, an AI explainer studio that turns a text prompt or uploaded document into a whiteboard-style video in seconds—aimed at students, course creators, customer training, and EdTech products that need concepts explained visually, not as walls of text.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  IN[Prompt or PPT/PDF/Word/TXT/MD] --> SIMI[Simi generation]
  SIMI --> ANIM[Step-by-step whiteboard animation]
  ANIM --> MP4[Explainer MP4]
  MP4 --> USE[Students / L&D / EdTech apps]
  SDK[lamina-sdk] --> API[api.laminalabs.ai]
  API --> SIMI

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff

  class SIMI agent
  class ANIM agent
  class SDK hook
  class API hook
Lesson document and text prompt flow into Simi and become a step-by-step whiteboard explainer on screen

Simi accepts uploads or a short description and outputs a drawn explainer video instead of static slides.

What Lamina Labs is building

At laminalabs.ai, Lamina positions itself as the visualisation layer for AI-native EdTech: infrastructure that helps intelligent systems draw, explain, and teach. The consumer-facing product is Simi (“AI explainer studio”), marketed as the world’s fastest explainer video tool—drop a document or type an idea, get a clear whiteboard walkthrough.

The company is a Y Combinator Spring 2026 batch startup (YC profile), founded in 2025 and based in San Francisco with a two-person founding team: Kartikesh Mishra (MIT EECS BS ’24, MEng ’25) and Sudip Rokaya (MIT CS & Math, on leave). Founders offer “Talk to Founder” booking via the site and host the live app at app.laminalabs.ai/simi.

Naming note: laminalabs.ai (Simi / EdTech explainers) is unrelated to Lamini (LLM tuning at lamini.ai) and unrelated to uselamina.ai (e-commerce creative generation). This article covers Lamina Labs only.

Split comparison: flashy cinematic clip confuses learners versus numbered whiteboard strokes that build understanding

Lamina bets sequential drawing and pauses teach hard concepts better than glossy generative video.

How Simi is meant to feel

Lamina’s copy stresses pacing over production value: a rough line drawn in the right order should teach more than a glossy cinematic clip. Simi is described as drawing like a patient teacher—slow enough to follow, fast enough to stay engaged—with pauses as part of the pedagogy. Each stroke is framed as part of an argument (“because of this, therefore that”) rather than a finished illustration dropped on screen.

Example topics showcased on the homepage include recursion explained to a child, Netflix customer-support day-one training, and quantum tunnelling—signals that the product targets explanation-heavy STEM and onboarding content, not short-form social ads.

Inputs and outputs

| Input | Output |
| --- | --- |
| Short natural-language prompt | Whiteboard-style explainer video (MP4) |
| Uploaded PowerPoint, PDF, Word, TXT, or Markdown | Same output; the document is ingested as lesson source material |
| API prompt via lamina-sdk | Programmatic generation for agents and EdTech pipelines |

The on-site workflow is deliberately simple: describe what to explain → Simi generates the animation → watch in seconds. Lamina argues a one-minute explainer is easier to share and rewatch than a five-page PDF, with less room for misreading.

Developer API: lamina-sdk

Integrators use the async-first Python package lamina-sdk (MIT licence, Python ≥3.11). The client defaults to https://api.laminalabs.ai; authenticate with LAMINA_API_KEY or pass api_key to simi().

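import asyncio

from lamina import simi

async def main() -> None:
    # async with is only valid inside a coroutine, so the call is wrapped in main().
    # The key can also be supplied through the LAMINA_API_KEY environment variable.
    async with simi(api_key="lamina_live_your_key") as client:
        video = await client.generate(
            "Explain derivatives with a simple graph",
            duration=20,
        )
        await video.save("lesson.mp4")

asyncio.run(main())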

Additional patterns from the PyPI readme:

  • submit_async + stream_events for progress streaming
  • Callback style: onstream / oncompletion on jobs
  • Sync helpers: submit, generate, save
  • Dependencies: httpx, Pillow, websockets

Co-founder Sudip Rokaya’s public demos describe wiring Simi into agent stacks (for example Hermes Agent via Slack) so that a single API call produces a multi-minute whiteboard explainer without a video editor. That positions Simi as video generation infrastructure for EdTech platforms generating curriculum at scale, not only as a web UI.

EdTech positioning vs other video AI

| Approach | Typical output | Lamina’s contrast |
| --- | --- | --- |
| Cinematic / marketing AI video | Short clips, b-roll, ads | Not optimised for step-by-step teaching |
| Notebook-style study tools | Slides, audio overviews, slower generation | Lamina markets Simi for sub-minute turnaround (founder benchmarks vs NotebookLM are marketing claims; verify for your workload) |
| Manim / After Effects | Precise but labour-intensive | Simi trades manual timeline editing for prompt/document → video automation |
| Simi / Lamina | Sequential whiteboard strokes, explainer pacing | Built for “watch it being drawn” pedagogy and API-scale generation |

YC’s one-liner—“accurate visual explanations in seconds”—aligns with Lamina’s emphasis on correct explanatory visuals for learning, as opposed to templated or physically inconsistent generative video. Third-party databases sometimes reference an earlier internal name “Pictor”; the public product brand is Simi.

Who it is for

  • Students and self-learners turning lecture confusion into a rewatchable minute-long explainer
  • Course creators scaling lessons without hiring animators per concept
  • Customer education / L&D (onboarding flows like support training)
  • EdTech and agent builders embedding lamina-sdk so tutors, copilots, or curriculum bots emit video explanations automatically

Getting started

| Step | Where |
| --- | --- |
| Try the studio UI | app.laminalabs.ai/simi (“Try for Free Now” on homepage) |
| Book founder call | Cal.com link from laminalabs.ai |
| Integrate via API | pip install lamina-sdk → API key → api.laminalabs.ai |
| Company context | Y Combinator company page |

At a glance

| Question | Answer |
| --- | --- |
| What is Simi? | Prompt/document → whiteboard explainer video |
| Who makes it? | Lamina Labs (YC P26, the Spring 2026 batch; San Francisco) |
| How do developers integrate? | lamina-sdk → api.laminalabs.ai |
| What files can you upload? | PPT, PDF, Word, TXT, MD |
| Core design bet? | Sequential drawing and pacing beat cinematic AI for teaching |

OpenAI Secure MCP Tunnel: Private MCP Servers for ChatGPT, Codex, and the API

Secure MCP Tunnel lets teams keep Model Context Protocol (MCP) servers on private networks while ChatGPT, Codex, and the Responses API reach them through outbound-only HTTPS—no inbound firewall ports and no public MCP endpoint.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  P[ChatGPT / Codex / Responses API] --> E[OpenAI-hosted MCP tunnel endpoint]
  E --> CP[Control plane api.openai.com]
  CP --> TC[tunnel-client inside your network]
  TC --> MCP[Private MCP server]
  MCP --> DATA[Internal tools and data]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class P agent
  class E hook
  class TC hook
  class MCP agent
MCP server and tunnel-client stay inside the network; only outbound HTTPS reaches OpenAI; inbound from the internet is blocked

Secure MCP Tunnel avoids public MCP endpoints and inbound firewall rules by pulling work from inside your network.

What OpenAI Developers announced

On 27 May 2026, @OpenAIDevs posted that private MCP servers can stay inside your network while OpenAI products connect through outbound-only HTTPS, linking to the official Secure MCP Tunnel guide. Greg Brockman quoted the post as “bring-your-own MCP servers”; developers including Steven Heidel highlighted using the same path to connect the Responses API to local MCP servers.

Sourcing note: the X post (status 2059703536825565499) returned HTTP 403 to automated fetches in some environments, so the claims below follow the post as syndicated, OpenAI’s published documentation, and the tunnel-client repository.

Three OpenAI surfaces connect through one secure tunnel bridge to a single private MCP server

The same tunnel-backed MCP server can power ChatGPT connectors, Codex sessions, and Responses API tool calls.

The problem it solves

Remote MCP usually means a public server_url that OpenAI’s platform can call over the internet. That is a poor fit when the MCP server lives on a laptop, in a VPC, or behind corporate firewalls. Opening inbound ports or publishing an internal tool stack is often blocked by security review.
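To make the "public server_url" point concrete, here is a minimal sketch of the tool entry a Responses API request carries for a remote MCP server. The label and URL are placeholders, and the HTTPS check is this sketch's own guard rather than documented platform behaviour.

```python
import json


def remote_mcp_tool(label: str, server_url: str) -> dict:
    """Build a Responses API tool entry for a publicly reachable MCP server."""
    # Guard added for illustration: this sketch only accepts HTTPS URLs.
    if not server_url.startswith("https://"):
        raise ValueError("this sketch expects an HTTPS server_url")
    return {
        "type": "mcp",
        "server_label": label,
        "server_url": server_url,
        "require_approval": "always",
    }


# Placeholder values; OpenAI's platform must be able to reach this URL directly.
tool = remote_mcp_tool("internal_crm", "https://mcp.example.com/mcp")
print(json.dumps(tool, indent=2))
```

The URL in that entry has to resolve and accept connections from OpenAI's side, which is exactly what a laptop or VPC-only server cannot offer.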

Secure MCP Tunnel flips the direction: a customer-run agent, tunnel-client, inside your network initiates outbound HTTPS to OpenAI’s control plane, pulls queued MCP work, forwards JSON-RPC to the private server (stdio or HTTP), and posts responses back. The MCP server never needs a public listener.

Supported surfaces

| OpenAI surface | How it uses the tunnel |
| --- | --- |
| ChatGPT | Connectors can target a tunnel-backed private MCP server (create/verify connector while tunnel-client run is healthy) |
| Codex | Local or private MCP via tunnel; plugin/runtimes workflows documented in tunnel-client |
| Responses API | Remote MCP tool calls can reach private servers through the hosted tunnel endpoint |
| AgentKit | Listed alongside the above in the open-source client README as a supported consumer path |

Network and control-plane flow

| From | To | Purpose |
| --- | --- | --- |
| Host running tunnel-client | api.openai.com:443 (/v1/tunnel/*) | Default long-poll and response posting |
| Host running tunnel-client | mtls.api.openai.com:443 | Same paths when control-plane mTLS client certs are configured |
| Host running tunnel-client | Local MCP (stdio command or private HTTP URL) | Forward MCP JSON-RPC inside your boundary |

The client long-polls GET /v1/tunnel/{tunnel_id}/poll and returns work via POST /v1/tunnel/{tunnel_id}/response. On startup it may fetch tunnel metadata from GET /v1/tunnels/{tunnel_id} for operator visibility. Optional mTLS uses --control-plane.client-cert / --control-plane.client-key (or env vars); with the default API host, control-plane traffic automatically targets mtls.api.openai.com.
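The poll-and-respond cycle can be sketched as follows. This is not the tunnel-client implementation: the endpoint paths and the mTLS host come from the text above, but the work-item shape, the function names, and the behaviour for a non-default host are assumptions made for illustration. The transport is passed in as plain callables so the sketch runs without network access.

```python
from typing import Callable, Optional

DEFAULT_HOST = "api.openai.com"
MTLS_HOST = "mtls.api.openai.com"


def control_plane_host(configured: str = DEFAULT_HOST, use_mtls: bool = False) -> str:
    """Swap in the mTLS host only when the default API host is in use.

    Leaving a custom host untouched is an assumption of this sketch.
    """
    if use_mtls and configured == DEFAULT_HOST:
        return MTLS_HOST
    return configured


def poll_url(host: str, tunnel_id: str) -> str:
    return f"https://{host}/v1/tunnel/{tunnel_id}/poll"


def response_url(host: str, tunnel_id: str) -> str:
    return f"https://{host}/v1/tunnel/{tunnel_id}/response"


def pump_once(
    poll: Callable[[], Optional[dict]],
    forward: Callable[[dict], dict],
    respond: Callable[[dict], None],
) -> bool:
    """One cycle: pull a work item, forward it to the MCP server, post the reply.

    The work-item shape ({"id": ..., "request": ...}) is hypothetical.
    Returns False when the long-poll came back empty.
    """
    work = poll()
    if work is None:
        return False
    reply = forward(work["request"])
    respond({"id": work["id"], "response": reply})
    return True


# In-memory stand-ins for the control plane and the private MCP server.
pending = [{"id": "work-1", "request": {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}}]
posted = []


def fake_poll() -> Optional[dict]:
    return pending.pop(0) if pending else None


def fake_forward(request: dict) -> dict:
    return {"jsonrpc": "2.0", "id": request["id"], "result": {"tools": []}}


host = control_plane_host(use_mtls=True)
print(poll_url(host, "tunnel_0123456789abcdef0123456789abcdef"))
print(pump_once(fake_poll, fake_forward, posted.append))  # work item handled
print(pump_once(fake_poll, fake_forward, posted.append))  # queue now empty
```

Every request in this loop is initiated from inside the network, which is why no inbound listener is needed.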

When to use it

  • MCP server is on-premises, on a developer machine, or in a private VPC.
  • Security will not approve inbound internet access to the MCP process.
  • Outbound HTTPS to OpenAI (api.openai.com:443, or mTLS host) is allowed from the tunnel host.
  • You need ChatGPT, Codex, or API agents to call the same internal tools without exposing them publicly.
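The outbound-HTTPS condition in the list above can be checked from the intended tunnel host before anything is installed. A minimal sketch using only the standard library; the helper name is made up, and a successful TCP connection does not prove that a TLS-inspecting proxy will let tunnel traffic through.

```python
import socket
from typing import Callable


def can_reach(
    host: str,
    port: int = 443,
    timeout: float = 5.0,
    connect: Callable = socket.create_connection,
) -> bool:
    """Return True if an outbound TCP connection to host:port can be opened."""
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
    conn.close()
    return True


# Typical use on the tunnel host (performs a real network call):
#   can_reach("api.openai.com")
#   can_reach("mtls.api.openai.com")
```

The connect parameter exists so the check can be exercised with a stub; the quickstart's tunnel-client doctor command remains the fuller diagnostic once the binary is in place.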

Quickstart (binary path)

OpenAI documents a binary-first path: download tunnel-client from Platform → Tunnels, create a tunnel (UI or tunnel-client admin tunnels create with an admin key), then run a profile against your local MCP server.

tunnel-client help quickstart

tunnel-client init \
  --sample sample_mcp_stdio_local \
  --profile local-stdio \
  --tunnel-id tunnel_0123456789abcdef0123456789abcdef \
  --mcp-command "python /path/to/server.py"

tunnel-client doctor --profile local-stdio --explain
tunnel-client run --profile local-stdio

For an HTTP MCP server inside the network, use an HTTP-oriented sample profile instead of stdio. Keep the daemon running while ChatGPT discovers the connector or while API/Codex sessions issue MCP calls. Health endpoints: /healthz, /readyz, /metrics, plus a local admin UI at /ui.
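A supervisor script can probe those health endpoints to decide whether the daemon is ready before a connector is created. The sketch below is hedged in two places: the base URL is a placeholder because the local listen address is not given here, and it assumes a healthy endpoint answers with HTTP 200.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

HEALTH_PATHS = ("/healthz", "/readyz")


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # nothing listening, DNS failure, or timeout


def check_daemon(base_url: str, fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each health path to True when it returned 200 (assumed healthy)."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in HEALTH_PATHS}


# Placeholder address; substitute whatever your tunnel-client profile listens on.
#   check_daemon("http://127.0.0.1:8080")
```

Gating connector creation on a True result for /readyz avoids verifying a connector against a daemon that has started but is not yet polling.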

Keys, permissions, and workspace scope

| Credential | Typical use |
| --- | --- |
| CONTROL_PLANE_TUNNEL_ID | Tunnel resource id from Tunnels management or admin CLI |
| CONTROL_PLANE_API_KEY | Runtime API key for doctor and run (long-lived daemon) |
| OPENAI_ADMIN_KEY | Admin-only tunnel CRUD; not for the polling daemon |

Runtime principals need Tunnels Read + Use; managers who create tunnels need Manage as well. If a tunnel does not appear in ChatGPT, the docs recommend checking the workspace association and the connector operator’s Tunnels permissions.
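The split between runtime and admin credentials lends itself to a small pre-launch check in whatever script starts the daemon. A sketch under stated assumptions: the variable names come from the table above, the id pattern is inferred only from the quickstart's example id, and the function itself is hypothetical rather than part of tunnel-client.

```python
import re
from typing import List, Mapping

# Inferred from the example id in the quickstart; treat the exact shape as an assumption.
TUNNEL_ID_PATTERN = re.compile(r"^tunnel_[0-9a-f]{32}$")


def runtime_env_problems(env: Mapping[str, str]) -> List[str]:
    """List reasons this environment is not suitable for the polling daemon."""
    problems = []
    tunnel_id = env.get("CONTROL_PLANE_TUNNEL_ID", "")
    if not tunnel_id:
        problems.append("CONTROL_PLANE_TUNNEL_ID is not set")
    elif not TUNNEL_ID_PATTERN.match(tunnel_id):
        problems.append("CONTROL_PLANE_TUNNEL_ID does not look like tunnel_<32 hex chars>")
    if not env.get("CONTROL_PLANE_API_KEY"):
        problems.append("CONTROL_PLANE_API_KEY is not set")
    if env.get("OPENAI_ADMIN_KEY"):
        # The admin key is for tunnel CRUD only; keep it away from the long-lived daemon.
        problems.append("OPENAI_ADMIN_KEY is set in the daemon environment")
    return problems


example_env = {
    "CONTROL_PLANE_TUNNEL_ID": "tunnel_0123456789abcdef0123456789abcdef",
    "CONTROL_PLANE_API_KEY": "placeholder-runtime-key",
}
print(runtime_env_problems(example_env))
```

In practice the mapping passed in would be os.environ, and a non-empty result would stop the launch.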

Harpoon: scoped private HTTP (not a full proxy)

The tunnel client embeds Harpoon, an MCP server that exposes allowlisted HTTP targets by label so agent flows can call a small set of private REST endpoints through the tunnel. OpenAI stresses this is not a general-purpose proxy—callers cannot pick arbitrary hosts; methods and targets are customer-configured with bounded request/response limits.
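The allowlist-by-label idea can be illustrated independently of Harpoon's real configuration format, which is not reproduced here. In this sketch every name (Target, ALLOWLIST, resolve), the example host, and the size limit are hypothetical; the point is only that the caller supplies a label and a path, never a host.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Target:
    base_url: str              # fixed by the operator, never by the caller
    methods: FrozenSet[str]    # HTTP methods permitted for this label
    max_body_bytes: int        # upper bound on the request body


# Operator-defined allowlist; the host is a placeholder.
ALLOWLIST: Dict[str, Target] = {
    "billing": Target("https://billing.internal.example", frozenset({"GET"}), 64 * 1024),
}


def resolve(label: str, method: str, path: str, body: bytes = b"") -> str:
    """Return the full URL for an allowlisted call, or raise if it is not permitted."""
    target = ALLOWLIST.get(label)
    if target is None:
        raise PermissionError(f"unknown target label: {label}")
    if method.upper() not in target.methods:
        raise PermissionError(f"{method.upper()} is not allowed for {label}")
    if len(body) > target.max_body_bytes:
        raise ValueError("request body exceeds the configured limit")
    if not path.startswith("/"):
        # Rejecting relative paths keeps the host portion fixed to base_url.
        raise ValueError("path must start with '/'")
    return target.base_url + path


print(resolve("billing", "GET", "/v1/invoices"))
```

Because the host is looked up from the label rather than parsed from caller input, a request for an unlisted service fails before any network call is made.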

Security and trust

Outbound-only networking reduces exposure, but you must trust the MCP server you attach. OpenAI’s MCP guidance warns that malicious remote servers can exfiltrate anything that enters the model context. Prefer official servers operated by the service provider; for private tunnels, treat tunnel-client hosts like production infrastructure: patch the binary, rotate runtime keys, scope tunnels to the right workspace, and review tools exposed by your MCP implementation.

Public MCP vs Secure MCP Tunnel

| Approach | MCP server exposure | Firewall | Best for |
| --- | --- | --- | --- |
| Remote server_url | Internet-reachable HTTPS endpoint | Often requires inbound or public LB | Vendor-hosted MCP (e.g. official Stripe MCP) |
| Secure MCP Tunnel | Stays private; only tunnel-client egress | Outbound 443 only | Internal CRM, DB wrappers, localhost dev servers |

At a glance

| Question | Answer |
| --- | --- |
| What ships? | tunnel-client agent + OpenAI-hosted tunnel control plane |
| Who connects? | ChatGPT, Codex, Responses API (and AgentKit per README) |
| Inbound ports required? | No; outbound HTTPS from your network |
| How is work delivered? | Long-poll /v1/tunnel/{id}/poll, respond on /response |
| Where to start? | Secure MCP Tunnel guide + tunnel-client help quickstart |

---

References