Categories
News

Loop Engineering: Design Coding-Agent Systems Instead of Prompting Every Turn

Loop engineering means you stop being the person who types every prompt to a coding agent — and start designing a small system that discovers work, delegates it, checks it, remembers progress, and repeats. The leverage moves from prompt craft to loop design: six primitives that now ship inside tools like Claude Code and the Codex app instead of bespoke bash you maintain forever.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  subgraph Stack["Three layers"]
    H[Harness engineering] --> L[Loop engineering]
    L --> O[Orchestration layer]
  end
  H -->|one agent runtime| T[Tools memory sandbox]
  L -->|schedule + verify| P[Six primitives]
  O -->|fleet + PR lifecycle| R[Reactions state machine]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class H,L,O agent
  class T,P,R hook

Where the conversation landed in 2026

The shift is no longer niche. Boris Cherny, who leads Claude Code at Anthropic, described it on the Acquired podcast as: “I don’t prompt Claude anymore. I have loops running that prompt Claude and figure out what to do. My job is to write loops.” Peter Steinberger put the same idea on X: “You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.” Both are saying the human job moved up one floor — from typing each turn to designing feedback systems.

That floor has three names in practice. Harness engineering is the runtime around one agent (tools, memory, permissions). Loop engineering is the harness that runs on a schedule, spawns helpers, and feeds itself from disk. Orchestration is the layer above when you need fleets of agents across worktrees, PRs, and CI — with automatic routing of failures back to the right session.

The universal five-stage cycle

Five stages of a coding agent loop: discover plan execute verify iterate
Every serious loop — single agent or fleet — runs the same cycle until a verifiable stop condition holds.
StageWhat happensTypical tooling
DiscoverFind work: CI failures, issues, diffs, inboxAutomations, /loop, triage skills
PlanBreak goal into steps with constraintsSkills, VISION.md, spec sub-agent
ExecuteEdit code, run tools, open PRsWorktrees, MCP connectors
VerifyPush against objective signals — not model opinionTests, lint, /goal evaluator, critic sub-agent
IterateFix gaps and loop againStop hooks, reactions, state file

A prompt gives instructions for one turn. A loop gives a job: discover → plan → execute → verify → iterate until done. You set the goal; the loop runs itself.

Open loops vs closed loops

Open loopClosed loop
NatureExploratory; wide search spaceBounded path you designed
RiskToken burn; “slop machine” without gatesCheaper; predictable
NeedsLarge budget + strong evaluatorsClear goal, defined steps, stop condition
Start here?Research spikes, benchmarksProduction coding, triage, migrations

Closed loops need five ingredients on disk: goal (precise done), context (VISION.md, ARCHITECTURE.md, RULES.md), action (scoped tools), feedback (tests, lint, structured errors), and a stop condition (/goal text, Stop hook, or orchestrator brief). Without a quality gate, AI drifts; with one, it improves.

Single-agent loop vs fleet loop

Single-agent loopFleet loop
ShapeOne brain runs discover→verify end-to-endOrchestrator splits work across specialists
Good forFocused refactors, /goal migrationsLarge features, parallel PRs, research→build→QA chains
Token profile~50K–200K tokens per medium coding task~500K–2M+ when orchestrator + 3+ specialists run
Example splitExplore → implement → verify sub-agentsResearch specialist → engineering specialist → QA specialist, each with its own loop

What changed in agentic development

For roughly two years, “good AI coding” meant writing strong prompts and feeding enough context each turn. You typed, read, typed again — the agent was a power tool and you held the handle every step.

Loop engineering is the next layer: a recursive goal where you define purpose and done, and the system iterates until a verifiable condition holds. You design once; the loop pokes agents on a schedule or across turns. This sits one floor above agent harness engineering (the environment one agent runs in) and the factory model (the system that builds software) — same family of ideas, but the harness now runs on a timer, spawns helpers, and feeds itself from disk-based memory.

The six primitives every loop needs

Six building blocks of a coding agent loop
Five action primitives plus persistent state — the shape is the same across major coding-agent products.
#PrimitiveJob in the loopWithout it
1AutomationsScheduled discovery and triageYou manually check CI, issues, and diffs
2WorktreesIsolate parallel agent checkoutsTwo agents overwrite the same files
3SkillsProject knowledge on disk (SKILL.md)Agent re-guesses conventions every run
4Connectors (MCP)Issues, DB, Slack, staging APIsAgent only sees the filesystem
5Sub-agentsSeparate maker and checker rolesOne model grades its own homework
6State / memoryMarkdown, Linear board, AGENTS.mdModel forgets between runs; loop restarts blind

The agent forgets; the repo does not. Long-running loops depend on external state — not context window — to remember what was tried, what passed, and what is next. Common context files beyond SKILL.md: VISION.md (what success looks like), ARCHITECTURE.md (stack and layout), RULES.md (forbidden actions), GUARDRAILS.md (always-on checklists), and AGENTS.md (repo map for agents).

Codex app vs Claude Code — same shape, different names

PrimitiveCodex appClaude Code
AutomationsAutomations tab: project, prompt, cadence, local or worktree env; Triage inbox; thread vs standalone runs/loop, Desktop scheduled tasks, Cloud Routines (/schedule), hooks, GitHub Actions
WorktreesBuilt-in per threadgit worktree, --worktree, isolation: worktree on subagents
SkillsSKILL.md, invoke with $name or /skillsSame SKILL.md folder format; bundled /loop, /code-review
ConnectorsMCP connectors + pluginsMCP servers + plugins; routine connectors on claude.ai
Sub-agentsTOML in .codex/agents/.claude/agents/, agent teams
StateMarkdown / Linear via connector; thread memoryAGENTS.md, progress files, prd.json-style task queues

Once you see the shared shape, the debate shifts from “which tool” to “which loop design still works in either seat.”

1. Automations — the heartbeat

Automations turn a one-off agent run into a loop. In the Codex app you configure project, prompt, schedule, and environment (local checkout or background worktree). Runs with findings land in a Triage inbox; empty runs archive themselves. Internal uses include daily issue triage, CI failure summaries, commit briefings, and regression hunts. Automations can call $skill-name so recurring logic stays maintainable.

Claude Code reaches the same outcome via /loop (interval reruns), cron scheduling, lifecycle hooks, Desktop scheduled tasks (persistent while app is open), Cloud Routines (runs when laptop is closed), or GitHub Actions for headless runs.

Interactive pick: /goal vs /loop vs Stop hooks

MechanismNext turn starts when…Stops when…Best for
/goal (Claude)Previous turn finishesSeparate evaluator model confirms condition (reads transcript only)Migrations, refactors, “all tests green”
/goal (Codex)Thread idle after turnEvidence in thread supports completion; pause/resume/clear/budgetMulti-hour tuning, benchmarks, long refactors
/loopTime interval elapsesYou stop it or agent decides donePolling deploys, periodic summaries, PR babysitting
Stop hookPrevious turn finishesYour script, prompt hook, or agent hook decidesRalph-style loops, org-wide completion rules
# Claude Code — run until tests and lint are clean (v2.1.139+)
/goal all tests in test/auth pass and the lint step is clean

# Check spend and evaluator reasoning
/goal

# Stop early
/goal clear

# Headless single invocation
claude -p "/goal CHANGELOG.md has an entry for every PR merged this week"

# Codex — long-running performance goal (cookbook pattern)
/goal Reduce p95 checkout latency below 120 ms, verified by the checkout benchmark,
while keeping the correctness suite green. If blocked, stop with evidence.

/goal on Claude Code starts a turn immediately; after each turn Haiku (by default) judges yes/no from the transcript — it does not run tools. Codex /goal is thread-scoped with explicit budget accounting and pause/resume. Pair either with auto mode so each turn skips per-tool confirmations.

2. Worktrees — parallel without collisions

Two agents editing the same file is the same failure mode as two engineers on one branch without coordination. A git worktree is a separate working directory on its own branch, sharing history but not files. Codex threads use worktrees natively; Claude Code offers --worktree sessions and isolation: worktree on subagents that clean up after themselves.

Worktrees remove mechanical collision; your review bandwidth still caps how many parallel agents you can actually supervise.

3. Skills — stop paying intent debt every session

Agents start cold. Every missing convention becomes a confident guess — intent debt. A skill is intent written outside the chat: a folder with SKILL.md, optional scripts, references, and assets. Both Codex and Claude Code load skills when you invoke $name or when the task matches a tight, boring description (clever descriptions match too often).

# Example skill layout
my-project-skill/
  SKILL.md          # conventions, build steps, forbidden patterns
  scripts/
  references/

Skill vs plugin: the skill is the authoring format; a plugin bundles skills and connectors for teammates to install once.

4. Connectors — act in your real environment

MCP connectors let the loop read Linear/Jira, query databases, hit staging APIs, and post to Slack. That is the difference between “here is the fix” and “open the PR, link the ticket, ping the channel when CI is green.” Plugins package connectors with skills so onboarding is one install, not tribal memory.

Feedback signals that keep loops honest

Hierarchy of agent loop feedback signals from tests to self-critique
A loop with nothing to push against is just the agent agreeing with itself — layer deterministic, perceptual, and critic signals.
Signal typeExamplesStrength
Deterministic oraclesCI, unit tests, type checks, linters, git diff, scalar metrics (e.g. benchmark p95)Strongest — pass/fail without model judgment
Perceptual / visualPlaywright, browser MCP tools, layout screenshotsMedium — catches UI regressions code tests miss
Critic sub-agentsSeparate reviewer agent; forces retry or stopMedium — judgment, but not the worker context
Persistent contextGUARDRAILS.md, skills, checklists loaded every runAlways-on oracle
LLM self-critique only“Does this look good?” from same modelWeakest — rationalises its own mistakes

Strongest systems stack multiple signal types: deterministic for reliability, visual/critic for judgment, human gates on high-stakes merges. Signals must route back automatically — full logs, diffs, scores — without you copy-pasting CI output each turn.

5. Sub-agents — maker vs checker

Maker agent and checker agent split in a coding loop
The highest-leverage split: implement in one agent, verify in another — including /goal’s separate done-evaluator.

The model that wrote the code is too lenient grading itself. A second agent — different instructions, sometimes a different model — catches rationalised mistakes. Typical trio: explore, implement, verify against spec. In fleet setups, a validator agent reports truth without fixing — failures loop back to the builder.

# Codex — custom subagent (simplified .codex/agents/security-reviewer.toml)
name = "security-reviewer"
description = "Read-only security pass on diffs"
instructions = "Find auth, injection, and secret-leak risks. No edits."
model = "strong"
reasoning_effort = "high"

Sub-agents cost extra tokens (each runs its own model + tools). Spend them where a second opinion unlocks unattended runs — the only reason you can walk away from a loop.

Orchestration — when one loop is not enough

Single-session /goal loops solve “finish this migration without me re-prompting.” Fleet-scale work needs an orchestration layer: deterministic plumbing plus an orchestrator agent for judgment.

LayerJobExamples
Deterministic plumbingRoute environmental feedback automaticallyCI fail → inject logs into worker session; PR conflict → notify right agent; lifecycle state machine (working → ci_failed → review_pending → merged)
Orchestrator agentDecompose goals, write briefs, batch parallel workResearch agent → spec → tracking issue → N workers in isolated worktrees
Human gatesVision, acceptance, high-risk mergesTriage inbox, PR approval — optimise human time, not remove humans

Open-source reference implementations like Agent Orchestrator (npm install -g @aoagents/ao) ship reactions engines, worktree isolation, and orchestrator prompts out of the box. The pattern: inner agents execute in bounded loops; outer orchestrator coordinates; environmental signals keep loops honest; you stay on vision and judgment.

Walkthrough: one morning triage loop

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
sequenceDiagram
  participant Auto as Morning automation
  participant Skill as Triage skill
  participant State as STATE.md
  participant WT as Worktree
  participant Maker as Fix sub-agent
  participant Check as Review sub-agent
  participant MCP as Connectors

  Auto->>Skill: Run on schedule
  Skill->>State: Write CI failures + issues
  loop Each actionable item
    Auto->>WT: Open isolated checkout
    WT->>Maker: Draft fix
    Maker->>Check: Submit diff
    Check-->>Maker: Approve or reject
    Maker->>MCP: Open PR + update ticket
  end
  Auto->>State: Log done / blocked for human inbox
  • 06:00 — Automation fires; triage skill reads yesterday’s CI, open issues, recent commits.
  • Findings — Written to STATE.md or a Linear board (memory outside the chat).
  • Per item — New worktree → maker sub-agent drafts fix → checker sub-agent runs against project skills + tests.
  • Ship — Connectors open PR and update tickets; blocked items land in your inbox.
  • Tomorrow — State file tells the loop what was tried, passed, or still open.

You designed this once. You did not prompt each step — that is the whole point.

Prompt engineer vs loop engineer

Prompt engineerLoop engineer
Crafts better instructions per turnDesigns feedback cycles and stop conditions
Linguistic skillSystems / software engineering skill
Better single outputReliable verified outcomes across runs
You review manually each timeSystem self-corrects against oracles
You are the feedback loopThe loop is the feedback loop
“Write me a function”“Write → test → fix until green”

Self-check: is your loop healthy?

QuestionHealthy loopLeaky loop
What proves “done”?Tests, lint, measurable condition in /goalAgent says “looks good”
Where does memory live?Repo file or issue trackerOnly in chat context
Who verifies?Separate sub-agent or evaluator modelSame agent that wrote code
What pushes back?Layered oracles (CI + critic + human gate)Self-critique only
Parallelism?One worktree per agentShared checkout
Token budget?Turn cap in condition or manual clearOpen-ended overnight /goal
Your role?Review merged outcomes you understandPress go and hope

What loops do not remove — three sharper risks

Verification stays human

An unattended loop is also an unattended mistake machine. Even with a verifier sub-agent, “done” is a claim, not proof. Ship code you confirmed works — especially when diff sizes balloon because agents touch more files than necessary.

Comprehension debt accelerates

The faster the loop ships code you did not write, the wider the gap between what exists and what you understand. Read the reasoning, skim the diff, trace the decision log — or the loop makes the debt grow faster, not slower.

Cognitive surrender

When automation feels smooth, it is tempting to stop having opinions. Loop design with judgement keeps you the engineer; loop design to avoid thinking is the same UI with opposite outcomes. Two teams can run identical loops — one moves faster on work they deeply understand; the other outsources understanding entirely. The loop cannot tell the difference. You can.

Parallel pattern: scheduled content factories

The same week loop engineering went mainstream for coding, creators published parallel “factory” playbooks for media. @0x_fokki’s X Article I Built an AI Animation Factory That Runs 24/7 is not a coding-agent harness — Claude is used as a scriptwriter, not a repo editor — but it shows the same design move: stop hand-driving each step, design a pipeline that runs on a schedule with human approval gates.

Coding loop and content factory share the same scheduled pipeline shape
Same loop instinct in two domains — you design the system and the gates, not every intermediate prompt.

Fokki’s pipeline chains six tools end-to-end:

Claude → Midjourney → Runway → ElevenLabs → Suno → Make
script → frames → motion → voice → music → publish

One Make scenario runs Monday and Thursday at 08:00: pull scripts from Google Drive, batch Midjourney scene prompts, download frames, send dialogue to ElevenLabs, pair images with Runway motion clips, assemble in a CapCut template, upload to YouTube with generated metadata, clip a 30-second X preview, post Patreon early access, and ping Telegram on completion. A separate on-demand webhook turns client briefs into finished explainers in shared Drive — quoted turnaround ~6 hours after a one-time ~5-hour setup.

Four SKUs share the pipeline: animated story series (6–10 min), brand explainers (60–90 sec), motion comics, and children’s bedtime channels. The human job is narrow: pick the story, pick the style, approve the output — roughly four hours of direction for a “24/7” factory, per the author.

Loop-engineering primitiveFokki factory analogueKey difference
AutomationsMake.com schedule + webhookNo /goal or hooks — cron-style triggers only
Skills / context on diskReusable Midjourney character sheets, CapCut templates, voice cast notesCreative consistency prompts, not SKILL.md
Sub-agent splitTool specialization per stage (script vs frames vs motion)No verifier sub-agent — human approves final cut
ConnectorsDrive, YouTube, Patreon, Telegram APIsDistribution stack, not MCP issue trackers
Feedback signalViews, RPM, client acceptanceBusiness metrics — not CI, lint, or test gates
State / memoryOrganised Drive folders per episodeAsset library, not AGENTS.md

What transfers to coding loops

  • Scheduled heartbeat — the factory does not wait for you to open a chat; neither should triage or CI-repair loops.
  • Stage-specialised tools — one brain trying to script, illustrate, animate, and score is the creative version of one agent grading its own code.
  • Performance direction in prompts — Fokki writes ElevenLabs stage direction (pauses, volume drops), not raw dialogue paste; coding loops need equally explicit done conditions in /goal text.
  • Human gate on output — “approve the episode” maps to Triage inbox review and PR merge — optimise human time, do not remove judgment.
  • Setup once, run indefinitely — the Make scenario is the media equivalent of wiring automations + skills once, then letting the loop compound.

Treat revenue figures in social factory posts as illustrative, not audited benchmarks. The architectural lesson is stable: factories — code or content — are designed loops with explicit stages, schedules, and gates. Coding loop engineering just demands harder oracles (tests, type checks, diffs) because “shipped” is easier to fake than “sounds convincing.”

Token economics and balance

PatternApproximate token loadMitigation
Single-agent medium coding loop50K–200K per runTurn caps in /goal; cheaper model for explore/review
Fleet (orchestrator + 3 specialists)500K–2M+ per cycleBatch only parallelisable work; stuck detection
Scheduled daily automationMillions per week if always-onArchive empty runs; scope skills tightly
Sub-agents + /goal evaluatorMultiplicative per child sessionSpend sub-agents on high-risk paths only

Loops are not free — patterns diverge wildly if you are “token rich” vs “token poor.” Direct prompting still matters for ambiguity and architecture. Loops handle repetition; you handle judgement. The leverage point moved — it did not disappear.

Performance summary

DimensionPrompt eraLoop era
Your jobWrite each turnDesign discover → plan → execute → verify → remember
Core cycleAsk → answerFive stages until verifiable done
PrimitivesContext + prompt6 shared building blocks (both major tools)
Done signalYou decide to stop/goal evaluator, Stop hook, or environmental oracles
ScaleOne threadWorktrees + sub-agents + orchestration layer
FeedbackYour eyesLayered oracles — not self-critique alone
KnowledgeRe-explained each sessionSkills + VISION.md / AGENTS.md compound
Risk profileSlower, more oversightFaster, higher verification + comprehension debt
Bottom lineBuild the loop — stay the engineer who reviews what ships

Research supplement

The following documentation pages from the official Claude Code docs provide additional technical depth beyond the article's reference links:

  • Scheduled Tasks (/loop): The Scheduled Tasks reference details how /loop works alongside cloud Routines and Desktop scheduled tasks, including the full comparison table of scheduling options, jitter behaviour, seven-day expiry, and the loop.md customisation mechanism. Notably, dynamic /loop schedules can use the Monitor tool internally to stream background process output, avoiding polling entirely.
  • Agent Loop Architecture: The Agent SDK: How the agent loop works page documents the full turn-and-message lifecycle, context window management, automatic compaction, and how max_turns / maxBudgetUsd bounds apply. It also explains how subagents start with a fresh conversation context, which has direct implications for keeping loop context efficient over long runs.

Key technical detail not in the primary reference links: The /goal command is implemented as a session-scoped prompt-based Stop hook. This means developers who need evaluation logic beyond a short text condition (for example, running an actual script to verify state) can write a custom Stop hook instead — which gives them the same turn-by-turn evaluation model with full scripting power.

---

References

Categories
News

Anthropic Doubles Claude Cowork 5-Hour Limits Through July 2026

Anthropic doubled Claude Cowork’s five-hour session rate limits for Pro, Max, and Team subscribers from 5 June through 5 July 2026, leaving weekly caps and the shared quota across Claude products unchanged.

FieldDetail
DateAnnounced 5 June 2026; promotion through 5 July 2026
VendorAnthropic
ProductClaude Cowork (desktop knowledge-work agent)
AvailabilityClaude Pro, Max, and Team paid plans; Cowork only—not Claude Code or chat-specific boosts
Pricing / limits2× five-hour rolling session allowance; weekly usage cap static; quota shared with Claude.ai and Claude Code

What changed

  • Boris Cherny, who leads Claude Code at Anthropic, announced the promotion on 5 June 2026 via social post—no dedicated article appeared on the Anthropic newsroom index by 9 June 2026.
  • Claude Cowork five-hour rolling session limits are doubled for approximately one month, ending 5 July 2026.
  • Eligible plans: Claude Pro, Claude Max, and Claude Team.
  • The change applies to five-hour rate-limit windows only—Anthropic’s weekly usage cap is unchanged.
  • Claude Code and Claude.ai retain standard session limits; the promotion is Cowork-specific.
  • Subscription quota remains a shared pool across Claude surfaces—heavier Cowork bursts can still exhaust the weekly budget faster.

Why it matters for engineers

Anthropic meters paid plans with two leaky buckets: a five-hour rolling session window for burst fairness and a weekly cap for cost control. Doubling only the first bucket optimises long desktop agent runs—folder reorganisation, batch report generation, scheduled digests—without raising Anthropic’s weekly compute exposure. Teams scheduling Cowork jobs should treat the promotion as session headroom, not unlimited capacity.

Cowork is not the Claude API. It runs in the desktop app with filesystem and Office integration, autonomous loops, and user approval gates—ideal for knowledge-worker delegation, unsuitable for production services. Engineers should keep CI and production agents on API metering while pilots use Cowork inside the promo window for deferred “messy folder” projects Cherny highlighted.

Unified quota across Cowork, Claude Code, and web chat means platform leads need allocation policy. A seat running heavy Code sessions the same week as a doubled Cowork migration may hit the unchanged weekly ceiling before the session window resets. Monitor Settings → Usage for both progress bars before kicking off multi-hour agent tasks.

Enterprise admins already manage Cowork feature access and org spend caps separately from consumer tiers. Communicate the 5 July revert date so programme managers do not assume permanent 2× session limits in capacity plans.

Doubled five-hour Cowork usage window for Pro Max and Team plans

Anthropic doubled the five-hour Cowork usage bucket for eligible paid plans from 5 June through 5 July 2026 whilst leaving weekly caps unchanged.

Limit windows over the promotion

flowchart TB
  START["5 Jun 2026 promo starts"]
  SESSION["Five-hour rolling window resets continuously"]
  DOUBLE["Cowork session allowance 2x"]
  WEEKLY["Weekly cap unchanged"]
  SHARED["Shared pool: Cowork chat and Code"]
  END["5 Jul 2026 promo ends"]
  START --> DOUBLE
  DOUBLE --> SESSION
  SESSION --> SHARED
  SHARED --> WEEKLY
  WEEKLY --> END
  classDef agent fill:#8B0000,color:#fff
  classDef tool fill:#189AB4,color:#fff
  class DOUBLE agent
  class WEEKLY tool

Timeline view: session windows roll continuously and temporarily widen for Cowork; the weekly ceiling and cross-product pool stay fixed.

Research supplement

Web search and page fetch tools were not available during this session. No additional reputable sources beyond those provided by the author could be verified. The sections above draw exclusively on the article text and the three reference URLs supplied (claude.com/product/cowork, support.anthropic.com/en/articles/9797557-usage-limit-best-practices, claude.com/pricing).

References

Categories
News

Microsoft 2026 Work Trend Index: How Frontier Firms Orchestrate Human-Agent Teams

Microsoft’s 2026 Work Trend Index gives engineering leaders a vocabulary for human–agent collaboration and ships Copilot Cowork mobile, plugins, and Agent 365 so Frontier Firms can orchestrate work across Microsoft and third-party systems.

FieldDetail
Date5 May 2026 (report and product wave); third-party Cowork plugins from 12 May 2026
VendorMicrosoft
Product2026 Work Trend Index; Microsoft 365 Copilot; Copilot Cowork; Microsoft Agent 365
AvailabilityWTI report on WorkLab; Cowork on iOS and Android; native Fabric and Dynamics 365 plugins GA; federated connectors GA (HubSpot, LSEG, Moody’s, Notion)
Pricing / limitsReport is free; Copilot stack via existing M365 Copilot and E7 SKUs—no new price point in this release

What changed

  • Microsoft named four collaboration patterns—Author, Editor, Director, and Orchestrator—and argued leaders must match workstreams to the right pattern rather than defaulting every process to multi-agent orchestration.
  • The 2026 Work Trend Index analysed trillions of anonymised Microsoft 365 signals and surveyed 20,000 AI-using knowledge workers across ten countries (February–April 2026).
  • 49% of sampled Copilot chats support cognitive work; 58% of AI users produce work they could not a year ago, rising to 80% among Frontier Professionals.
  • Microsoft described a Transformation Paradox: 65% fear falling behind without AI, yet 45% prefer current goals over redesigning work, and only 13% feel rewarded for AI-driven reinvention.
  • Organisational factors—culture, manager support, talent practices—account for more than twice the reported AI impact of individual mindset (67% vs 32%).
  • Respondents map to five readiness zones: Frontier (19%), Blocked Agency (10%), Unclaimed Capacity (5%), Stalled (16%), and Emergent (50%).
  • Copilot Cowork Mobile launched on iOS and Android; native plugins for Dynamics 365 and Fabric are GA, with partner plugins (LSEG, Miro, monday.com, S&P Global Energy) rolling out.
  • Custom plugins let organisations codify internal workflows; federated Copilot connectors are GA in Researcher and Microsoft 365 Copilot Chat.
  • Microsoft Agent 365 is the control plane for governing, observing, and securing agents at scale, including visibility into local agents.

Why it matters for engineers

Platform teams often ship agents without changing incentives. The WTI data suggests most adoption friction is organisational, not model quality—skilled builders frequently land in Blocked Agency zones where legacy metrics punish workflow redesign. Pair agent rollouts with evaluation criteria that reward reinvention, not only throughput.

The four-pattern ladder is a practical safety taxonomy. Author and Editor modes suit low blast-radius tasks with human review on every artefact. Director mode needs job isolation, rollback, and audit trails. Orchestrator mode demands a control plane—Agent 365 in Microsoft’s stack—for connector scopes, identity, and exception routing. The same framing applies whether you build on Copilot or run Claude Code beside it.

Cowork’s plugin and connector model is the integration surface to design for: native first-party data (Fabric, Dynamics), packaged partner actions, and custom plugins for proprietary expertise. Federated connectors let agents read external knowledge without migrating data. That graph-of-connectors pattern is portable beyond M365.

Frontier Professionals—multi-step agent users who redesign workflows and publish team standards—are a benchmark for internal playbooks. They pause to allocate human versus AI work, deliberately practise skills without AI, and treat model output as draft material. Telemetry showing 49% of Copilot use in cognitive tasks suggests backlog priority belongs in analysis and synthesis features, not generic chat wrappers.

Human-agent operating model shift in Frontier Firms

Frontier Firms redesign work around human–agent teams: people set goals and own accountability whilst agents execute repeatable analysis and orchestration.

Readiness zones at a glance

flowchart LR
  subgraph lowOrg["Low organisational readiness"]
    ST["Stalled 16%"]
    EM["Emergent 50%"]
  end
  subgraph highOrg["High organisational readiness"]
    UC["Unclaimed capacity 5%"]
    FR["Frontier 19%"]
  end
  subgraph indiv["Individual capability"]
    LO["Low"]
    HI["High"]
  end
  BA["Blocked agency 10%"]
  HI --> BA
  BA --> lowOrg
  FR --> highOrg
  HI --> FR
  LO --> ST
  classDef agent fill:#8B0000,color:#fff
  classDef tool fill:#189AB4,color:#fff
  class FR agent
  class BA tool

Matrix view: Frontier sits where individual skill and organisational support reinforce each other; Blocked Agency is the engineering-heavy zone where talent outruns incentives.

Research supplement

Web search and external page fetches were not available during this session (permissions not granted), so no additional sources could be verified. The following are factual claims from the article that would benefit from independent corroboration if this supplement is expanded in a future pass:

  • The 67% vs 32% organisational/individual split — the WTI methodology appendix (available at aka.ms/2026WorkTrendIndexAnnualReport) should be consulted to confirm how these figures were derived from the survey data.
  • Agent 365 GA and Microsoft 365 E7 SKU details — pricing and availability can be verified against the Tech Community announcement at the reference URL provided by the author.
  • Federated connector GA status — HubSpot, LSEG, Moody's, and Notion connector availability can be confirmed via the Microsoft 365 Copilot release notes.

References

Categories
News

Apple Private Cloud Compute on Google Cloud: NVIDIA GPUs with Verifiable Privacy

Apple is extending Private Cloud Compute to Google Cloud NVIDIA GPU clusters so the heaviest Apple Intelligence workloads can run on third-party infrastructure without abandoning stateless, attestable privacy guarantees.

FieldDetail
Date9 June 2026 (Apple Security Research blog)
VendorApple — hosted on Google Cloud with NVIDIA and Intel silicon
ProductPrivate Cloud Compute (PCC) on Google Cloud for Apple Intelligence cloud inference
AvailabilitySummer 2026 preview with gradual ramp to full protection set; further detail at Confidential Computing Summit and in an updated PCC Security Guide
Pricing / limitsConsumer Apple Intelligence feature (no public API); security researchers gain binary inspection and bounty-programme access to research-mode nodes

What changed

  • PCC leaves Apple-only data centres. For the first time, Apple Intelligence cloud inference runs on Google Cloud systems, whilst Apple retains cryptographic control over which PCC software builds devices will trust.
  • New hardware trust stack. The implementation combines NVIDIA Confidential Computing GPUs, Intel CPUs with Trust Domain Extensions (TDX), and Google’s Titan security chip — replacing the Apple-silicon-only hosts used since PCC launched in 2024.
  • Foundation model collaboration. Apple worked with Google to apply Gemini-family techniques when building next-generation Apple Foundation Models; on-device tiers still handle lighter tasks, but agentic tool-use and complex reasoning target the cloud tier on NVIDIA hardware.
  • Supply-chain and attestation hardening. Apple maintains a cryptographically verifiable, append-only ledger of every Google Cloud machine in the PCC fleet. Components that could exfiltrate data if compromised are attested with at least two independent vendor roots of trust.
  • Architectural patterns carry over. Initial request parsing runs in a dedicated namespaced process; shared inference processes recycle on a short time-to-live; attested keys live in a separate confidential VM isolated from external inputs.
  • Transparency programme unchanged. PCC binaries remain published for public inspection, with research tooling and live research-mode nodes offered through the Apple Security Bounty Programme.

Why it matters for engineers

Confidential VMs and GPU encryption are now commodity cloud options. Apple’s claim is different: those primitives have not, until now, been composed into an end-to-end confidential inference pipeline that also ships public binaries and bounty-grade verification at global scale. PCC on Google Cloud is a reference for treating the entire stack — firmware through application code — as the trusted computing base, rather than trusting the guest VM boundary alone.

Platform teams building multi-tenant AI should study the operational patterns, not only the silicon. Stateless computation is enforced through short-lived inference workers and isolated parsers, reducing the blast radius if a host is misconfigured. Hardware inventory ledgers matter when you neither manufacture servers nor operate the facility: they convert supply-chain risk into auditable state. Dual roots of trust make it harder for a single vendor compromise to forge the entire attestation story.

For Apple Intelligence client engineers, the device-side contract is stable: only Apple-cryptographically-approved PCC releases execute, regardless of whether inference lands on Apple metal or a Google Cloud A3-class confidential GPU node. Preview ramp during summer 2026 means protection depth may converge over weeks — plan feature flags and telemetry accordingly until Apple declares parity with Apple-data-centre PCC.

Security researchers should watch the Confidential Computing Summit session and the forthcoming PCC Security Guide update for attestation quote formats, research-node access mechanics, and fleet geography. Until then, treat this announcement as architectural intent with preview availability, not a finished open inference API.

Apple PCC privacy envelope extended to Google Cloud NVIDIA confidential compute

Apple Private Cloud Compute extends its privacy envelope to Google Cloud nodes using NVIDIA confidential GPUs, Intel TDX, and Titan-backed attestation.

flowchart LR
    DEV["Apple device"]
    TRUST["Apple-approved PCC client"]
    NODE["Confidential cloud node"]
    GPU["Stateless GPU inference"]
    RESP["Encrypted response"]

    DEV --> TRUST
    TRUST --> NODE
    NODE --> GPU
    GPU --> RESP
    RESP --> DEV

    classDef agent fill:#8B0000,color:#fff
    classDef tool fill:#189AB4,color:#fff
    class NODE,GPU tool
    class DEV,RESP agent

References

Categories
News

Amazon Bedrock EU Cross-Region Inference: GDPR-Aligned Model Routing for Engineers

Amazon Bedrock now documents EU geographic cross-region inference profiles so teams in Europe can pool model capacity across Union Regions whilst keeping prompts and outputs inside a fixed EU routing boundary.

FieldDetail
Date26 May 2026 (AWS Machine Learning blog)
VendorAmazon Web Services
ProductAmazon Bedrock — Cross-Region Inference (CRIS), EU system-defined inference profiles
AvailabilityCommercial Bedrock Regions; EU profiles route only to EU destination Regions (with London and Zurich source exceptions per AWS rules)
Pricing / limitsNo separate routing fee; billed from source Region; global profiles offer ~10% savings on some models; inference profiles do not support Provisioned Throughput

What changed

  • Inference profile IDs replace plain model IDs. Applications opt into CRIS by passing system-defined profile strings such as eu.amazon.nova-2-lite-v1:0 (EU geographic) or global.amazon.nova-2-lite-v1:0 (global commercial) to Converse, InvokeModel, streaming APIs, batch jobs, Agents, and knowledge-base generation.
  • EU geographic profiles constrain destination Regions. All destinations in EU CRIS lie within the European Union. Requests from EU sources cannot be routed to non-EU commercial Regions whilst using an eu.* profile.
  • London and Zurich are special-cased. Sources in eu-west-2 may route among EU Regions plus London; eu-central-2 sources among EU Regions plus Zurich. Non-EU sources using EU profiles are optimised across the source Region and EU destinations only.
  • Geographic profile Region lists are static. AWS will publish a new inference profile ID rather than silently expanding an existing EU geography definition.
  • Audit fields ship in CloudTrail. Invocation metadata is logged in the customer source Region; additionalEventData.inferenceRegion records where Bedrock actually processed the request. Optional Model Invocation Logging keeps full payloads in the source Region only.
  • Compliance framing is explicit. The post ties CRIS to GDPR records-of-processing expectations, IAM least privilege, and Amazon Bedrock’s inclusion in the CISPE Data Protection Code of Conduct.

Why it matters for engineers

EU SaaS teams no longer choose between single-Region throttling and unaudited multi-Region sprawl. EU CRIS is a deliberate contract: your SDK client stays in a familiar source Region, but Bedrock may execute inference in another EU Region selected for capacity. Inter-Region traffic remains on the AWS private backbone with encryption in transit — a detail that matters when security reviewers ask whether prompts leave controlled networks.

The integration surface is small; the governance surface is not. IAM policies for geographic CRIS must grant bedrock:InvokeModel on the inference profile and on foundation-model ARNs in every destination Region listed for that profile, often conditioned on bedrock:InferenceProfileArn. Service Control Policies that block any destination Region in the profile will fail requests even when the source Region is allowed. Cross-Region inference can also target Regions you have not manually enabled — SCP design must allow the full destination set.

Operational teams should dashboard inferenceRegion alongside application metrics. That field supports data-protection impact assessments without enabling payload logging. When maximum throughput or ~10% cost savings outweigh residency constraints, global.* profiles remain available — but that is an explicit product decision, not a framework default.

Discover profiles via the Bedrock console cross-Region inference page, per-model Regional availability tables in the user guide, or list_inference_profiles(typeEquals='SYSTEM_DEFINED') from your source Region. Treat profile choice as architecture documentation: EU geographic for GDPR-aligned processing, global for performance-first workloads with accepted cross-border inference risk.

EU geographic inference profiles routing Bedrock requests within Union Regions

EU geographic Bedrock inference profiles keep prompts and outputs inside Union Regions whilst pooling capacity across EU destination Regions.

flowchart LR
    APP["App in source Region"]
    API["Bedrock runtime API"]
    ROUTER{"CRIS profile router"}
    DEST["Destination Region inference"]
    RET["Response to source Region"]

    APP --> API
    API --> ROUTER
    ROUTER --> DEST
    DEST --> RET
    RET --> APP

    classDef agent fill:#8B0000,color:#fff
    classDef tool fill:#189AB4,color:#fff
    class ROUTER tool
    class APP,RET agent

Research supplement

Web search was unavailable during production of this supplement; no additional external sources could be independently verified for this article. The CISPE Data Protection Code of Conduct certification status for Amazon Bedrock, referenced in the article, should be confirmed directly via the CISPE public register at cispe.cloud. The adequacy decision status for the UK and Switzerland under GDPR Article 45 — relevant to the London and Zurich source-Region edge cases — should be confirmed against current European Commission adequacy decisions, as adequacy status can be revoked or amended.

References

Categories
News

MiMo-V2.5-Pro-UltraSpeed: 1T Model at 1000 Tokens Per Second on Commodity GPUs

Xiaomi MiMo and TileRT shipped MiMo-V2.5-Pro-UltraSpeed, a trillion-parameter API tier that sustains roughly 1000 tokens per second decode on a single eight-GPU commodity node—aimed at agent builders who need frontier-scale models inside realtime loops.

FieldDetail
Date8 June 2026
VendorXiaomi MiMo + TileRT
ProductMiMo-V2.5-Pro-UltraSpeed API and trial chat
AvailabilityApplication window 9–23 June 2026 (Beijing time); API at platform.xiaomimimo.com/ultraspeed
Pricing / limits~3× MiMo-V2.5-Pro API price; ~10× decode speed; Token Plan not supported; chat trial capped (10 queues/day, 30 min/session)

What changed

  • 1000+ tps on 1T MoE. Xiaomi claims the first public trillion-parameter decode above 1000 tokens per second using one standard eight-GPU server, via model–system co-design rather than custom wafer or SRAM-only silicon.
  • Selective FP4 on experts. MoE expert matrices quantise to FP4 (MXFP4) with quantisation-aware training; routers and attention stay higher precision to protect reasoning and code quality versus naive full-model FP4.
  • DFlash speculative decoding. Block-level masked parallel drafting replaces serial draft-token generation; reported acceptance lengths reach ~6.3 (coding), ~5.6 (maths/reasoning), and ~4.3 (agent) tokens per verification round with block size eight.
  • TileRT ultra-low-latency stack. Persistent engine kernels and warp-specialised pipelines cut microsecond execution gaps that dominate at kilohertz decode rates.
  • Open weights. Hugging Face release XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash ships FP4 weights plus DFlash draft parameters for offline study.
  • Gated trial. Approved users get free chat at ultraspeed.xiaomimimo.com during the promotion; enterprise partnerships via business-mimo@xiaomi.com.

Why it matters for engineers

Latency redefines what a trillion-parameter model can do. Below roughly ten tokens per second, 1T MoE models sit behind batch jobs and human-tolerated waits. Near 1000 tps, the same weights can participate in parallel Best-of-N search, sub-minute codegen sessions, or millisecond think–act loops in trading, fraud, and clinical triage—without downsizing to a 70B shortcut model.

The architectural lesson is co-design: bandwidth-bound expert matmuls shrink with FP4, serial decode expands via DFlash acceptance, and TileRT removes per-operator launch tax. Teams self-hosting open weights can benchmark the HuggingFace checkpoint on vLLM or SGLang; teams buying API capacity should measure cost per successful agent task during the June trial, not headline tokens per dollar alone.

Treat UltraSpeed as a latency SKU on MiMo-V2.5-Pro, not a new foundation family. Trial pricing and slots end 23 June 2026 unless extended; plan production fallbacks if FP4 quality drifts on your longest agent traces.

FP4 and DFlash accelerating trillion-parameter MoE decode on commodity GPUs

MiMo UltraSpeed stacks FP4 expert quantisation, DFlash speculative decoding, and TileRT persistent GPU pipelines to deliver roughly 1000 tokens per second from a one-trillion-parameter MoE on commodity hardware.

flowchart LR A[Agent request] –> B[MiMo-V2.5-Pro 1T MoE] B –> C[FP4 expert matmuls] B –> D[DFlash draft block] C –> E[TileRT persistent kernels] D –> E E –> F[~1000 tps token stream]

Research supplement

Web search was unavailable during this drafting session. No external sources could be verified. Recommend checking the following primary sources directly for corroboration: the TileRT technical post at tilert.ai detailing the kernel architecture and benchmark methodology; the Hugging Face model card for XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash for QAT methodology and reported eval scores; and the OCP Microscaling Formats specification for MXFP4 format details. Any third-party reproduction benchmarks on vLLM or SGLang that emerge after 9 June 2026 would materially strengthen or challenge the throughput claims.

References

Categories
News

Open-Weight AI Release Week: 25+ Models Across LLMs, Image, Audio, Video, and 3D (June 2026)

Early June 2026 delivered one of the densest open-weight release windows on record — spanning chat models, image generation, speech, music, vision, video, and 3D. The roundup below maps 25+ notable drops across modalities, with specs drawn from official model cards and repos rather than hype alone.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  W[Open-weight release week] --> L[LLMs and MoE chat]
  W --> I[Image DiT checkpoints]
  W --> A[Audio TTS and ASR]
  W --> V[Vision VLMs and OCR]
  W --> M[Music and realtime audio]
  W --> X[Video world and 3D]

  L --> D[Deploy: MLX ONNX vLLM]
  I --> D
  A --> D
  V --> D
  M --> D
  X --> D

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff

  class W agent
  class L,I,A,V,M,X,D hook
Open-weight AI releases grouped by modality for one busy week
Release density by modality — LLMs, image, audio, vision, video, and 3D all shipped open weights in the same window.

Large language models and edge chat

ModelOrgKey specsWhy it matters
Nemotron 3 UltraNVIDIA550B hybrid Mamba–MoE; 55B active; 1M context; 89.1 MMLU; NVFP4 variant ~5× throughput on BlackwellFirst openly weighted 550B hybrid Mamba–Transformer; datacenter agentic scale with ~10% active params
Gemma 4 12BGoogleEncoder-free any-to-any (text/image/audio/video); 256k context; 140+ languages; AIME 2026 77.5; 23-checkpoint QAT wave (mobile ONNX + MLX)Most deployable multimodal open model of the week — laptop-class with Apache 2.0 weights
LFM2.5-8B-A1BLiquid AIEdge MoE; ~1.5B active; 128k ctx; MATH500 88.8; MLX-readyStrong on-device math/reasoning per active parameter
Mellum2-12B-A2.5B-ThinkingJetBrainsFirst open JetBrains MoE; 2.5B active (8 of 64 experts); 131k ctx; LiveCodeBench v6 69.9; Apache 2.0Near–Qwen3-14B coding quality at much lower active width for IDE/agent tooling

Links: Nemotron 3 Ultra · Gemma 4 12B · LFM2.5-8B-A1B · Mellum2 Thinking

Image generation — Ideogram 4 open weights

The surprise headline: Ideogram 4 shipped its first-ever open weights — a 9.3B flow-matching Diffusion Transformer (DiT) trained from scratch. Reported leaderboard placement: #2 overall behind GPT Image 2 on aggregate arenas, top open-weight on Design Arena and LMArena, with particular strength on text-rich layouts (posters, UI mockups, labelled diagrams).

PropertyIdeogram 4 open
Architecture9.3B DiT, flow matching, native 2K
Structured promptsJSON with bounding boxes and colour palettes
WeightsGated on Hugging Face (ideogram-ai/ideogram-4-nf4, FP8 variants)
License splitApache 2.0 code; non-commercial weight agreement (commercial path via Ideogram)

Link: Ideogram 4.0 technical blog · Hugging Face collection

Audio, speech, and music — four TTS labs in one week

ModelOrgHighlights
Higgs Audio v3 TTS 4BBoson AI100+ languages; inline emotion/style/prosody tags; singing/whisper/shout; sub-second time-to-first-audio; 8-codebook AR decoder + 24 kHz output
dots.ttsrednote hilab2B fully continuous AR TTS — no discrete codec tokens; 48 kHz AudioVAE; Qwen2.5-1.5B backbone; Apache 2.0
Magenta RealTime 2GoogleReal-time music generation; <200 ms latency; text + audio + MIDI conditioning; community PyTorch port with live ZeroGPU demos within hours
Nemotron-3.5 ASRNVIDIA600M streaming ASR; 17× more concurrent streams vs Parakeet RNNT 1.1B in NVIDIA benchmarks

Links: Higgs Audio v3 · dots.tts · Magenta RealTime 2 · Nemotron ASR via NVIDIA HF

Vision, VLMs, and document AI

ModelOrgHighlights
Step-3.7-FlashStepFun198B sparse MoE VLM; ~11B active; SWE-Bench PRO 56.3; Apache 2.0
PaddleOCR-VL-1.6PaddlePaddleSOTA document parsing at 1B params; Apache 2.0
NAVABaidu6.3B joint audio–video generation; strong A/V sync in reported evals; Apache 2.0

Video, world models, and 3D

ModelOrgHighlights
Cosmos3-SuperNVIDIA64B physical-AI omnimodel (32B reasoner + 32B generator); couples action trajectories with video+audio gen; OpenMDW 1.1 on Hugging Face
JoyAI-EchoJDUp to 5-minute multi-shot text-to-video on LTX-2.3 stack
Bernini-RByteDanceOpen video/reconstruction line (companion to VAST releases)
VAST TripoSplatByteDance VASTSingle-image → 3D Gaussian splats; MIT license

Links: Cosmos3-Super · NVIDIA Cosmos 3 blog · nvidia/Cosmos

Glossary — abbreviations from the roundup

TermMeaning
MoEMixture-of-Experts — sparse activation; only a subset of parameters run per token (powers many frontier chat models)
QATQuantization-Aware Training — train so weights compress cleanly to INT4/FP8 for phones and laptops
MMLUMassive Multitask Language Understanding — broad knowledge benchmark for LLMs
ONNXCross-platform model format common in production inference
MLXApple’s framework for running models on M-series chips
DiTDiffusion Transformer — transformer backbone inside modern image/video generators

How to navigate the flood

  • Laptop / phone: Gemma 4 12B QAT, LFM2.5-8B MLX, Mellum2 for coding agents.
  • Design & posters: Ideogram 4 open DiT for text-heavy layouts.
  • Voice products: Higgs v3 for expressive tags; dots.tts for fully continuous Apache 2.0 pipelines.
  • Robotics / sim: Cosmos3-Super for multimodal world + action reasoning.
  • Datacenter LLM: Nemotron 3 Ultra + NVFP4 on Blackwell for throughput.

Watch the week in 60 seconds

Video: demonstration from Niels Rogge on LinkedIn.

Release-week summary

MetricValue
Notable open-weight drops25+ across modalities
Largest LLMNemotron 3 Ultra — 550B total / 55B active
Most deployable multimodalGemma 4 12B — encoder-free, QAT/MLX wave
Biggest image surpriseIdeogram 4 — first open 9.3B DiT weights
TTS breakout4 labs (Higgs, dots.tts, Magenta RT2, Nemotron ASR)
Physical AI flagshipCosmos3-Super — 64B omnimodel
Fastest streaming ASR claimNemotron-3.5 ASR — 17× streams vs Parakeet 1.1B

Research supplement

Web search was unavailable during drafting of this post. The seven highlighted models are grounded in the author's provided reference links (Hugging Face model pages, official blogs, and GitHub repositories). No additional verified external sources could be confirmed for this supplement. Readers wishing to verify benchmark comparisons, licence terms, or capability claims should consult the original Hugging Face model cards and the official blog posts linked in the article's reference section directly.

---

References

Categories
News

INSID3: Training-Free In-Context Segmentation from One Example with Frozen DINOv3

INSID3 is a training-free in-context segmentation method that segments objects, parts, or personalised instances in new images from a single annotated reference — using only frozen DINOv3 features, with no segmentation decoder, fine-tuning, or auxiliary models such as SAM.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  R[Reference image + mask] --> E[Frozen DINOv3 encoder]
  T[Target image] --> E
  E --> D[Positional debias SVD projection]
  E --> C[Agglomerative clustering]
  D --> S[Seed cluster selection]
  C --> S
  S --> A[Cluster aggregation cross + self similarity]
  A --> M[Segmentation mask + CRF refine]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class R,T,M agent
  class E agent
  class D,C,S,A hook

What in-context segmentation asks for

In-context segmentation (ICS) mirrors few-shot prompting in language models: you supply one visual example with a mask, and the system must segment the same concept in unseen images. The paper unifies three settings under one formulation:

SettingPrompt meansExample
SemanticAll instances of a classEvery “dog” in the target
PartSame object part“Dog ear” across poses
PersonalisedSame specific instance“My dog” in new photos

Why prior pipelines needed decoders or SAM stacks

ApproachTypical stackTrade-off
Fine-tuned ICS (SegIC, DiffewS)VFM + trained decoderStrong in-domain; weaker out-of-distribution
Training-free SAM pipelines (Matcher, GF-SAM)DINOv2 + SAM (~945M params)Better generalisation; multi-stage cost and fixed mask priors
INSID3Frozen DINOv3 Large only (304M)No mask supervision; segmentation emerges from dense SSL features

Three-stage pipeline on frozen DINOv3

INSID3 never updates weights. Given reference image Ir with binary mask and target It, it extracts patch embeddings from a frozen DINOv3 encoder, then runs three stages:

  • Fine-grained clustering — agglomerative clustering on original target features (threshold τ = 0.6) yields coherent object- and part-level region candidates without fixing cluster count K.
  • Seed-cluster selection — backward nearest-neighbour filtering in debiased feature space narrows candidates; cross-image prototype similarity picks the seed cluster most aligned with the reference mask.
  • Cluster aggregation — merges the seed with clusters whose combined cross-image × intra-image self-similarity score exceeds α = 0.2, recovering full object extent beyond the seed alone.
INSID3 pipeline overview from the CVPR 2026 paper Figure 3
Paper Figure 3: debias → cluster → seed selection → aggregation on frozen DINOv3.

Unlocking DINOv3: positional debiasing

DINOv3’s dense features carry strong semantic correspondence, but cross-image similarity also shows a systematic positional bias: patches at the same absolute coordinates spuriously match even when semantics differ — especially in low-content background regions. The authors estimate a low-dimensional positional subspace by passing a Gaussian noise image through the encoder, taking the top-s right singular vectors from SVD, and projecting reference and target features onto the orthogonal complement. Debiased features drive cross-image matching; original features drive intra-image clustering and self-similarity aggregation.

Cross-image similarity before positional debiasing in DINOv3
Original DINOv3 similarity: spurious activations align with reference coordinates (paper Figure 4a).
Cross-image similarity after positional debiasing
Debiased features suppress coordinate-driven matches while preserving semantics.

On semantic correspondence (SPair-71k, DINOv3-Base), debiasing lifts PCK@0.10 from 46.8 to 52.6 (+5.8 points) — showing the fix generalises beyond segmentation.

Region grouping without supervision

DINOv3 patch embeddings exhibit strong local consistency: neighbouring patches on the same object or part cluster naturally. Agglomerative merging on original features decomposes the target into structured region candidates that INSID3 then matches and expands — no K-means preset and no SAM mask proposals.

DINOv3 agglomerative clustering producing object and part regions
Paper Figure 2: agglomerative clustering on dense DINOv3 features.

Benchmark results across nine datasets

The paper evaluates one-shot semantic (COCO-20i, LVIS-92i, ISIC, SUIM, iSAID, chest X-ray), part (PASCAL-Part, PACO-Part), and personalised (PerMIS) segmentation. Primary metric: mean IoU (mIoU) on the final mask. INSID3 averages 55.1 mIoU across all nine benchmarks — +7.5 points over prior work on average and +8.1 over the strongest SAM-based training-free baseline (GF-SAM with DINOv3).

MethodEncoderParamsAvg mIoUPerMISX-RayLVIS-92i
INSID3DINOv3304M55.167.078.841.8
GF-SAM + debiasDINOv3 + SAM945M48.854.560.034.6
GF-SAMDINOv2 + SAM945M47.654.151.035.2
MatcherDINOv2 + SAM945M46.063.870.833.0
SegICDINOv2 + decoder310M44.151.834.544.6

Speed and deployment

MethodStackThroughput
MatcherDINOv2 + SAM0.11 FPS
GF-SAMDINOv2 + SAM0.97 FPS
INSID3DINOv3 only3.31 FPS

INSID3 runs about 3.4× faster than GF-SAM and 29.8× faster than Matcher while using roughly fewer parameters than dual-VFM pipelines. Inputs default to 1024×1024; reducing to 768px trades a small mIoU drop for substantially faster inference. Masks are bilinearly upsampled to original resolution with optional CRF refinement.

Run INSID3 locally

# Clone and install (uv or conda — see GitHub README)
git clone https://github.com/visinf/INSID3
cd INSID3
uv sync && source .venv/bin/activate

# Download frozen DINOv3 Large weights into pretrain/
# dinov3_vitl16_pretrain_lvd1689m-8aa4cbdd.pth

from models import build_insid3
from utils.visualization import visualize_prediction_segmentation as visualize

model = build_insid3()  # optional: mask_refiner="crf", image_size=768
model.set_reference("ref_image.jpg", "ref_mask.png")
model.set_target("target_image.jpg")
pred_mask = model.segment()

# Batch eval example
# python inference_segmentation.py --dataset coco --exp-name insid3-coco

Official code: visinf/INSID3 (Apache 2.0). Colab demo and semantic-correspondence inference (`model.match(ref_kps)`) are also documented in the repository. Accepted as a CVPR 2026 Oral.

Performance summary

MetricValue
BackboneFrozen DINOv3 Large (304M)
TrainingNone — inference-only pipeline
Benchmarks9 (semantic, part, personalised)
Average mIoU55.1
Gain vs prior average+7.5 mIoU
Throughput3.31 FPS (vs 0.97 GF-SAM)
Key hyperparametersτ = 0.6, α = 0.2, image size 1024
ReleasePaper + code (March 2026); CVPR 2026 Oral

Research supplement

Web search was not available during this session; no additional reputable sources were retrieved. The paper (arXiv:2603.28480), project page (visinf.github.io/INSID3), and repository (visinf/INSID3) provided by the author are the primary sources. Readers interested in the broader context should check the paper's related-work section for citations to DINOv2 (Oquab et al., 2023), Matcher, GeoAware-SC, HSNet, and PANet — the most relevant prior training-free and few-shot segmentation baselines.

References

Categories
News

Harness-1: 20B Open Search Agent with State-Externalizing RL Harness for Long-Horizon Retrieval

Harness-1 is a 20B open-source search agent that trains with reinforcement learning inside a state-externalizing harness — the environment keeps candidates, curated evidence, verification logs, and deduplicated history so the policy only decides what to search, keep, verify, and when to stop.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  Q[Search query] --> P[20B policy]
  P -->|search curate verify| T[Retrieval tools]
  T --> H[Stateful harness]
  H --> C[Candidate pool]
  H --> E[Curated evidence set]
  H --> V[Verification records]
  H --> R[Budget-aware context render]
  R --> P
  P -->|submit| A[Final evidence set]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class Q,A agent
  class P agent
  class T hook
  class H,C,E,V,R hook

Why append-only transcripts break long-horizon search

Most search agents are trained as policies over growing chat logs: search, read, append, repeat. The model must simultaneously search well and act as memory, note-taker, verifier, and librarian. Patrick Jiang’s launch thread argues RL on that setup optimises recoverable bookkeeping the environment could maintain more reliably — and sparse final rewards rarely say whether failure came from bad search, forgotten evidence, or missing verification.

Structured search workspace versus endless append-only chat transcript
Harness-1 replaces transcript bookkeeping with an explicit search workspace.

What the harness stores vs what the policy decides

Environment-side harnessPolicy-side decisions
Candidate document poolWhat query to run next
Importance-tagged curated setWhich documents to inspect or keep
Compact evidence linksWhich claims need verification
Verification recordsWhen evidence is sufficient to stop
Compressed, deduplicated observationsHow to revise search after failed checks
Budget-aware context rendering

The agent operates over a workspace, not a raw search box. RL then learns a structured interface — search, curate, revisit, verify, submit — instead of surviving an ever-longer transcript.

Training stack and data scale

StageDetail
Base model20B retrieval subagent (gpt-oss-20b family)
SFT899 filtered trajectories
RL3,453 queries
InfraTinker training; Chroma-backed retrieval harness
Prior workBuilds on Chroma Context-1 self-editing search agent line

The thread emphasises that much of the behavioural prior can live in the harness interface — not only in massive task-specific datasets.

Benchmark results across eight retrieval suites

The paper evaluates web, finance, SEC, patent, and multi-hop QA settings (eight benchmarks total). Primary metric: curated recall — fraction of gold evidence documents present in the final curated set. Harness-1 averages 0.730 curated recall and 0.807 trajectory recall.

Average curated recall comparison for Harness-1 versus open and frontier search agents
Average curated recall across eight benchmarks (paper Figure 1 values).
SearcherTypeAvg curated recallAvg trajectory recall
Opus-4.6Frontier0.7640.794
Harness-1 (20B)Open0.7300.807
GPT-5.4Frontier0.7090.752
Kimi-K2.5Frontier0.6470.794
Tongyi DeepResearch 30BOpen0.6160.673
Context-1 (20B)Open0.6030.756
Search-R1 (32B)Open0.2890.289

Harness-1 beats the next strongest open subagent (Tongyi DeepResearch 30B) by +11.4 curated-recall points. Among tested searchers, only Opus-4.6 scores higher on average curated recall; the launch claims Context-1-class cost and latency at near-frontier search quality. Frontier baselines run as zero-shot retrievers under the Context-1 harness in the paper setup.

Transfer and ablations

  • In-domain vs Context-1: +7.9 curated-recall points on source-family benchmarks.
  • Held-out transfer: +17.0 points — the thread’s headline generalisation result.
  • Harness ablation: disabling harness mechanisms changes behaviour (shallower search, less verification, worse curation), not just information availability; BrowseComp+ recall drops ~12.2% relative in reported ablations.

Open release and local serving

# Install and serve with vLLM (Linux, Python 3.11+, CUDA GPU)
git clone https://github.com/pat-jj/harness-1
cd harness-1
uv sync --extra vllm
export HARNESS1_HF_MODEL=pat-jj/harness-1

# Smoke test
uv run python inference/hf_inference.py \
  --model pat-jj/harness-1 \
  --prompt "Briefly describe Harness-1."

# Full BrowseComp+ eval: see docs/run_vllm_browsecompplus.md
# Requires local BrowseComp+ files + Chroma retrieval collection

Weights: pat-jj/harness-1 on Hugging Face. Public eval path documented for BrowseComp+; in-domain web/SEC/patent corpora require building compatible Chroma indexes (Context-1 data-gen pipeline). Metrics exported include recall, trajectory recall, final-answer recall, and precision.

See Harness-1 in action

Video: demonstration from Patrick Jiang on X.

Performance summary

MetricValue
Model size20B
Benchmarks8 (web, finance, patents, multi-hop QA)
Avg curated recall0.730
Gain vs next open subagent+11.4 points
Transfer gain vs Context-1+17.0 points (held-out)
SFT / RL data899 trajectories / 3,453 queries
ReleaseOpen weights + harness code (June 2026)

Research supplement

---

References

Categories
News

PaperBanana Open Source: Multi-Agent Academic Diagrams from Google Research to CLI and MCP

PaperBanana is Google Research’s agentic framework for turning methodology text into publication-ready diagrams and statistical plots; the community paperbanana package (MIT, llmsresearch/paperbanana) ships a full CLI, Python API, Gradio Studio, and MCP server on top of the arXiv paper’s five-agent design.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  S[Method text + caption] --> R[Retriever]
  R --> P[Planner]
  P --> Y[Stylist]
  Y --> V[Visualizer]
  V --> C{Critic}
  C -->|Revise description| V
  C -->|Final image| O[Publication figure]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class S,O agent
  class R,P,Y,V hook
  class C decision

The research problem: illustrations block AI scientists

LLM agents can draft papers and run experiments, but methodology diagrams still demand Illustrator, TikZ, or hours of manual layout. PaperBanana (arXiv:2601.23265, Google Cloud AI Research) formalises the task as mapping source context S and communicative intent C (the figure caption) to an image I, optionally guided by reference examples from a curated set. The official codebase is now branded PaperVizAgent at google-research/papervizagent; the paper name PaperBanana remains the common label.

Five agents: retrieve, plan, style, render, critique

AgentRole
RetrieverVLM ranks reference diagrams by domain and visual topology (pipeline vs architecture)
PlannerIn-context learning from retrieved triplets → detailed textual figure description
StylistApplies auto-synthesised aesthetic guidelines (palette, layout, typography)
VisualizerImage model renders description; Matplotlib code for statistical plots
CriticCompares render to source; emits revised description for next iteration

Phase 1 is linear planning (Retriever → Planner → Stylist). Phase 2 loops Visualizer and Critic for T = 3 rounds by default. Google’s experiments use Gemini-3-Pro as VLM judge and Nano-Banana-Pro for image generation.

Community paperbanana package adds CLI Studio and MCP on top of Google PaperBanana research
The MIT community port adds developer surfaces beyond the research reference implementation.

PaperBananaBench results

Google built PaperBananaBench from NeurIPS 2025 methodology figures: 584 curated samples split into 292 test and 292 reference cases. A VLM-as-a-Judge scores faithfulness, conciseness, readability, and aesthetics against human diagrams.

MethodFaithfulnessConcisenessReadabilityAestheticOverall
Vanilla Nano-Banana-Pro43.043.538.565.543.2
PaperBanana + Nano-Banana-Pro45.880.751.472.160.2
Human reference (baseline)50.050.050.050.050.0

Reported gains vs vanilla image generation: +17.0% overall, with the largest jump in conciseness (+37.2%). Ablations show the Retriever supplies structural patterns, the Stylist boosts conciseness and aesthetics, and the Critic recovers faithfulness lost during styling.

Open-source implementation: what llmsresearch/paperbanana adds

Akshay Pachaar’s community repo is an unofficial MIT implementation (not affiliated with Google). It extends the published pipeline with optional Input Optimizer (context enricher + caption sharpener), 13 bundled reference diagrams, and multi-provider support:

SurfaceCapability
paperbanana generateMethodology diagrams from .txt or PDF (--pdf-pages)
paperbanana plot / plot-batchStatistical plots from CSV/JSON with code-based Visualizer
paperbanana batchYAML/JSON manifest; optional composite stitch (1x3 panels)
paperbanana orchestrateFull paper figure package: figures.tex, captions.md
paperbanana evaluateVLM judge on faithfulness, readability, aesthetics vs human reference
paperbanana studioLocal Gradio UI for diagrams, plots, batch, run browser
MCP + Claude skillsgenerate_diagram, continue_run, evaluate_diagram for Cursor/Claude Code
PaperBanana paper input to diagram output example from the open-source repository
Example from the paperbanana repository: paper text in, methodology diagram out.

Quick start

pip install paperbanana

# Methodology diagram (OpenAI or Gemini via .env)
paperbanana generate \
  --input method.txt \
  --caption "Overview of our framework" \
  --optimize --auto

# Statistical plot
paperbanana plot --data results.csv \
  --intent "Bar chart comparing accuracy across benchmarks"

# Batch manifest
paperbanana batch --manifest figures.yaml --optimize

# Local web UI
pip install 'paperbanana[studio]'
paperbanana studio

Providers include OpenAI (gpt-5.2 + gpt-image-1.5), Azure OpenAI / Foundry, Google Gemini (free tier via AI Studio), and OpenRouter. Install paperbanana[pdf] for PyMuPDF paper ingestion.

When to use diagrams vs plots

Output typeVisualizer pathWhy
Methodology diagramsImage generation modelIcons, layout, and NeurIPS-style aesthetics
Statistical plotsMatplotlib code generationNumerical fidelity beats pure image models
Human diagram polishStyle guidelines + image editPaper reports ~56% aesthetic wins vs originals

Performance summary

MetricValue
Benchmark test cases292 methodology diagrams
Reference set292 NeurIPS 2025 examples
PaperBanana overall score60.2 vs 43.2 vanilla (+17.0%)
Critic iterations (default)3
Community package licenceMIT (PyPI: paperbanana)
Official Google repogoogle-research/papervizagent
Community repo stars1,800+ (June 2026)

Research supplement

Model Context Protocol (MCP) background: The MCP server integration mentioned in the article refers to the Model Context Protocol, an open standard introduced by Anthropic in late 2024 for connecting AI language models to external tools, data sources, and services. MCP defines a client–server architecture in which tools (like PaperBanana) expose capabilities as MCP servers, and AI assistants (like Claude Desktop) act as MCP clients that can call those capabilities natively within a conversation. The official specification and documentation are maintained at modelcontextprotocol.io. Positioning PaperBanana as an MCP server means it participates in this integration ecosystem without requiring custom API wrappers for each AI environment that wants to call it.

Note on article content availability: The full text of the WordPress article and the PaperBanana project page were not accessible at generation time (content returned "Loading…"). The analysis, social posts, and supplementary material above are synthesised from the author-provided references (arXiv paper, GitHub repositories, PyPI listing) and the article title. Any specific benchmark numbers, step-by-step code examples, or screenshots from the original article should be verified against the live post before use in secondary coverage.

---

References