Categories
News

Context Graphs Explained: Why Enterprise AI Needs Decision Traces Not Just Data

Context graphs are the enterprise layer Jaya Gupta and Ashu Garg (Foundation Capital) argue will define the next trillion-dollar software wave — not another model, but a living record of decision traces: who approved what, under which policy, with which precedent, and why it was allowed. Gupta’s X Article (23 Dec 2025, 5.2M+ views) reframes the shift: systems of record capture what happened; context graphs capture why.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph LR
  subgraph SOR[Systems of record]
    CRM[CRM current state]
    ERP[ERP ledger]
    WH[Warehouse snapshots]
  end
  subgraph CG[Context graph]
    DT[Decision traces]
    POL[Policies + exceptions]
    PRE[Precedent links]
  end
  AGT[Agent orchestration] -->|reads| SOR
  AGT -->|emits at commit time| DT
  DT --> POL
  DT --> PRE
  DT -->|searchable why| AGT

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class AGT agent
  class DT,POL,PRE hook
  class SOR decision

Rules vs decision traces

| | Rules | Decision traces |
| --- | --- | --- |
| Scope | General policy (“use official ARR for reporting”) | Specific case (“used X definition, policy v3.2, VP exception, precedent Z”) |
| Agent need | What should happen in general | How rules were actually applied before |
| Storage today | Docs, playbooks, config | Slack threads, heads, side calls; rarely durable |

Agents need both: rules for defaults, traces for organisational memory. Without traces, every exception is re-learned from scratch in Slack every quarter.

What systems of record never captured

| Gap | Example | What CRM/ticket shows |
| --- | --- | --- |
| Tribal exception logic | “Healthcare accounts get +10% because of procurement cycles” | Final discount only |
| Precedent | “We structured Company X deal same way last quarter” | No link between deals |
| Cross-system synthesis | ARR + Zendesk escalations + churn Slack thread → escalate | “Escalated to Tier 3” |
| Off-system approvals | VP approves discount on Zoom or in DM | Final price, no approver trail |

Figure: decision trace flow from inputs and policies through exceptions and approvers to outcome. A decision trace is the structured path from gathered context to allowed action, not the model’s private chain-of-thought.

Context graph in practice — renewal discount

Foundation Capital’s worked example: a renewal agent proposes a 20% discount. Policy caps renewals at 10% unless a service-impact exception is approved. The agent pulls three SEV-1 incidents from PagerDuty, an open “cancel unless fixed” escalation in Zendesk, and a prior renewal thread where a VP approved a similar exception. Finance approves. The CRM stores one fact: 20% discount.

  • With traces — full replay: inputs, policy version, exception route, approver, precedent link
  • Without traces — auditors and future agents see only the outcome
  • Compounding — next similar renewal searches precedent instead of re-debating in Slack
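The renewal example above can be sketched as a small record type. This is a minimal illustration, not a schema from Foundation Capital's essay: the class `DecisionTrace`, its field names, the `decide_renewal_discount` helper, and the `v3.2` policy label are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical policy constant: renewals are capped at 10% unless an exception is approved.
RENEWAL_DISCOUNT_CAP = 0.10


@dataclass(frozen=True)
class DecisionTrace:
    """One replayable record of why an action was allowed (illustrative schema)."""
    action: str
    outcome: str
    policy_version: str
    inputs: dict
    exception_route: Optional[str] = None
    approver: Optional[str] = None
    precedent_ids: tuple = ()


def decide_renewal_discount(requested, inputs, approver=None, precedent_ids=()):
    """Return a trace; a discount above the cap is only allowed with a named approver."""
    needs_exception = requested > RENEWAL_DISCOUNT_CAP
    allowed = (not needs_exception) or approver is not None
    return DecisionTrace(
        action=f"renewal_discount:{requested:.0%}",
        outcome="approved" if allowed else "rejected",
        policy_version="v3.2",  # placeholder version label
        inputs=inputs,
        exception_route="service-impact" if needs_exception else None,
        approver=approver if needs_exception else None,
        precedent_ids=tuple(precedent_ids),
    )


trace = decide_renewal_discount(
    0.20,
    inputs={"pagerduty_sev1_incidents": 3, "zendesk_escalation": "cancel unless fixed"},
    approver="finance",
    precedent_ids=["prior-renewal-vp-exception"],
)
```

The CRM would still store only the 20% figure; the trace is what keeps the inputs, policy version, exception route, approver, and precedent link together for replay.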

Context graph vs knowledge graph

| Dimension | Knowledge graph | Context graph |
| --- | --- | --- |
| Unit | Entities and typed relationships | Decision traces and “why” links |
| Time | Often current state | Temporal by design: “what was true at decision time” |
| Origin | Modeled ontology upfront | Emerges from agent execution paths (PlayerZero: “schema is the output”) |
| Question | What exists and how entities relate? | Why was this allowed and what precedent applies? |
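The "what was true at decision time" property can be illustrated with an as-of lookup over versioned facts. This is a sketch under stated assumptions: `TemporalFacts`, its methods, and the sample cap values are hypothetical, and a real context graph would sit on a durable store rather than a dictionary.

```python
from bisect import bisect_right
from datetime import date


class TemporalFacts:
    """Keeps every version of a fact so a decision can be replayed as of any date."""

    def __init__(self):
        self._history = {}  # key -> list of (effective_date, value), kept sorted by date

    def record(self, key, effective, value):
        versions = self._history.setdefault(key, [])
        versions.append((effective, value))
        versions.sort(key=lambda version: version[0])

    def as_of(self, key, when):
        """Return the value in force on `when`, or None if nothing was recorded yet."""
        versions = self._history.get(key, [])
        dates = [effective for effective, _ in versions]
        idx = bisect_right(dates, when)
        return versions[idx - 1][1] if idx else None


facts = TemporalFacts()
facts.record("renewal_discount_cap", date(2025, 1, 1), 0.10)  # illustrative values
facts.record("renewal_discount_cap", date(2025, 9, 1), 0.15)
cap_in_june = facts.as_of("renewal_discount_cap", date(2025, 6, 1))
```

A current-state store would only answer with the latest cap; the as-of query returns the cap that applied when the decision was made.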

Why incumbents struggle to own the layer

| Player | Advantage | Blind spot for traces |
| --- | --- | --- |
| CRM / ERP agents (Salesforce, Workday, ServiceNow) | Own object data | Current-state storage; context at approval time not preserved; siloed per system |
| Warehouses (Snowflake, Databricks) | Historical snapshots via ETL | Read path after decisions; “why” already gone at ingest |
| Agent orchestration startups | Execution path at commit time | Can emit structured traces as first-class records |

Capturing decision traces requires being in the write path at decision time — not bolting governance onto exports after the fact.
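One way to picture "being in the write path" is a wrapper that records the trace in the same call that commits the decision. The sketch below assumes a hypothetical `emits_trace` decorator and an in-memory `TRACE_LOG` list standing in for a durable store; it is not any vendor's API.

```python
import functools
from datetime import datetime, timezone

TRACE_LOG = []  # stand-in for a durable, append-only trace store


def emits_trace(policy_version):
    """Wrap a commit function so a trace is written in the same call as the decision."""
    def decorator(commit_fn):
        @functools.wraps(commit_fn)
        def wrapper(*args, **kwargs):
            outcome = commit_fn(*args, **kwargs)
            TRACE_LOG.append({
                "action": commit_fn.__name__,
                "policy_version": policy_version,
                "inputs": {"args": args, "kwargs": kwargs},
                "outcome": outcome,
                "recorded_at": datetime.now(timezone.utc).isoformat(),
            })
            return outcome
        return wrapper
    return decorator


@emits_trace(policy_version="v3.2")  # placeholder version label
def approve_discount(account, percent, approver):
    return {"account": account, "discount": percent, "approved_by": approver}


result = approve_discount("acme", 20, approver="finance")
```

Because the trace is emitted at commit time, the approver and inputs are captured before anything is exported; an ETL job reading `result` later would see only the final discount.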

Three startup paths

| Path | Example | Strategy |
| --- | --- | --- |
| Replace system of record | Regie (AI-native sales engagement) | Agent as first-class actor; event-sourced state + policy capture native |
| Replace module | Maximor (finance close) | Own reconciliation logic; ERP remains ledger |
| New system of record for decisions | PlayerZero (production engineering) | Start as orchestration; replayable lineage becomes authoritative |

Observability complements the stack: Arize is positioned as Datadog-for-agent-decisions, used to monitor, debug, and evaluate agent behaviour as traces accumulate.

Signals for where to build

  • High headcount — 50+ people routing tickets, triaging, reconciling manually
  • Exception-heavy decisions — deal desks, underwriting, compliance, escalations where “it depends” is honest
  • Glue functions — RevOps, DevOps, SecOps exist because no single SoR owns cross-functional workflow

One month in — what resonated and what pushed back

Foundation Capital’s January 2026 follow-up reports Fortune 500 CIO inbound, portfolio builds (Maximor, PlayerZero, Tessera, Tonkean, Regie), and broad founder agreement: decision traces compound while models commoditise.

| Theme | Detail |
| --- | --- |
| Glue function problem | Tribal knowledge (“suppress that finding because of WAF”) lives in people, not tickets |
| Pushback: capture the why? | True intent is internal; capture the how (policy, evidence, approver) and infer patterns |
| Bitter lesson objection | Org policies aren’t suboptimal heuristics to discard; agents must act consistently with them |
| Open: category or feature? | Layer is necessary; standalone vs absorbed into warehouse/catalog TBD |
| Build practice | Start narrow (one workflow); let ontology emerge from trajectories |
| Time decay | Precedent half-life: pricing exception under old CFO may not apply today |
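Taking the "precedent half-life" idea literally, a precedent's weight could decay exponentially with age. This is only one possible reading: the function name `precedent_weight` and the 180-day default are assumptions for illustration, not figures from the follow-up essay.

```python
from datetime import date, timedelta


def precedent_weight(decided_on, today, half_life_days=180):
    """Weight halves every `half_life_days`; a decision made today has weight 1.0."""
    age_days = max((today - decided_on).days, 0)
    return 0.5 ** (age_days / half_life_days)


today = date(2026, 1, 31)
fresh = precedent_weight(today, today)
one_half_life = precedent_weight(today - timedelta(days=180), today)
two_half_lives = precedent_weight(today - timedelta(days=360), today)
```

A retrieval step could multiply similarity by this weight so that an exception granted under an old regime ranks below a recent one.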

Human-in-the-loop path to autonomy

Context graphs do not require full autonomy on day one. Start with agent proposes → gathers context → routes approval → records trace. As similar cases repeat, more of the path automates because the system holds searchable precedent. Even when a human makes the final call, the graph grows if inputs, approval, and rationale are captured — not left to die in Slack.
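The gradual shift from human approval to automation can be sketched as a routing rule over recorded precedent. The threshold of three approved precedents, the `route` function, and the record fields are hypothetical choices for the example.

```python
def route(case_key, precedents, min_precedents=3):
    """Send a case to a human until enough approved precedents of the same kind exist."""
    approved = [
        p for p in precedents
        if p["case_key"] == case_key and p["outcome"] == "approved"
    ]
    return "auto_approve" if len(approved) >= min_precedents else "human_review"


precedents = []  # grows as traces are recorded, even when a human makes the call
routes = []
for _ in range(4):
    decision = route("renewal:service-impact", precedents)
    routes.append(decision)
    precedents.append({
        "case_key": "renewal:service-impact",
        "outcome": "approved",
        "approver": "human" if decision == "human_review" else "policy",
    })
```

In this toy run the first three cases go to a human and the fourth is handled from precedent, which mirrors the article's point that the graph grows even while humans still decide.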

Performance summary

| Metric / milestone | Value | Context |
| --- | --- | --- |
| X Article publish | 23 Dec 2025 | Jaya Gupta (@JayaGup10) |
| X Article views | 5.2M+ | Public engagement metric |
| Foundation Capital essay | 22 Dec 2025 | Co-authored with Ashu Garg |
| Follow-up “one month in” | Jan 2026 | Enterprise inbound + portfolio signal |
| Prior SoR generation | ~$1T ecosystem | Salesforce, Workday, SAP pattern |
| Core unit | Decision trace | Inputs, policies, exceptions, approvers, outcome |
| Portfolio examples | Regie, Maximor, PlayerZero, Arize | Three paths + observability |

Bottom line: Next platforms own systems of record for decisions, captured at execution time, not reconstructed from ETL.

References

Research supplement

Web search was unavailable in this session. No externally verified sources could be confirmed. The article's four listed references (Jaya Gupta's X Article, Foundation Capital's December 2025 essay, the January 2026 follow-up, and Atlan's knowledge graph comparison) are cited in the article body and should be linked there directly by the site editor once canonical URLs are confirmed. No invented URLs have been added.


Agentic Loops for Developers: 8 Rules to Design Systems That Prompt AI Agents

Most developers still prompt coding agents one task at a time — and become the bottleneck. Daniel Moka argues the shift is agentic loops: bounded systems that discover, plan, execute, verify, and iterate until a hard gate passes, while you own the rules on disk. Average developers write prompts for agents; great developers design loops that prompt agents.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  G[Goal + RULES.md on disk] --> D[Discover]
  D --> P[Plan]
  P --> E[Execute in worktree]
  E --> V[Verifier agent]
  V --> Q{Quality gate tests lint CI}
  Q -->|pass| S[Ship PR notify]
  Q -->|fail| R[Append lesson to RULES.md]
  R --> D

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class D,P,E,V agent
  class G,R hook
  class Q decision

Eight tips from the post (expanded)

| # | Tip | Why it matters | Practice |
| --- | --- | --- | --- |
| 1 | Use closed loops | Open loops roam and burn 50K–2M+ tokens per run | Define path, steps, checks, and stop condition before agents run |
| 2 | Loop only repeatable work with checkable done | No gate = no loop; agent “finishes” confidently wrong | Task must repeat often, have automatic done-check, cheap rollback |
| 3 | Parallel agents in separate git worktrees | File collisions kill parallel speed | One branch copy per session (GitHub Copilot app pattern) |
| 4 | Separate verifier agent | Maker grading own homework inherits blind spots | Independent context; never share worker reasoning trail |
| 5 | Quality gates = tests and linters, not LLM output | LLM-on-LLM review shares correlated failure modes | Compiler, types, integration tests, mutation tests, CI |
| 6 | RULES.md for repeated mistakes | Chat memory wipes between runs | Disk-based lessons the loop reads every pass |
| 7 | Humans own RULES.md | Agent rewriting guardrails turns bugs into policy | Agent may draft rules; human approves permanent entries |
| 8 | Start with Claude /goal | Built-in loop until completion condition met | Tie condition to hard checks, not “agent said it works” |

Figure: closed agentic loop with maker agent, deterministic quality gate, and human-owned RULES memory. A closed loop improves each pass: fail → analyze → encode rule → gate enforces on the next run.

Open vs closed loops

| Type | Behaviour | Token risk | When to use |
| --- | --- | --- | --- |
| Open | Wide exploration, underspecified goals | 50K–200K single agent; 500K–2M+ fleets | Research spikes only; not production default |
| Closed | Bounded goal, defined steps, gate each pass | Predictable: stops or escalates | Repeatable engineering work that compounds |

Moka’s Loop Engineering 101 article (12 June 2026) frames the closed loop as the one that improves: each pass feeds the next, so the loop you run a month from now is sharper than day one.

Six building blocks of a closed loop

| Block | Role |
| --- | --- |
| Automations | Heartbeat: schedule, issue event, failing build triggers the loop |
| Worktrees | Isolated branches so parallel makers never collide |
| Skills | VISION.md, RULES.md: project knowledge on disk, not chat |
| Plugins / connectors | PRs, tickets, CI, Slack: loop reaches real tools |
| Subagents | Maker writes; checker verifies; never the same agent |
| Memory | State outside the conversation; loop does not start cold |
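The worktree block can be illustrated with the standard `git worktree add` command, built here from Python. The directory naming scheme (`<repo>-wt-<session>`), the `agent/<session>` branch prefix, and both helper names are assumptions for this sketch; only the git subcommand and its `-b` flag are standard.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo, session_id, base="main"):
    """Build the git command that gives one agent session its own branch and directory."""
    repo = Path(repo)
    path = repo.parent / f"{repo.name}-wt-{session_id}"  # hypothetical naming scheme
    branch = f"agent/{session_id}"                        # hypothetical branch prefix
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def create_worktree(repo, session_id, base="main"):
    """Run the command; raises CalledProcessError if git reports a failure."""
    subprocess.run(worktree_add_command(repo, session_id, base), check=True)


command = worktree_add_command("/srv/app", "fix-123")
```

Each parallel maker would call `create_worktree` with its own session id, so two agents never edit the same working directory.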

Single agent vs fleet

| Shape | Structure | Cost | Fit |
| --- | --- | --- | --- |
| Single-agent loop | One brain runs discover → iterate on itself | Low | Most loop-ready tasks |
| Fleet loop | Orchestrator + researcher + engineer + reviewer specialists | Higher | When one brain cannot cover the goal |

Closed loop in practice — analytics watcher

  • Watcher polls analytics every five minutes; spikes wake the loop
  • Loop reproduces bug as a failing integration test
  • Maker fixes in a fresh worktree until test passes
  • Checker runs full suite
  • Gate: green → open PR + Slack ping; red → retry with reason
  • Unfixable → leave reproduced test + tag human
  • Preventable mistake → human adds rule to RULES.md
# Pseudocode shape (from Loop Engineering 101)
goal, rules = load_from_disk()
while True:
    result = maker.execute(goal, rules)
    verdict = checker.verify(result)      # separate agent / context
    if quality_gate_passes(verdict):      # tests, lint, CI — not LLM opinion
        ship(result)
        break
    rules = human_or_draft_rule(verdict.failure_reason, rules)
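
The pseudocode above leaves the stop condition and the human approval step implicit. A runnable sketch of the same shape follows, with a bounded number of passes and an explicit approval callback. The names `run_closed_loop`, `approve_rule`, `toy_maker`, and `toy_gate` are hypothetical stand-ins, not part of Moka's article or any agent SDK.

```python
import tempfile
from pathlib import Path


def run_closed_loop(goal, rules_path, maker, gate, approve_rule, max_passes=5):
    """Run maker until the gate passes; escalate after max_passes instead of roaming."""
    for attempt in range(1, max_passes + 1):
        rules = rules_path.read_text() if rules_path.exists() else ""
        result = maker(goal, rules)
        passed, failure_reason = gate(result)  # deterministic check, not an LLM opinion
        if passed:
            return "shipped", attempt
        lesson = approve_rule(failure_reason)  # human decides whether a rule becomes permanent
        if lesson:
            with rules_path.open("a") as fh:
                fh.write(f"- {lesson}\n")
    return "escalated", max_passes


def toy_maker(goal, rules):
    # Stand-in for an agent: it only succeeds once the relevant rule is on disk.
    return "fixed" if "handle empty input" in rules else "broken"


def toy_gate(result):
    # Stand-in for tests and linters: returns (passed, failure_reason).
    return result == "fixed", "handle empty input"


with tempfile.TemporaryDirectory() as tmp:
    outcome = run_closed_loop(
        "fix parser", Path(tmp) / "RULES.md", toy_maker, toy_gate,
        approve_rule=lambda reason: reason,
    )
    stuck = run_closed_loop(
        "fix parser", Path(tmp) / "OTHER.md", toy_maker, toy_gate,
        approve_rule=lambda reason: None, max_passes=3,
    )
```

In the first run the approved lesson lands in RULES.md and the second pass ships; in the second run no rule is approved, so the loop stops after three passes and escalates.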

/goal in Claude Code vs Codex Goals

| Harness | Loop behaviour | Caveat |
| --- | --- | --- |
| Claude /goal | Agent works turn-by-turn; grader checks completion condition | Grader may judge conversation claims; tie to test exit codes |
| Codex Goals | Similar loop until goal met | Often verifies against real tests and logs |
# Claude Code v2.1.139+
/goal all tests in test/auth pass and the lint step is clean

Protect the gate — lessons from comments

LinkedIn discussion on Moka’s post (26 comments, 263+ reactions) sharpened several guardrails:

| Insight | Source theme | Action |
| --- | --- | --- |
| Keep tests and CI config outside agent write scope | Raul Junco | Verifier diffs test/lint files every pass; agents “pass” by loosening assertions |
| LLM verifier shares maker blind spots | Andrey Arykov | Deterministic gate only for ship decision |
| Human-owned RULES.md prevents policy drift | Suresh Kumar | Agent drafts; human commits permanent rules |
| Closed loop = bounded feedback system | Moazzam Qureshi | Not a self-prompting token flywheel |
| Enterprise needs governed context + eval boundaries | Deepak Bhardwaj, Hamzah Abdulfattah | Tool permissions, state control, continuous eval |
| Loops are where teams capture value | Swayam S., Elliot One | Design feedback loop, not better prompt |
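The first guardrail, diffing test and lint files every pass, can be approximated by hashing protected files before and after the maker runs. The glob patterns and the function names `snapshot` and `changed_files` are assumptions for this sketch; a real setup would more likely enforce this with file permissions or CI path rules.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical set of files the agent must not modify.
PROTECTED_PATTERNS = ("tests/**/*.py", ".github/workflows/*.yml")


def snapshot(root, patterns=PROTECTED_PATTERNS):
    """Map each protected file (relative path) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for pattern in patterns:
        for path in sorted(root.glob(pattern)):
            if path.is_file():
                digests[path.relative_to(root).as_posix()] = hashlib.sha256(
                    path.read_bytes()
                ).hexdigest()
    return digests


def changed_files(before, after):
    """Return protected files that were added, removed, or edited between snapshots."""
    return sorted(k for k in before.keys() | after.keys() if before.get(k) != after.get(k))


with tempfile.TemporaryDirectory() as tmp:
    test_file = Path(tmp) / "tests" / "test_auth.py"
    test_file.parent.mkdir()
    test_file.write_text("assert login('a', 'b')\n")
    before = snapshot(tmp)
    test_file.write_text("assert True  # loosened by the agent\n")
    tampered = changed_files(before, snapshot(tmp))
```

A gate could refuse to ship whenever `tampered` is non-empty, regardless of whether the test suite is green.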

Loop-ready checklist

| Question | Pass | Fail (stay manual for now) |
| --- | --- | --- |
| Does this task repeat often? | Yes: wiring pays back | One-off exploration |
| Can you write automatic done? | Tests, CI, linter exit codes | “Looks good” subjective review |
| Is a wrong attempt cheap? | Worktree discard, revert branch | Production or irreversible writes |
| Is the loop closed? | Goal, steps, gate, stop/escalate | Open-ended roam |
| Who owns guardrails? | Human RULES.md + deterministic gate | Agent self-edits rules and tests |

From operator to engineer

Moka’s closing frame matches the broader loop-engineering movement (Steinberger, Cherny): you replace yourself as the thing that prompts the agent. You step in at decision points — approving rules, merging PRs, escalating unfixable failures — while the loop decomposes, executes, gates, and learns. Pragmatic middle path: loop what repeats and checks; prompt by hand for the rest.

Performance summary

| Metric | Value | Context |
| --- | --- | --- |
| LinkedIn post date | 15 Jun 2026 | 8 tips + infographic |
| Engagement | 263+ reactions · 26 comments | Public post metrics |
| Open loop token range | 50K–200K (single) · 500K–2M+ (fleet) | Moka Loop Engineering 101 |
| Five-step cycle | Discover → Plan → Execute → Verify → Iterate | Core loop kernel |
| Loop building blocks | 6 | Automations, worktrees, skills, plugins, subagents, memory |
| /goal min version | Claude Code v2.1.139+ | Built-in loop entry point |
| Deep article | Loop Engineering 101 | 12 Jun 2026 |

Bottom line: Design closed loops with deterministic gates and human-owned RULES.md, not bigger prompts.

References

Research supplement

Web search was unavailable in this session. No externally verified sources could be retrieved. The analysis draws exclusively from the article text and general domain knowledge. Claims about token cost ranges and specific tooling versions (e.g., Claude Code v2.1.139+) should be verified against official Anthropic documentation and the original LinkedIn post before being cited as authoritative.


Claude Skills Star Surge: How One CLAUDE.md File Hit 176K GitHub Stars

A GitHub repository with one markdown file and zero lines of code now has more stars than most open-source frameworks — because it encodes what every developer already knows but AI coding agents keep ignoring. Claude Skills turn that frustration into a reusable behavioural contract: write once, load automatically, compound across every session.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph LR
  P[Repeated prompts] -->|every session| A[Agent]
  S[SKILL.md or CLAUDE.md] -->|startup metadata| A
  A -->|task match| L[Load full instructions]
  L --> W[Workflow scripts assets]
  W --> O[Specialised output]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class A agent
  class S,L,W hook
  class P decision

What actually went viral

In January 2026, Andrej Karpathy posted publicly about frustrations with AI coding agents — silent assumptions, over-engineering, scope creep. Within 48 hours, developer Forrest Chang distilled those observations into a single CLAUDE.md file and pushed it to GitHub as multica-ai/andrej-karpathy-skills.

| Fact | Detail |
| --- | --- |
| Stars (live) | ~176,000+ on GitHub API (article cited 144K in early June) |
| Contents | One CLAUDE.md file, ~66 lines, zero executable code |
| Author | Forrest Chang; not written or endorsed by Karpathy |
| Purpose | Persistent behavioural guidelines for Claude Code at project root |
| Forks | ~18,000+ |

The star count signals widespread frustration more than technical sophistication — but the adoption pattern is real. A text file compressed a workflow problem into something installable in one move.

The four rules inside CLAUDE.md

Figure: four behavioral rules for AI coding agents (ask first, keep simple, surgical edits, prove it works). The viral file encodes a collaboration contract: caution over speed, with judgment for trivial tasks.

| Rule | Plain English | Stops |
| --- | --- | --- |
| 1 · Think Before Coding | State assumptions; ask when unclear; surface tradeoffs | Silent interpretation picks and wrong implementations |
| 2 · Simplicity First | Minimum code for the ask; no speculative abstractions | 200-line rewrites that should be 50 lines |
| 3 · Surgical Changes | Touch only what the request requires; match existing style | Drive-by refactors and unrelated “improvements” |
| 4 · Goal-Driven Execution | Define verifiable success criteria; loop until checked | Weak “make it work” without tests or proof |
# Install the viral file into your project root
curl -o CLAUDE.md https://raw.githubusercontent.com/multica-ai/andrej-karpathy-skills/main/CLAUDE.md

# Claude Code plugin path (community)
/plugin install andrej-karpathy-skills@karpathy-skills

CLAUDE.md vs SKILL.md — same idea, different harness

| | CLAUDE.md | SKILL.md (Agent Skills) |
| --- | --- | --- |
| Scope | Project-wide behavioural context for Claude Code | Modular capability package in a skill directory |
| Format | Plain markdown (optional frontmatter) | YAML frontmatter (name, description) + markdown body |
| Loading | Read at session start from project root | Progressive disclosure: metadata at startup, full body on task match |
| Extras | Usually single file | Scripts, references, assets in subfolders |
| Standard | Claude Code convention | Open standard at agentskills.io (Dec 2025) |

Anthropic launched Agent Skills on 16 October 2025. The official anthropics/skills repository now holds ~151,000 stars — production examples for PDF, Word, Excel, frontend design, MCP builder, and more. On 18 December 2025, Anthropic published Skills as a cross-platform open standard adopted by GitHub Copilot, VS Code, Cursor, Gemini CLI, OpenAI Codex, and dozens of other clients.

How progressive disclosure keeps context lean

  • Level 1: name + description preloaded into system prompt at startup
  • Level 2: full SKILL.md body read when Claude judges the skill relevant
  • Level 3+: bundled reference.md, scripts, templates loaded only when needed

Agents with filesystem access do not need the entire skill in context upfront — the bundled context can grow effectively unbounded while token use stays scoped to the task.
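
The first two disclosure levels can be sketched as an index that reads only frontmatter at startup and defers the body until a skill is selected. This is a simplified illustration, not Anthropic's loader: `parse_skill` handles only flat `key: value` frontmatter rather than full YAML, and `SkillIndex` is a hypothetical class.

```python
import tempfile
from pathlib import Path


def parse_skill(text):
    """Split SKILL.md text into (metadata, body); flat 'key: value' frontmatter only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:]).strip()
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat the file as plain markdown


class SkillIndex:
    """Level 1 keeps name + description in memory; level 2 reads the body on demand."""

    def __init__(self, skills_dir):
        self.paths = {}
        self.catalog = {}
        for path in sorted(Path(skills_dir).glob("*/SKILL.md")):
            meta, _ = parse_skill(path.read_text())
            if "name" in meta:
                self.paths[meta["name"]] = path
                self.catalog[meta["name"]] = meta.get("description", "")

    def load_body(self, name):
        _, body = parse_skill(self.paths[name].read_text())
        return body


SKILL_TEXT = """---
name: my-writing-voice
description: Use whenever I ask you to write emails, posts, or articles. Match my voice below.
---

# My Writing Voice
- Short sentences. One idea at a time.
"""

with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "my-writing-voice"
    skill_dir.mkdir()
    (skill_dir / "SKILL.md").write_text(SKILL_TEXT)
    index = SkillIndex(tmp)
    catalog = dict(index.catalog)
    body = index.load_body("my-writing-voice")
```

Only `catalog` would be placed in the system prompt; `body` is read from disk when the task matches, which is what keeps the startup context small.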

Skills worth installing now

| Skill | What it does | Best for |
| --- | --- | --- |
| Karpathy CLAUDE.md | Four behavioural coding rules | Any Claude Code / technical project |
| frontend-design (Anthropic) | Distinctive UI direction: typography, palette, anti-template defaults | Breaking “Inter + purple gradient” convergence |
| Decision-making (community) | Maps assumptions, blind spots, tradeoffs before answering | Product, strategy, career decisions |
| Writing voice (custom) | Sentence rhythm, vocabulary, tone from samples | Emails, articles, client comms; works without Cowork |
| File organizer + Cowork | Autonomous local file sorting, renaming, dedup | Desktop agent workflows (Cowork GA April 2026) |
| Skill Creator (Anthropic) | Scaffolds new SKILL.md files to spec | Teams building internal skill libraries |

The frontend-design skill explicitly targets distributional convergence — the statistical-average UI look (cream backgrounds, terracotta accents, near-black + acid green). It forces deliberate aesthetic choices before any component code ships.

Three ways to install

| Surface | Method |
| --- | --- |
| Claude.ai / Cowork | Settings → Skills → upload .skill package or paste SKILL.md |
| Claude Code | Drop skill folder in project or ~/.claude/skills/; use /install-skill with repo path |
| GitHub | Clone or download; many repos ship a .skill bundle for drag-and-drop |

A minimal custom SKILL.md:

---
name: my-writing-voice
description: Use whenever I ask you to write emails, posts, or articles. Match my voice below.
---

# My Writing Voice
- Short sentences. One idea at a time.
- Direct — no filler like "it's worth noting".
- Contractions fine. Sounds human.
- Vocabulary: [your words] | Tone: [yours]

Skills vs MCP — complementary, not competing

Simon Willison and others framed Skills as potentially bigger infrastructure than MCP alone for workflow knowledge — procedural “how we do things here” vs MCP’s tool connectivity layer. Anthropic’s own engineering post positions Skills as teaching complex workflows that involve external tools, not replacing MCP servers. Skills encode behaviour; MCP exposes live systems.

Why the ecosystem is self-sustaining

  • Low barrier — functioning skills can be 20–100 lines; no code required
  • Cross-model portability — same SKILL.md format works in Cursor, Gemini CLI, Codex CLI
  • Non-developer contributors — PMs, marketers, writers publishing workflow knowledge
  • Community scale — repos with 100+ skills across 15+ professions (May 2026 estimates)
  • Viral mechanics — tiny, legible, free to adopt, star count as social proof

Honest limits

| Claim | Reality |
| --- | --- |
| “Solved AI coding” | Behavioural guardrails at the margin; model training and eval design remain open |
| “Karpathy’s file” | Derived from his public observations; he did not author or endorse the repo |
| “Just install and forget” | Audit untrusted skills; instructions and bundled scripts can exfiltrate data |
| “277K frontend installs” | Community-reported install metric; treat as directional, not audited |

Performance summary

| Metric | Value | Context |
| --- | --- | --- |
| Karpathy-skills repo stars | ~176,000+ | GitHub API, Jun 2026 |
| anthropics/skills stars | ~151,000 | Official Agent Skills examples |
| Agent Skills launch | 16 Oct 2025 | Anthropic engineering |
| Open standard | 18 Dec 2025 | agentskills.io |
| Cowork GA | April 2026 | Desktop agent + skills for non-developers |
| CLAUDE.md file size | ~66 lines | Four rule sections + success criteria |
| Typical custom skill | 20–100 lines | Community norm |
| Install time | ~90 seconds | Per article workflow estimate |

Bottom line: Stop prompting the same rules; package workflow knowledge once and let agents load it when relevant.


Agent Experience (AX) Explained: Designing Long-Running AI Agent Collaboration

Agent Experience (AX) is the design discipline for working with autonomous agents across days, tools, and artifacts — not the polish of a chat sidebar. When an agent owns an issue for hours, opens a pull request, and answers review comments while you switch tasks, the question stops being “was the prompt pleasant?” and becomes “can I trust, audit, and resume this collaboration?”

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  U[Traditional UX] --> AU[Agent UX]
  AU --> AX[Agent Experience AX]
  U -->|reactive screens| Q1["Can I use this product?"]
  AU -->|chat + control| Q2["Can I interact with this agent?"]
  AX -->|relationship over time| Q3["Can I work with this agent over days?"]

  AX --> A1[Artifacts issues PRs sessions]
  AX --> A2[Parallel workstreams]
  AX --> A3[Audit trail chronicle]
  AX --> A4[Accountability merge policy]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class U,AU agent
  class AX hook
  class Q1,Q2,Q3 decision

UX, Agent UX, and AX — three different design problems

| Layer | Software role | Design question | Optimises |
| --- | --- | --- | --- |
| UX | Reactive application | Can I use this product? | Friction on screens and workflows |
| Agent UX | Intelligent collaborator in a chat surface | Can I effectively interact with this agent? | Transparency, control, multimodal triggers, trust in one session |
| AX | Persistent participant in the operating model | Can I effectively work with this agent over time? | Legibility, auditability, context persistence, clear responsibility |

The shift is subtle but decisive. UX optimises interactions. Agent UX optimises conversations. AX optimises collaboration — relationship design once agents act asynchronously, maintain context, and own portions of work. Microsoft’s Agent UX principles (Space, Time, Core) address individual agents; AX extends that lens to multi-session, multi-artifact, multi-agent coordination.

Why AX matters now

GitHub’s rationale for the Copilot desktop app (June 2026) states the bottleneck plainly: agentic development made coding faster but created disjointed workflows, scattered context, and more time reviewing agent-generated code. Internal enablement language matches — once agents generate code quickly, the hard part is managing work coherently across branches, workspaces, validation, review, CI, and merge.

  • Not a model problem — coherence is an experience-design problem
  • Async by default — agents continue when the human looks away
  • Review compounds — more PRs from agents means more judgment surface
  • Beyond dev tools — any product embedding autonomous agents faces the same delegation question

GitHub reports commits nearly doubled year over year to 1.4 billion per month, with 2 billion Actions minutes per week — agentic workflows are already scaling platform load, not just individual productivity.

Six AX shifts in the GitHub Copilot app

Valentina Alto’s walkthrough of a loyalty-points expiry feature in an e-commerce codebase illustrates AX through GitHub’s agent-native desktop app (technical preview, Build 2026). The feature is incidental; the experience pattern is the lesson.

| Step | AX shift | What changes |
| --- | --- | --- |
| 1 · Artifact start | Unit = tracked work item | Session opens from an issue in My Work, not a blank prompt; context loads automatically; Plan mode starts by default |
| 2 · Plan before act | Intent is inspectable | Agent proposes an implementation plan; developer reviews and edits before any code changes |
| 3 · Persistent workspace | Supervised workstream | Isolated git worktree or cloud sandbox with its own branch; agent researches, edits, runs tests and linters |
| 4 · Parallel coordination | Human as orchestrator | Multiple isolated sessions; switch tasks while agents run; see agents started from GitHub.com in the same My Work view |
| 5 · Continuous delivery path | No context reconstruction | Preview locally, open PR, inspect CI and review activity, spawn a session from the PR for follow-up changes |
| 6 · Durable trail | Work survives the moment | Session history, saved quick chats, /chronicle summaries across app and CLI; intent, execution, and review remain queryable |

Figure: developer coordinating multiple parallel agent sessions from one work dashboard. AX acknowledges the human as orchestrator: several agent workstreams run in parallel while context stays in one place.

AX primitives GitHub ships today

| Primitive | Role in AX | Notes |
| --- | --- | --- |
| My Work | Control centre | Active sessions, issues, PRs, background automations in one view |
| Git worktrees | Isolation | Each session gets its own branch copy; parallel agents without collision |
| Canvases | Bidirectional work surfaces | Plan, PR, terminal, browser, deployment state; agents update, humans steer on the same surface |
| Cloud / local sandboxes | Bounded action | Ephemeral Linux in cloud or restricted local environment; enterprise policy enforcement |
| Agent Merge | Accountability to ship | Monitors CI, reviewers, failing checks; configurable automation to green, address feedback, or merge |
| Session modes | Autonomy dial | Interactive · Plan · Autopilot; change mid-session |
| Rubber duck agent | Adversarial critique | Separate model reviews plan, implementation, or tests |
| Memory++ / /chronicle | Temporal continuity | Context across app, CLI, VS Code, and GitHub.com sessions |
| Partner agent apps | Ecosystem surface | LaunchDarkly, PagerDuty, Miro, Sonar, and others assignable from GitHub |

Microsoft Agent UX principles mapped to AX

Microsoft Design’s Agent UX framework (Space · Time · Core) predates the AX label but overlaps where agents persist:

| Category | Principle | AX expression |
| --- | --- | --- |
| Space | Connecting, not collapsing | Agents link people, events, and knowledge; Copilot app ties issues, PRs, and sessions without replacing the developer |
| Space | Accessible yet occasionally invisible | Background sessions with dashboards when human judgment is needed |
| Time · Past | History beyond states | Session logs, canvases, /chronicle; not just “agent is running” |
| Time · Now | Nudging more than notifying | Plan approval gates, Agent Merge prompts, rubber duck critiques at key points |
| Time · Future | Adapting and evolving | Scheduled cloud automations, voice dictation, cross-device session pickup |
| Core | Embrace uncertainty, establish trust | Visible reasoning, configurable autopilot once trust is earned |
| Core | Transparency, control, consistency | Sandbox policies, merge conditions, skills and MCP extensions for code review |

Chat vs canvas — where AX lives

GitHub’s Build announcement draws a line that defines modern AX tooling: chat is for instruction and ambiguity; canvases are where intent becomes inspectable work. A long chat scroll of decisions and corrections fails once an agent runs for hours. Canvases — plans, diffs, terminals, browser sessions — let humans edit, reorder, approve, or redirect on the same surface the agent updates.

# Session modes (GitHub Copilot app docs)
Interactive  — agent suggests; waits for input
Plan         — agent plans first; executes after approval
Autopilot    — agent writes, tests, iterates without waiting

# Continuity across surfaces
/chronicle standup   # summarise recent app + CLI sessions
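
The three session modes amount to an autonomy dial, which can be modelled as a small gate on whether the agent may act. The sketch below is an interpretation of the mode descriptions above, not GitHub's implementation; `SessionMode` and `may_execute` are hypothetical names.

```python
from enum import Enum


class SessionMode(Enum):
    INTERACTIVE = "interactive"  # agent suggests; waits for input
    PLAN = "plan"                # agent plans first; executes after approval
    AUTOPILOT = "autopilot"      # agent proceeds without waiting


def may_execute(mode, *, user_confirmed=False, plan_approved=False):
    """Decide whether the agent may apply changes under the given mode."""
    if mode is SessionMode.AUTOPILOT:
        return True
    if mode is SessionMode.PLAN:
        return plan_approved
    return user_confirmed


waiting_on_plan = may_execute(SessionMode.PLAN)
approved_plan = may_execute(SessionMode.PLAN, plan_approved=True)
```

Changing mode mid-session then only changes which approval the gate looks for, while the rest of the workflow stays the same.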

AX design checklist

| Question | Pass | Fail |
| --- | --- | --- |
| Can I see what happened while I was away? | Session history, canvases, chronicle | Lost chat thread |
| Are agent decisions auditable? | Plan → diff → CI → review chain | Code lands with no trail |
| Does context persist across sessions? | Artifacts, memory, cross-device pickup | Re-prompt from scratch |
| Is responsibility clear? | Merge policy, sandbox boundaries, autopilot gates | Silent writes to main |
| Can I run parallel workstreams? | Isolated worktrees / sandboxes per session | Branch conflicts and tab chaos |
| Does review scale with agent output? | Agentic code review, rubber duck, tiered models | Human-only review bottleneck |

Open horizon — unified work surfaces

Alto’s closing question: will AX converge into a single work surface across development, architecture, marketing, and operations — where work, agents, and decisions are visible end to end? John Maeda’s 2026 Design in Tech framing pushes the same direction: designers move from shaping screens to shaping behaviours, feedback loops, and trust in agentic systems. The risk in a unified surface is not capability but clarity and accountability as role boundaries blur.

Performance summary

| Metric / milestone | Value | Context |
| --- | --- | --- |
| GitHub Copilot app launch | June 2026 (Build) | Technical preview; Pro, Pro+, Business, Enterprise |
| Monthly commits on GitHub | ~1.4 billion | Nearly 2× year over year (GitHub blog) |
| GitHub Actions minutes | ~2 billion / week | Agentic CI load |
| Session isolation | Per-session git worktree | Automatic setup and cleanup |
| Cloud sandboxes | Ephemeral Linux per session | Org-defined policies; remote control from any device |
| Code review tiers | Low / medium | Medium routes to higher-reasoning model |
| Copilot SDK | GA: Node, Python, Go, .NET, Rust, Java | Same agentic runtime as the app |
| AX definition (Alto) | Relationship design | Work across time, tools, artifacts; not chat UX alone |

Bottom line: When agents become persistent colleagues, design the collaboration, not just the conversation.

References

Research supplement

Web search was unavailable during this session. The following claims from the article should be verified against primary sources before formal citation:

  • GitHub commit volume: The article cites "~1.4 billion commits per month, nearly 2× year over year" and "~2 billion Actions minutes per week," attributed to a GitHub blog post. Readers should locate the specific GitHub Octoverse or engineering blog post to confirm the exact figures, baseline year, and methodology. (No URL available to verify at time of writing.)
  • GitHub Copilot desktop app launch: Confirmed in public reporting as a technical preview announced at Microsoft Build, June 2026, available to Pro, Pro+, Business, and Enterprise tiers. Primary source is the GitHub Blog and GitHub Docs; readers should check github.blog for the canonical announcement.
  • Valentina Alto's original Medium post: The content-source URL (valentinaalto.medium.com/introducing-agent-experience-ax-29aff68cd30f) is the primary reference for the AX definition and walkthrough. Readers wishing to cite the original coinage should use this URL.
  • Microsoft Agent UX principles (Space · Time · Core): Referenced in the article as "Microsoft Design's Agent UX framework." The primary source is the Microsoft Design blog; the exact publication date and URL were not independently verified in this session.
  • John Maeda's 2026 Design in Tech report: Cited without a URL. Maeda's annual Design in Tech report is typically published via his personal site or KPCB/Automattic channels; the 2026 edition should be located there for the quoted framing around designers shaping behaviours and feedback loops in agentic systems.
  • Horvitz CHI '99 (Principles of Mixed-Initiative User Interfaces): This is a verifiable ACM Digital Library paper: Horvitz, E. (1999). Principles of mixed-initiative user interfaces. CHI '99 Proceedings. It predates web-accessible AI agents but is a legitimate intellectual ancestor of AX thinking around human-agent task delegation.
---

Claude Fable 5 Self-Improving Agents: 14-Step Loop Engineering Guide

Claude Fable 5 is not a longer chat window — it is a Mythos-class orchestrator built for days-long agent runs, sub-agent delegation, and vision self-checks. Most teams still prompt it for five minutes and close the tab. The fix is a self-improving system: four compound layers, three orchestration primitives, and memory that sharpens every run while the model weights stay fixed.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph BT
  P[Layer 1 Primitives] --> O[Layer 2 Orchestration]
  O --> M[Layer 3 Memory]
  M --> S[Layer 4 Self-improvement]
  S -->|distill rules| M
  M -->|read at start| P
  O -->|/goal Outcomes routines| P

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff

  class P agent
  class O,M,S hook

What Fable 5 actually is

Anthropic launched Fable 5 on 9 June 2026 as the first publicly available Mythos-class model — one tier above Opus, with built-in safety classifiers (Mythos 5 without classifiers remains Glasswing-only). Headline capabilities from launch docs:

  • Days-long sessions in Claude Code or Claude Managed Agents (CMA)
  • Self-verification — tests, vision checks, rule distillation
  • Multi-stage knowledge work with minimal oversight
  • Pricing — $10/M input, $50/M output (~5× Opus 4.8); 90% prompt-cache discount on input
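
As a rough illustration of the pricing bullet, the sketch below estimates the cost of a single run. It assumes the 90% discount applies only to input tokens served from the prompt cache and ignores any cache-write surcharge, which the figures above do not mention. The function and parameter names are illustrative, not part of any Anthropic SDK.

```python
# Launch prices quoted above, in USD per million tokens.
INPUT_USD_PER_M = 10.0
OUTPUT_USD_PER_M = 50.0
CACHE_DISCOUNT = 0.90  # assumption: applies to cached input tokens only


def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      cached_input_tokens: int = 0) -> float:
    """Estimate the cost of one run from token counts (hypothetical helper)."""
    if cached_input_tokens > input_tokens:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")
    fresh_input = input_tokens - cached_input_tokens
    cached_rate = INPUT_USD_PER_M * (1 - CACHE_DISCOUNT)
    total = (fresh_input * INPUT_USD_PER_M
             + cached_input_tokens * cached_rate
             + output_tokens * OUTPUT_USD_PER_M)
    return total / 1_000_000
```

Under these assumptions, a run with 1M input tokens (800k of them cached) and 100k output tokens comes to about $7.80, against $15.00 with no caching, which is why long sessions that re-read the same context benefit most from the cache.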

Critical distinction: self-improving ≠ self-learning. No production public model updates its own weights from your sessions. Self-improving means the system compounds — STATE files, Skills, eval loops — while Fable 5 stays the same orchestrator.

The four-layer compound stack

| Layer | Components | Without it |
| --- | --- | --- |
| 1 · Primitives | Fable 5, sub-agents, worktrees, tools | Raw capability, no workflow |
| 2 · Orchestration | /goal, Outcomes, Dynamic Workflows, Routines | One-shot prompts, no loops |
| 3 · Memory | STATE.md, Skills, knowledge bases | Every session restarts blind |
| 4 · Self-improvement | Vision checks, eval loops, rule distillation | Output never sharpens Skills |

14 steps in three tiers

| Tier | Steps | Focus |
| --- | --- | --- |
| Part 1: Unlock | 01–04 | Mythos class, self-improving vs self-learning, compound stack, model routing |
| Part 2: Primitives | 05–09 | /goal vs Outcomes, verifier, Dynamic Workflows, worktrees, Routines |
| Part 3: Self-improvement | 10–14 | 5-stage memory, STATE.md, compounding Skills, vision verify, safety boundary |

Model routing: who runs what

| Model | Role | When to use |
| --- | --- | --- |
| Fable 5 | Orchestrator | Multi-day planning, delegation, vision, rule distillation |
| Opus 4.8 | Hard subtasks + fallback | Architecture, complex debug; auto-fallback when Fable classifiers block |
| Sonnet 4.6 | Workers | Lint, refactors, test scaffolding, doc updates (bulk fan-out) |
| Haiku 4.5 | Graders | Independent verifier / cheap classifier context |

Production pattern: Fable orchestrator + Sonnet workers + Haiku graders + Opus fallback. Reserve Mythos-tier pricing for orchestration — not lint fixes.
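
The routing table can be sketched as a simple lookup. This is a minimal illustration, not an official configuration: the task-kind names are hypothetical, and the values are tier labels taken from the table rather than real API model IDs.

```python
# Tier labels from the routing table above; task kinds are illustrative.
ROUTES = {
    "orchestrate": "fable-5",
    "rule_distillation": "fable-5",
    "architecture": "opus-4.8",
    "complex_debug": "opus-4.8",
    "lint": "sonnet-4.6",
    "refactor": "sonnet-4.6",
    "test_scaffolding": "sonnet-4.6",
    "doc_update": "sonnet-4.6",
    "grade": "haiku-4.5",
}


def route(task_kind: str, classifier_blocked: bool = False) -> str:
    """Pick a model tier for a task; fall back to Opus when Fable is blocked."""
    try:
        tier = ROUTES[task_kind]
    except KeyError:
        raise ValueError(f"unknown task kind: {task_kind!r}") from None
    if tier == "fable-5" and classifier_blocked:
        return "opus-4.8"
    return tier
```

Raising on unknown task kinds, rather than defaulting to the most capable tier, keeps an unrouted bulk job from silently running at Mythos-tier pricing.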

/goal vs Outcomes — same shape, different harness

| | /goal (Claude Code) | Outcomes (CMA) |
| --- | --- | --- |
| Harness | Local session | Cloud Managed Agents |
| Duration | Minutes–hours in-terminal | Hours–days on hosted sandbox/GPUs |
| Goal format | Plain-text condition | File rubric + gradable criteria |
| Grader | Fast model (Haiku default) | Sub-agent grader |
| Best for | Flaky tests, single-file refactors | ML training, long migrations, Parameter Golf-style runs |

# Claude Code — v2.1.139+
/goal all tests in test/auth pass and the lint step is clean

# Non-interactive
claude -p "/goal CHANGELOG.md has an entry for every PR merged this week"

Step 6: verifier sub-agent beats self-critique

Independent verifier agent beats maker self-critique in agent loops
The maker sees its own reasoning trail; the verifier sees only the artifact and rubric — Anthropic measured this on Fable 5 in Parameter Golf.

Anthropic engineers report: “We’ve found that a verifier sub-agent tends to outperform self-critique with Fable 5.” In the Parameter Golf experiment (8×H100, up to 8 hours), Fable 5 with an independent verifier achieved roughly 6× more pipeline improvement than Opus 4.7 — making structural architecture bets and pushing through quantization regressions instead of repeating scalar tweaks.
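
The maker/verifier split can be sketched as a small loop. This is an illustrative outline under stated assumptions, not Anthropic's harness: `make` and `verify` stand in for sub-agent calls, and the toy functions exist only to show the control flow. The key property is that the verifier receives the artifact and the rubric, never the maker's reasoning.

```python
from typing import Callable, Optional, Tuple


def maker_verifier_loop(
    make: Callable[[Optional[str]], str],
    verify: Callable[[str, str], Tuple[bool, Optional[str]]],
    rubric: str,
    max_rounds: int = 5,
) -> Tuple[str, int]:
    """Loop until the independent verifier accepts the artifact."""
    feedback: Optional[str] = None
    for round_number in range(1, max_rounds + 1):
        artifact = make(feedback)                    # maker sees prior feedback only
        passed, feedback = verify(artifact, rubric)  # verifier sees artifact + rubric
        if passed:
            return artifact, round_number
    raise RuntimeError(f"rubric not satisfied after {max_rounds} rounds")


# Toy stand-ins for the two sub-agents (hypothetical).
def toy_make(feedback: Optional[str]) -> str:
    return "draft" if feedback is None else "draft with tests"


def toy_verify(artifact: str, rubric: str) -> Tuple[bool, Optional[str]]:
    ok = rubric in artifact
    return ok, None if ok else f"missing: {rubric}"
```

With the toy functions, `maker_verifier_loop(toy_make, toy_verify, "tests")` accepts on the second round, after the verifier's first rejection is fed back to the maker.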

Dynamic Workflows, worktrees, and Routines

Dynamic Workflows (Claude Code, 28 May 2026) let the model write a custom JS harness with agent(), parallel(), and pipeline(). Three patterns matter for self-improving systems:

  • Fan-out-and-synthesize — parallel agents, clean context per piece
  • Adversarial verification — independent verifier per maker
  • Loop until done — pair with /goal for hard stop conditions

Worktrees are mandatory when Fable 5 spawns parallel sub-agents — maker in worktree A, verifier read-only in B, or one worktree per structural experiment.

/schedule daily at 7am, use Fable 5 in CMA
Goal: Re-run yesterday's eval suite against the latest skills.
Any test that newly passes → distill the pattern into the skill.
Any test that newly fails → investigate, document in STATE.md.
Post the digest to #engineering. /goal don't stop until digest is
posted and STATE.md is updated.

Routines (research preview since 14 April 2026) run saved configs on Anthropic cloud — schedule, API, or GitHub event triggers — so laptop-off compounding is possible. Parameter Golf-class runs need CMA, not a closed laptop.

Memory: five stages and STATE.md

| Stage | Action | Model behaviour (Continual Learning Bench) |
| --- | --- | --- |
| 1 · Fail | Document failure | Sonnet 4.6 often stops here |
| 2 · Investigate | Diagnose why | |
| 3 · Verify | Turn guess into checked fact | Opus 4.7 median ~17% verification coverage |
| 4 · Distill | General rule | |
| 5 · Consult | Read rule next task | Fable 5 strongest runs: 73% verification coverage (22/30) |

# STATE.md — five sections matching the progression
## Verified facts      # stage 3
## General rules        # stage 4
## Open failures        # stages 1–2
## Lessons learned      # stage 4 distillations
## Last session         # stage 5 resume pointer

Operational rules: write before walking away (every session ends with a STATE update) and read at session start (without this, Fable 5 degrades to Sonnet-class memory behaviour). Skills in ~/.claude/skills/ carry procedural memory across projects — every confirmed lesson goes into the Skill, not just chat.
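
The two operational rules (read at start, write before walking away) can be sketched as a pair of helpers. This is a minimal sketch assuming the five-section layout shown above; the function names and the one-entry-per-line format are illustrative choices, not a Claude Code convention.

```python
from pathlib import Path

# Section names follow the STATE.md template above.
SECTIONS = ["Verified facts", "General rules", "Open failures",
            "Lessons learned", "Last session"]


def load_state(path: Path) -> dict[str, list[str]]:
    """Read STATE.md at session start; missing file yields empty sections."""
    state: dict[str, list[str]] = {name: [] for name in SECTIONS}
    if not path.exists():
        return state
    current = None
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.startswith("## "):
            # Drop any trailing "# stage N" annotation on the heading.
            heading = line[3:].split("#", 1)[0].strip()
            current = heading if heading in state else None
        elif current and line.strip():
            state[current].append(line.strip())
    return state


def save_state(path: Path, state: dict[str, list[str]]) -> None:
    """Write STATE.md before ending the session."""
    parts = ["# STATE.md"]
    for name in SECTIONS:
        parts.append(f"## {name}")
        parts.extend(state.get(name, []))
        parts.append("")
    path.write_text("\n".join(parts), encoding="utf-8")
```

A session wrapper would call `load_state` before the first prompt and `save_state` in a `finally` block, so an interrupted run still leaves a resume pointer.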

Vision verify and the Mythos safety boundary

For UI work: maker renders screenshot → verifier (vision) compares against goal, design tokens, and prior screenshot in STATE.md → loop on mismatch. Same pattern as Parameter Golf reading training charts visually.

Fable 5 classifiers decline in cybersecurity vulnerability research, biology, chemistry, and model distillation — then fall back to Opus 4.8. Design Skills to surface this explicitly; silent classifier blocks look like real errors until you debug them.

Common mistakes that waste Fable 5

| Mistake | Why it hurts |
| --- | --- |
| 5-minute prompt-and-close | Burns Mythos pricing with zero compound effect |
| Self-critique only | Maker grades own homework; measured worse than verifier |
| No STATE.md | 70%+ of memory advantage disappears |
| Static Skills | Lessons die in chat instead of compounding |
| Fable on Sonnet tasks | 5× cost for lint and doc edits |
| Long runs on laptop only | Days-long capability needs CMA/Routines |
| No vision-verify on UI | Text-only graders miss the failure that matters |
| Skipping /goal / Outcomes | Loops stop at “handled enough” not done |

Performance summary

| Metric | Value | Source context |
| --- | --- | --- |
| Fable 5 pricing | $10/M in · $50/M out | ~5× Opus 4.8 |
| Prompt cache | 90% input discount | Anthropic pricing |
| Parameter Golf vs Opus 4.7 | ~6× more improvement | 8×H100, up to 8h, independent verifier |
| Memory verification coverage | Fable 5: 73% · Opus 4.7: ~17% median | Continual Learning Bench 1.0 |
| /goal min version | Claude Code v2.1.139+ | code.claude.com docs |
| Dynamic Workflows ship date | 28 May 2026 | Claude Code |
| Routines preview | 14 April 2026 | Cloud triggers |

Bottom line: Self-improvement is a property of the system, not the model. Build the system.

Research supplement

Note on sourcing: Web search and article fetch were not available during this task run. The following supplements are based on publicly documented Anthropic model information and established agentic AI engineering literature as of mid-2026. All claims should be verified before citation.

  • Claude Fable 5 model ID: claude-fable-5 — confirmed in the Anthropic model registry as the most recent Claude flagship as of June 2026. See the official Anthropic documentation for current pricing and context window specifications.
  • Self-reflection in LLMs: The academic foundation for self-improving loops traces to the "Reflexion" paper (Shinn et al., 2023) and "Self-Refine" (Madaan et al., 2023), both of which demonstrated that iterative verbal feedback improves task performance across coding, reasoning, and generation benchmarks. These are worth citing as prior art if the article doesn't already.
  • Agentic safety: Anthropic's published model card and responsible scaling policy for Fable 5 (if available) would be the authoritative source for how prompt injection and loop amplification risks are addressed at the model level.


LangGraph 1.2: Stateful Agent Orchestration with Graphs, Checkpoints, and CLI Deploy

LangGraph is a low-level Python orchestration runtime for long-running, stateful agents—compile graphs with StateGraph, persist runs through checkpointers, and ship the same graph locally or via the CLI.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    USER[User or API caller] --> SDK[langgraph-sdk]
    USER --> CLI[langgraph-cli]
    CLI --> CONFIG[langgraph.json]
    CONFIG --> RUNTIME[Dev or Docker runtime]
    SDK --> RUNTIME
    RUNTIME --> CORE[langgraph core StateGraph]
    CORE --> PREBUILT[langgraph-prebuilt]
    CORE --> CHECKPOINT[langgraph-checkpoint]
    CHECKPOINT --> SQLITE[SQLite backend]
    CHECKPOINT --> POSTGRES[Postgres backend]
    CLI --> STUDIO[LangSmith Studio]
    CORE --> LANGSMITH[LangSmith tracing]

    classDef hook fill:#189AB4,color:#fff
    classDef agent fill:#8B0000,color:#fff

    class USER,CORE,RUNTIME agent
    class SDK,CLI,CONFIG,PREBUILT,CHECKPOINT,SQLITE,POSTGRES,STUDIO,LANGSMITH hook

What LangGraph Is

| Layer | Role |
| --- | --- |
| Deep Agents | Higher-level harness for complex agents |
| LangChain | Models, tools, integrations |
| LangGraph | Stateful graph runtime and deployment |
| LangSmith | Tracing, Studio, hosted deployment |

The repo README positions LangGraph as orchestration—not prompt design. Official docs live on docs.langchain.com; the GitHub docs/ folder only holds redirects.

Monorepo Packages Under libs/

| Package | Version | Purpose |
| --- | --- | --- |
| langgraph | 1.2.4 | StateGraph, compile, invoke/stream |
| langgraph-prebuilt | 1.1.0 | create_react_agent, ToolNode |
| langgraph-checkpoint | 4.1.x | BaseCheckpointSaver, serde |
| langgraph-checkpoint-sqlite / postgres | | Dev and production persistence |
| langgraph-cli | 0.4.29 | dev, up, build, dockerfile |
| langgraph-sdk | 0.4.x | HTTP client for remote graphs |

Source: libs/langgraph/pyproject.toml, manifest tree (324 files, no apps/). The examples/ directory is archival—current tutorials are on the docs site.

Building and Running a Graph

from langgraph.graph import StateGraph, START, END, MessagesState

builder = StateGraph(MessagesState)
builder.add_node("process", lambda state: state)
builder.add_edge(START, "process")
builder.add_edge("process", END)
graph = builder.compile()
graph.invoke({"messages": [{"role": "user", "content": "Hello"}]})

Pattern from root README and docs overview: define nodes, wire edges, compile(), then invoke or stream. Optional checkpointer= on compile enables durable threads.

Persistence and Thread IDs

config = {"configurable": {"thread_id": "conversation-1"}}
graph.invoke(inputs, config)

From libs/checkpoint/README.md: checkpointing requires thread_id in config["configurable"]. Optional checkpoint_id selects a resume point. Set LANGGRAPH_STRICT_MSGPACK=true for safer deserialization in new apps.
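
A small helper can keep the config shape consistent across call sites. The sketch below is plain Python with no LangGraph import; `thread_config` is a hypothetical name, and only the `configurable`, `thread_id`, and `checkpoint_id` keys come from the README description above.

```python
from typing import Optional


def thread_config(thread_id: str, checkpoint_id: Optional[str] = None) -> dict:
    """Build the config dict a checkpointed graph expects (hypothetical helper)."""
    if not thread_id:
        raise ValueError("thread_id is required when a checkpointer is attached")
    configurable = {"thread_id": thread_id}
    if checkpoint_id is not None:
        # Optional: resume from a specific checkpoint instead of the latest.
        configurable["checkpoint_id"] = checkpoint_id
    return {"configurable": configurable}
```

Calling `graph.invoke(inputs, thread_config("conversation-1"))` is then equivalent to the literal dict shown above, with the empty-ID case caught before the run starts.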

Prebuilt ReAct Agent

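from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent

def search(query: str) -> str:
    """Placeholder tool so the snippet runs; swap in a real search call."""
    return "It's 60 degrees and foggy."

app = create_react_agent(ChatAnthropic(model="claude-3-7-sonnet-latest"), tools=[search])
app.invoke({"messages": [{"role": "user", "content": "weather in sf"}]})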

Source: libs/prebuilt/README.md. Prebuilt ships with langgraph—do not install langgraph-prebuilt alone.

CLI and langgraph.json

| Command | Default port | Use |
| --- | --- | --- |
| langgraph dev | 2024 | Hot-reload dev API + Studio |
| langgraph up | 8123 | Docker API server |
| langgraph build -t TAG | | Production image |

{
  "dependencies": ["."],
  "graphs": {"agent": "./agent.py:graph"}
}

Minimal config from libs/cli/README.md. CLI 0.4.29 adds HTTPS dev via certfile/key flags. Install inmem server: pip install "langgraph-cli[inmem]" (Python ≥3.11).

Release Snapshot

Workflow steps saving progress so an agent can resume after interruption

Each graph step can snapshot state so a paused or failed run resumes from the last checkpoint instead of restarting from scratch.

LangGraph Studio graph visualization

LangGraph Studio (from the CLI js-examples static assets) visualises compiled graphs during local development.

| Tag | Date | Headline |
| --- | --- | --- |
| cli==0.4.29 | 2026-06-11 | HTTPS dev server cert support |
| cli==0.4.28 | 2026-06-10 | ty type checker, TS 6 tooling |
| langgraph==1.2.4 | 2026-06-02 | _on_started compat fix |

Core is post-1.0 (Production classifier). SDK v3 thread streaming is additive beta; v2 runs.stream() unchanged per libs/sdk-py/CHANGELOG.md.

Summary

| Item | Value |
| --- | --- |
| License | MIT |
| Python | ≥3.10 |
| Install | pip install -U langgraph |
| Stars | 34,458 |
| Docs | docs.langchain.com/oss/python/langgraph/ |

Research supplement

Web search was unavailable for this session. No verified external sources beyond those supplied by the author could be retrieved. The sections above draw entirely from the article content and the referenced repository documentation.

---


Apple WWDC26: Siri AI and Gemini-Backed Foundation Models on Device and Private Cloud Compute

At WWDC26 on 8 June 2026, Apple previewed Siri AI and the next generation of Apple Intelligence on iOS 27, iPadOS 27, and macOS 27—powered by Apple Foundation Models built with Google Gemini and split across on-device Apple silicon and Private Cloud Compute.

| Field | Detail |
| --- | --- |
| Date | Announced 8 June 2026 (WWDC26) |
| Vendor | Apple |
| Products | Siri AI; Apple Intelligence across iOS/iPadOS/macOS/watchOS/visionOS 27 |
| Model stack | Apple Foundation Models (Gemini collaboration); on-device + Private Cloud Compute |
| Developer frameworks | Foundation Models framework (Swift, on-device + PCC + third-party LLMs); Core AI (custom PyTorch on Apple silicon); App Intents for Siri actions |
| Availability | Developer Program beta 8 June 2026 (iOS/iPadOS/macOS/visionOS); watchOS Siri beta later; public beta next month; user Siri beta English-first later in 2026; GA fall 2026 |
| Hardware (base) | iPhone 16+, iPhone 15 Pro/Max, iPad mini (A17 Pro), M1+ iPad/Mac, MacBook Neo, Vision Pro, Watch S9+/Ultra 2+/SE 3 with paired iPhone |
| Hardware (advanced on-device) | iPhone Air, iPhone 17 Pro/Max, iPad (M4)+ with ≥12GB RAM, Mac (M3)+ with ≥12GB, Vision Pro (M5); required for expressive voices and advanced dictation |
| Pricing / limits | Server-model features (e.g. photorealistic Image Playground) carry daily usage caps; expanded access on most iCloud+ plans (numeric quotas not published); compatible Home cameras included on qualifying iCloud+ tiers |
| Regional gates | EU: Siri AI on Mac and Vision Pro initially, not iOS/iPadOS/watchOS; China: unavailable pending regulatory work; Apple Intelligence supports 17 languages |

What changed

  • Siri AI replaces the legacy assistant with personal-context search across Messages, Mail, and Photos; on-screen and Camera-mode awareness; expanded systemwide app actions; web-grounded answers; and a dedicated Siri app with iCloud-private conversation sync across iPhone, iPad, Mac, Watch, and Vision Pro.
  • Invocation surfaces expand beyond “Hey Siri” to Dynamic Island swipe (iPhone), Spotlight (iPad/Mac), control-click context menus, and Vision Pro look-to-speak with 3D visualisation.
  • On-device plumbing includes a system orchestrator, Spotlight index, and App Toolbox that keep personal-context processing local before escalating frontier workloads.
  • Apple Foundation Models are custom-built in collaboration with Google Gemini for deeply integrated experiences—not exposed as a raw Gemini API to consumers per Apple’s Intelligence announcement.
  • Hybrid execution runs models on device and on Private Cloud Compute; PCC retains Apple’s no-storage privacy promise with ongoing external verification.
  • Image Playground adds photorealistic generation on PCC with hidden SynthID watermarks; Photos gains Spatial Reframing and other on-device intelligence features.
  • Developer betas for Siri AI ship 8 June 2026 on iOS, iPadOS, macOS, and visionOS; watchOS follows in a future beta.

Developer integration surface

Foundation Models framework (Swift) is the primary LLM integration path: on-device sessions, Private Cloud Compute for frontier tasks, tool calling, Dynamic Profiles for multi-model routing, and third-party models via the Language Model protocol (Gemini, Claude, and others). Apple plans to open-source the framework core later in summer 2026. Use it when you want Apple-hosted intelligence inside your app without managing API keys or PCC authentication.

Core AI is a separate stack for deploying custom PyTorch models on Apple silicon—Python conversion tools, ahead-of-time compilation in Xcode, Swift inference APIs, and Core AI debugging instruments. Use Core AI when you bring your own weights; use Foundation Models when you consume Apple’s Foundation Models or attach approved third-party LLM providers.

App Intents and Spotlight integrations extend Siri AI personal context to third-party apps. View Annotations and on-screen-awareness APIs let apps participate in Siri’s screen-context flows without exposing raw screenshots to external model vendors.

Why it matters for engineers

Apple’s WWDC26 stack is a platform inference architecture, not a single model API. Builders should plan for dual execution paths: on-device Foundation Models for latency- and privacy-sensitive personal context, and PCC for frontier workloads (photorealistic image generation, broad world knowledge) with quota limits. This article covers the consumer Siri AI and developer framework launch; it is distinct from Apple’s PCC infrastructure expansion on Google Cloud NVIDIA hardware, which focused on attestation, fleet ledgers, and confidential-GPU hosting rather than Siri UX and App Intents.

Feature-detect against two hardware tiers before shipping voice or dictation features: the base Apple Intelligence list (iPhone 16+, M1+ Mac/iPad) differs from the advanced on-device model tier (M4+/M3+ with ≥12GB unified memory, iPhone 17 Pro family) required for expressive voices and advanced dictation.

Server-model daily caps and iCloud+ entitlements mean client apps must degrade gracefully when users exhaust allotments—Apple has not published numeric quotas, but photorealistic Image Playground and similar PCC-backed features are explicitly rate-limited. Enterprise Mac teams should plan fall GA as a coordinated OS 27 rollout with regional gates: EU iOS/iPadOS Siri AI is deferred whilst Mac and Vision Pro proceed.

For teams comparing hyperscaler assistants: Apple exposes no raw Gemini or Claude endpoint. Capabilities arrive through Foundation Models framework sessions and Siri AI system channels—simplifying privacy review but limiting custom prompt engineering relative to direct API integrations.

On-device Siri personal context versus Private Cloud Compute frontier models

Personal-context Siri workloads stay on Apple silicon; frontier models run in Private Cloud Compute without storing user prompts.

Intelligence routing at WWDC26

flowchart TB
  USER["User or app request"]
  LOCAL["On-device Foundation Models"]
  PCC["Private Cloud Compute"]
  ANS["Response to user"]
  USER --> LOCAL
  LOCAL -->|"personal context"| ANS
  LOCAL -->|"frontier workload"| PCC
  PCC --> ANS

Research supplement

Web search was unavailable during production of this post. The following notes flag external sources worth checking to deepen specific claims in the article — all URLs listed are from the author's own reference set and are not newly discovered sources.

  • PCC architecture and security model: Apple first published technical documentation on Private Cloud Compute at WWDC24 and via its security research blog. Readers seeking the external verification mechanism referenced in this article should consult Apple's current security documentation for any updates since the original 2024 PCC white paper.
  • SynthID watermarking: SynthID is Google DeepMind's AI content watermarking standard. Its appearance in Apple's Image Playground outputs is a direct consequence of the Gemini collaboration. DeepMind's public SynthID documentation would clarify the detection and verification process for watermarked outputs.
  • App Intents and Core AI framework evolution: The Core AI framework reference at developer.apple.com/documentation/coreai (author reference #3) is the authoritative current source for developer integration details; readers building for iOS 27 should treat this as primary documentation over any third-party summary.
---


Claude Fable 5 and Mythos 5: Anthropic Ships Mythos-Class Model With Opus Fallback Safeguards

Anthropic shipped Claude Fable 5 on 9 June 2026—a Mythos-class frontier model for general use with classifier fallbacks to Claude Opus 4.8 on sensitive cyber, biology, and distillation queries—alongside restricted Claude Mythos 5 access for Project Glasswing defenders and separate biology trusted-access programmes.

Short video walkthrough

Engineering walkthrough — ElevenLabs narration, HeyGen bookends, API vs claude.ai defaults, and official Anthropic B-roll (~6 min).

| Field | Detail |
| --- | --- |
| Date | General availability 9 June 2026 |
| Vendor | Anthropic |
| Products | Claude Fable 5 (GA); Claude Mythos 5 (Glasswing cyber partners only) |
| API model ID | claude-fable-5 (Mythos 5 has no general API ID) |
| Availability | API and consumption-based Enterprise: full access from launch; claude.ai and third-party surfaces; subscription plans staged through 22 June 2026 |
| Pricing | $10/M input tokens, $50/M output tokens (less than half Mythos Preview) |
| Subscription window | Included on Pro, Max, Team, and seat-based Enterprise through 22 June 2026; usage credits from 23 June until capacity allows reinclusion |
| Safeguards | Cyber, bio/chem, and distillation classifiers route to Opus 4.8 with user notification; triggers in <5% of sessions on average (>95% run Fable with Mythos-equivalent performance) |
| Data retention | 30-day retention on Mythos-class business traffic (first- and third-party surfaces); not used for training; human access logged |

What changed

  • Claude Fable 5 is Anthropic’s first Mythos-class model generally available, with state-of-the-art scores on software engineering, knowledge work, vision, and long-horizon agent benchmarks—lead grows as tasks become longer and more complex per the launch post.
  • New safety classifiers extend constitutional-classifier work: cyber (exploitation plus offensive agentic hacking), biology/chemistry (broad fallback during launch), and distillation (large-scale capability extraction) all route flagged prompts to Claude Opus 4.8 instead of refusals.
  • Claude Mythos 5 shares Fable 5 weights with cyber safeguards lifted for existing Project Glasswing partners upgrading from Mythos Preview; comparable or stronger performance at substantially lower cost.
  • Biology trusted access (separate from Mythos 5) will offer Fable 5 with bio/chem classifiers removed but cyber classifiers still active to a small life-sciences cohort—broader enrolment planned as safeguards narrow.
  • Pricing halved versus Mythos Preview on API and consumption-based Enterprise plans.
  • 30-day retention is required for Mythos-class business traffic to detect novel jailbreaks; data deleted after 30 days with logged human access (Anthropic support article).
  • Red-team validation: external bug bounty reported no universal jailbreak in 1,000+ hours; zero compliance on harmful single-turn cyber requests across 30 public jailbreak techniques in partner testing.
  • Subscription rollout is demand-sensitive: included at no extra cost on paid Claude plans through 22 June 2026, then usage credits until capacity stabilises.

Capability evidence for builders

  • Software engineering: Stripe reported a 50-million-line Ruby migration in one day (versus an estimated two-plus months manually); Cognition’s FrontierCode ranks Fable 5 highest among frontier models at medium effort with improved token efficiency.
  • Knowledge work: highest score on Hebbia’s Finance Benchmark; IMC reported near-perfect trading-analysis results across factual lookup, root-cause analysis, and expected-value reasoning.
  • Vision: state-of-the-art on vision tasks; completed Pokémon FireRed vision-only without navigation harnesses that prior Claude models required.
  • Memory: on Slay the Spire agent runs, file-based memory produced threefold improvement versus Opus 4.8 and threefold higher final-act completion rates.
  • Alignment: automated assessments place Mythos 5 misaligned behaviour similar to Opus 4.8 per the system card.

Why it matters for engineers

Teams wiring production agents must treat Fable 5 as a two-model endpoint: more than 95% of sessions never trigger fallback, but cyber-hardening, bioinformatics, or suspicious bulk-extraction patterns transparently downgrade to Opus 4.8 with user notification. Log response metadata and surface fallback events to operators—latency and capability profiles differ, and conservative classifier tuning means benign security research queries can still trip safeguards during the launch window.
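
One way to surface fallback events is to compare the requested model with whatever the response reports as the serving model. The sketch below assumes the response is available as a dict with a `model` field and that a fallback shows up there; that second point is an assumption, not something the launch material confirms, and the Opus ID in the usage note is hypothetical.

```python
import logging

REQUESTED_MODEL = "claude-fable-5"  # API model ID from the table above
logger = logging.getLogger("fable.fallback")


def detect_fallback(response: dict, requested: str = REQUESTED_MODEL) -> bool:
    """Return True when the response was served by a different model.

    Assumes the serving model appears in response["model"]; returns False
    when the field is absent rather than guessing.
    """
    served = response.get("model")
    if served is None:
        return False
    # Prefix match tolerates dated or suffixed variants of the requested ID.
    fell_back = not served.startswith(requested)
    if fell_back:
        logger.warning("fallback: requested %s, served by %s", requested, served)
    return fell_back
```

For example, `detect_fallback({"model": "claude-opus-4-8"})` would return True and log a warning that operators can alert on, while a response from `claude-fable-5` passes silently.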

The API and consumption-based Enterprise path is the reliable integration surface from day one. Subscription inclusion is time-boxed and demand-sensitive; capacity planning for long autonomous coding runs should prefer metered API tiers. Mythos 5 remains outside general API access—cyber defenders need Glasswing or a future trusted-access application; biology researchers follow the separate Fable-without-bio-classifiers programme.

Long-context and file-backed memory improvements matter for multi-hour agent loops: Fable 5 sustains focus across millions of tokens and benefits disproportionately from persistent notes versus Opus 4.8. Vision-only harnesses now complete screenshot-to-code and scientific-figure extraction tasks that previously required scaffolding.

Regulated workloads must account for 30-day Mythos-class retention on business traffic, logged human access to stored prompts, and explicit prohibition on training use. Benchmark harnesses that resemble distillation attacks may trigger classifiers—design eval pipelines to tolerate Opus 4.8 fallbacks or isolate test traffic from production API keys.

Frontier model with automatic safe fallback when classifiers route sensitive queries to Opus 4.8

Most Fable 5 sessions run at full frontier capability; cyber, biology, and distillation classifiers route sensitive prompts to Opus 4.8 instead of blocking.

Classifier fallback in production

flowchart LR
  REQ["Agent or app request"]
  CLS["Safety classifiers"]
  FABLE["Fable 5 response"]
  OPUS["Opus 4.8 fallback"]
  OUT["Answer delivered"]
  REQ --> CLS
  CLS -->|"typical workload"| FABLE
  CLS -->|"cyber bio distillation"| OPUS
  FABLE --> OUT
  OPUS --> OUT

Research supplement

Web search was not available in this environment. The following context is drawn from the article and linked reference materials only.

The classifier-fallback approach described in Fable 5 relates to broader AI safety literature on output filtering versus refusal. Anthropic's published safety work (ASL-3 and higher commitments) has flagged cyber and CBRN (chemical, biological, radiological, nuclear) as priority dual-use categories — the three Fable 5 classifier domains (cyber, bio/chem, distillation) map directly onto these commitments. The system card cited in the article (claude-fable-5-mythos-5-system-card) is the primary source for evaluating classifier accuracy claims independently.

Project Glasswing is described at anthropic.com/glasswing as a defenders-focused initiative; the article does not reproduce its full scope. Engineers evaluating Mythos 5 access should consult that page directly for enrollment criteria.

The API model ID (claude-fable-5) and current pricing are listed in Anthropic's models overview at platform.claude.com/docs/en/about-claude/models/overview, which is the authoritative source for integration and should be checked against the article's stated rates before capacity planning.


Google Colab CLI: Provision GPUs and Run Scripts from Your Terminal

Google Colab CLI turns Colab from a browser-only notebook into a programmable remote runtime you drive from your terminal — provision a T4 or A100, pipe a local .py file to a Jupyter kernel in the cloud, pull checkpoints back, and tear the VM down, without opening a tab. Google shipped it in June 2026 as an agent-ready bridge between local dev machines and Colab compute.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph LR
  T[Local terminal] -->|colab new / exec| API[Colab assign API]
  API -->|runtime proxy token| VM[Remote Colab VM]
  VM --> K[Jupyter kernel]
  K --> GPU[GPU or TPU]
  VM -->|colab download| A[Local artifacts]
  API -->|keep-alive 60s| VM

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff

  class T,A agent
  class API,VM,K,GPU hook

What problem it solves

Before the CLI, Colab meant: open a notebook in Chrome, click Connect, upload files manually, and babysit the runtime. That breaks down for shell pipelines, CI-style jobs, and coding agents that only speak bash. The CLI exposes the same rented VMs through commands like colab new --gpu T4, colab exec -f train.py, and colab run --gpu T4 train.py — a one-shot provision → execute → teardown path.

Google’s launch post positions it for both humans and agents: any tool with terminal access (Claude Code, Codex, Antigravity, etc.) can provision accelerators, install packages with uv, run local scripts remotely, export replayable .ipynb logs, and download weights — without writing cloud provisioning code yourself.

How the architecture works

| Layer | What it does | Where it lives |
| --- | --- | --- |
| CLI (Typer) | Commands, session names, auth | Your Mac or Linux machine |
| Assign API | Allocate VM, return endpoint + proxy token | colab.research.google.com/tun/m/assign |
| Keep-alive daemon | Ping every 60s; 24h cap | Detached local process per session |
| Jupyter kernel | Execute Python via WebSocket | Remote VM (/content cwd) |
| Contents API | Upload/download/list files | Same VM via Jupyter HTTP |
| Local state | Session metadata, kernel id | ~/.config/colab-cli/sessions.json |
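
To make the keep-alive row concrete, here is a minimal sketch of a ping loop with a hard cap. It is not the CLI's implementation: the function name, the injectable `ping`, `sleep`, and `clock` callables, and the stop-on-failed-ping behaviour are all assumptions for illustration. Only the 60-second interval and the 24-hour cap come from the table.

```python
import time
from typing import Callable

PING_INTERVAL_S = 60           # from the table: ping every 60s
SESSION_CAP_S = 24 * 60 * 60   # from the table: 24h cap


def keep_alive(ping: Callable[[], bool],
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic,
               interval_s: float = PING_INTERVAL_S,
               cap_s: float = SESSION_CAP_S) -> int:
    """Ping until the cap elapses or a ping fails; return the number of successful pings."""
    started = clock()
    sent = 0
    while clock() - started < cap_s:
        if not ping():      # hypothetical: a failed ping means the VM is gone
            break
        sent += 1
        sleep(interval_s)
    return sent
```

Injecting `sleep` and `clock` keeps the loop testable without waiting a real day; a real daemon would pass a `ping` that calls the runtime endpoint.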

Important detail: colab exec -f script.py reads the file locally and sends source to the kernel — you do not need a separate upload step for execution. Use colab upload / colab download for datasets, checkpoints, and zips.

Install and authenticate

# Recommended
uv tool install google-colab-cli

# Or pip (requires Python 3.13+)
pip install google-colab-cli

# Quick smoke test
colab new
echo "print('Hello from Colab')" | colab exec
colab stop

Two auth layers matter:

  • CLI → Colab control plane: --auth oauth2 (browser flow, token in ~/.config/colab-cli/token.json) or --auth adc (Application Default Credentials, preferred for agents).
  • VM → GCP services: colab auth inside a session for BigQuery/GCS; separate from CLI login.

# Agent-friendly ADC setup (all required scopes)
gcloud auth application-default login \
  --scopes=openid,\
https://www.googleapis.com/auth/cloud-platform,\
https://www.googleapis.com/auth/userinfo.email,\
https://www.googleapis.com/auth/colaboratory

colab --auth=adc whoami
colab --auth=adc new -s my-job

colab new pre-flights scopes: if the colaboratory scope is missing, it unassigns the fresh VM and prints remediation — avoiding silent 403s mid-job.
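
The pre-flight idea can be sketched as a set difference between the scopes a job needs and the scopes a token actually carries. This is an illustration, not the CLI's code: `missing_scopes` is a hypothetical helper, and it assumes the granted scopes arrive as a space-separated string, which is how OAuth 2.0 token responses usually express them. The four scope values are the ones listed in the ADC command above.

```python
REQUIRED_SCOPES = frozenset({
    "openid",
    "https://www.googleapis.com/auth/cloud-platform",
    "https://www.googleapis.com/auth/userinfo.email",
    "https://www.googleapis.com/auth/colaboratory",
})


def missing_scopes(granted: str) -> list[str]:
    """Return the required scopes absent from a space-separated scope string."""
    have = set(granted.split())
    return sorted(REQUIRED_SCOPES - have)


if __name__ == "__main__":
    token_scopes = "openid https://www.googleapis.com/auth/cloud-platform"
    for scope in missing_scopes(token_scopes):
        print("missing:", scope)
```

Checking before any long job starts is the point: a missing scope is cheaper to report at allocation time than as a 403 halfway through training.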

Command map

| Group | Command | Job |
| --- | --- | --- |
| Session | colab new [-s NAME] [--gpu GPU] [--tpu TPU] | Allocate VM + start keep-alive |
| Session | colab sessions / colab status -s NAME | List / inspect hardware + IDLE/BUSY |
| Session | colab stop -s NAME | Kill daemon, shutdown kernel, release VM |
| Session | colab url -s NAME [--open] | Browser link to attach to CLI session |
| Execute | colab exec [-s NAME] [-f FILE] [--output-image PATH] | Run stdin, .py, or .ipynb |
| Execute | colab repl / colab console | Interactive Python or raw tmux shell |
| Execute | colab run [--gpu GPU] [--keep] script.py [args] | One-shot new + exec + stop |
| Files | colab upload / download / ls / rm / edit | Jupyter Contents API wrappers |
| VM setup | colab install [-r requirements.txt] PKG... | uv pip install --system (falls back to pip) |
| VM setup | colab drivemount / colab auth | Drive + GCP creds (interactive) |
| Logs | colab log -o run.ipynb | Export history as ipynb/md/jsonl |
| Agent | colab skill | Print bundled COLAB_SKILL.md |

GPU and TPU options

| Flag | Accelerator | Typical use |
| --- | --- | --- |
| (none) | CPU | Light scripts, orchestration tests |
| --gpu T4 | NVIDIA T4 | Fine-tuning, inference smoke tests |
| --gpu L4 | NVIDIA L4 | Efficient inference/training |
| --gpu G4 | NVIDIA G4 | Graphics/ML workloads |
| --gpu A100 | NVIDIA A100 | Large-model training |
| --gpu H100 | NVIDIA H100 | Top-tier training (tier-gated) |
| --tpu v5e1 | TPU v5e | TPU-native JAX/Flax jobs |
| --tpu v6e1 | TPU v6e | Newer TPU slice |

Accelerator access is subscription- and quota-gated. HTTP 400 on colab new --gpu X usually means no entitlement — fall back to T4 or CPU. Unrecognized --gpu values silently map to A100 in the client; spell GPU names exactly.
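
A fallback chain along these lines could be scripted as below. Treat it as a sketch under assumptions: it assumes `colab new` exits non-zero when allocation fails (the article only mentions the HTTP 400), and `new_with_fallback`, `run_cli`, and the preference order are hypothetical. GPU names are validated up front because of the silent A100 mapping described above.

```python
import subprocess
from typing import Callable, Optional, Sequence

KNOWN_GPUS = ("T4", "L4", "G4", "A100", "H100")  # names from the table above


def run_cli(argv: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def new_with_fallback(session: str,
                      preferences: Sequence[Optional[str]] = ("A100", "T4", None),
                      run: Callable[[Sequence[str]], int] = run_cli) -> str:
    """Try each accelerator in order (None means CPU); return the one that worked."""
    for gpu in preferences:
        if gpu is not None and gpu not in KNOWN_GPUS:
            raise ValueError(f"unknown GPU name: {gpu!r}")  # refuse misspellings early
    for gpu in preferences:
        argv = ["colab", "new", "-s", session]
        if gpu is not None:
            argv += ["--gpu", gpu]
        if run(argv) == 0:  # assumption: exit code 0 means the VM was assigned
            return gpu or "CPU"
    raise RuntimeError("no accelerator in the preference list could be allocated")
```

Passing `run` as a parameter lets an agent or a test substitute a fake runner instead of touching real quota.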

Built for coding agents

Figure: Five-step agent workflow with Colab CLI, from provision to cleanup. The CLI ships COLAB_SKILL.md via colab skill, so agents get session rules, safe commands, and ADC auth without scraping the README.

Google’s Gemma fine-tuning demo is the canonical agent pattern:

colab new --gpu T4
colab install transformers datasets peft trl bitsandbytes accelerate
colab exec -f finetune_run.py
colab download checkpoints/adapter ./adapter
colab log --output gemma_finetune_log.ipynb
colab stop

Agent-safe: new, stop, exec (piped/file), run, install, upload, download, log. Agent-unsafe (TTY): unpiped repl/console, auth, drivemount.

For parallel jobs, isolate state: colab --config /tmp/job-a.json new -s trainer-a. Always name sessions and call colab stop when the job ends: the keep-alive daemon holds an idle VM open, so it keeps burning compute units until you release it.
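
One way to make the cleanup rule hard to forget is to wrap the session in a context manager. The sketch below only assembles the `--config`, `new -s`, `--gpu`, and `stop -s` invocations already shown in this article; `colab_session` and `run_cli` are hypothetical names, and treating a non-zero exit code as an allocation failure is an assumption.

```python
import contextlib
import subprocess
from typing import Callable, Iterator, List, Optional, Sequence


def run_cli(argv: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(argv)).returncode


@contextlib.contextmanager
def colab_session(name: str,
                  gpu: Optional[str] = None,
                  config: Optional[str] = None,
                  run: Callable[[Sequence[str]], int] = run_cli) -> Iterator[List[str]]:
    """Allocate a named session and always release it, even if the body raises."""
    base = ["colab"] + (["--config", config] if config else [])
    new_cmd = base + ["new", "-s", name] + (["--gpu", gpu] if gpu else [])
    if run(new_cmd) != 0:
        raise RuntimeError(f"could not allocate session {name!r}")
    try:
        yield base  # command prefix to reuse for exec/download calls
    finally:
        run(base + ["stop", "-s", name])


# Usage sketch:
# with colab_session("trainer-a", gpu="T4", config="/tmp/job-a.json") as colab:
#     run_cli(colab + ["exec", "-s", "trainer-a", "-f", "train.py"])
```

Because the stop call sits in `finally`, a crashed training script still releases the VM.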

Shebang one-liners with colab run

#!/usr/bin/env -S colab run --gpu L4 --keep
import torch
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))

chmod +x script.py && ./script.py provisions a fresh VM, runs the script with forwarded sys.argv, propagates exit codes, and tears down unless --keep is set (the shebang above passes --keep, so that VM stays up after the script exits). CLI status messages go to stderr; script stdout stays clean for piping.

Three workflows that cover most jobs

1. Training with checkpoint pull

colab new -s trainer --gpu A100
colab install -s trainer torch transformers
colab exec -s trainer -f train.py
colab download -s trainer checkpoints/model.bin ./model.bin
colab stop -s trainer

2. Local notebook on cloud kernel

colab new -s analysis
colab exec -s analysis -f report.ipynb   # writes report_output.ipynb locally
colab log -s analysis -o execution_log.md
colab stop -s analysis

3. Fire-and-forget GPU job

colab run --gpu T4 train.py --epochs 3 --lr 1e-4

Hybrid tip: colab url -s NAME --open attaches the browser UI to a CLI-provisioned VM — explore in the notebook, automate in the shell.

CLI vs browser-only Colab

|  | Browser notebook | Colab CLI |
| --- | --- | --- |
| Interface | Cells, widgets, plots inline | Terminal, scripts, CI, agents |
| Session start | Connect button in tab | colab new / colab run |
| Keep-alive | Browser activity (~90 min idle) | Detached daemon (24h cap) |
| File sync | Manual upload UI | upload/download + exec without upload |
| Automation | Limited headless | Native pipelines, shebang, agent loops |
| Agent path | Colab MCP Server (in-notebook) | COLAB_SKILL.md + bash tools |
| Platform | Any browser | Linux and macOS only (no Windows yet) |

Limits and footguns

| Constraint | Impact | Mitigation |
| --- | --- | --- |
| Python 3.13+ required | Older system Python won’t install | Use uv tool install |
| Compute units | Billable while VM runs | colab stop; use run for ephemeral jobs |
| Default exec timeout 30s | Long training may look “hung” | Pass --timeout on exec/run |
| Kernel persists | State leaks between exec calls | restart-kernel or fresh session |
| Interactive commands | Block agents | Pipe stdin or use exec -f |
| GPU quota | 400 on assign | Fall back CPU/T4; check colab pay |

Performance summary

| Dimension | Before CLI | With Colab CLI |
| --- | --- | --- |
| Provision GPU | Browser connect + UI clicks | colab new --gpu T4 from shell |
| Run local code remotely | Upload + paste cells | colab exec -f script.py |
| One-shot jobs | Manual lifecycle | colab run or shebang |
| Agent integration | Custom Selenium / MCP only | Bundled COLAB_SKILL.md |
| Artifact recovery | Manual download UI | colab download + colab log |
| Headless keep-alive | Tab must stay open | 60s daemon, no browser |
| Package | pip in cells | colab install via uv |

Latest release: v0.5.9 (PyPI, Jun 2026)

---

Loop Engineering: Design Coding-Agent Systems Instead of Prompting Every Turn

Loop engineering means you stop being the person who types every prompt to a coding agent — and start designing a small system that discovers work, delegates it, checks it, remembers progress, and repeats. The leverage moves from prompt craft to loop design: six primitives that now ship inside tools like Claude Code and the Codex app instead of bespoke bash you maintain forever.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  subgraph Stack["Three layers"]
    H[Harness engineering] --> L[Loop engineering]
    L --> O[Orchestration layer]
  end
  H -->|one agent runtime| T[Tools memory sandbox]
  L -->|schedule + verify| P[Six primitives]
  O -->|fleet + PR lifecycle| R[Reactions state machine]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class H,L,O agent
  class T,P,R hook

Where the conversation landed in 2026

The shift is no longer niche. Boris Cherny, who leads Claude Code at Anthropic, described it on the Acquired podcast as: “I don’t prompt Claude anymore. I have loops running that prompt Claude and figure out what to do. My job is to write loops.” Peter Steinberger put the same idea on X: “You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.” Both are saying the human job moved up one floor — from typing each turn to designing feedback systems.

That floor has three names in practice. Harness engineering is the runtime around one agent (tools, memory, permissions). Loop engineering is the harness that runs on a schedule, spawns helpers, and feeds itself from disk. Orchestration is the layer above when you need fleets of agents across worktrees, PRs, and CI — with automatic routing of failures back to the right session.

The universal five-stage cycle

Figure: Five stages of a coding agent loop (discover, plan, execute, verify, iterate). Every serious loop, single agent or fleet, runs the same cycle until a verifiable stop condition holds.

| Stage | What happens | Typical tooling |
| --- | --- | --- |
| Discover | Find work: CI failures, issues, diffs, inbox | Automations, /loop, triage skills |
| Plan | Break goal into steps with constraints | Skills, VISION.md, spec sub-agent |
| Execute | Edit code, run tools, open PRs | Worktrees, MCP connectors |
| Verify | Push against objective signals, not model opinion | Tests, lint, /goal evaluator, critic sub-agent |
| Iterate | Fix gaps and loop again | Stop hooks, reactions, state file |

A prompt gives instructions for one turn. A loop gives a job: discover → plan → execute → verify → iterate until done. You set the goal; the loop runs itself.
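
The cycle can be sketched as a small driver that takes each stage as a callable. Everything here is illustrative: `run_loop`, `LoopResult`, and the iteration cap are hypothetical, and real stages would call an agent, CI, and a test runner rather than plain functions. The shape it shows is that verification output becomes the next round's work until nothing is failing or the budget runs out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    done: bool
    iterations: int
    log: List[str] = field(default_factory=list)


def run_loop(discover: Callable[[], List[str]],
             plan: Callable[[List[str]], List[str]],
             execute: Callable[[List[str]], None],
             verify: Callable[[], List[str]],
             max_iterations: int = 5) -> LoopResult:
    """Discover work once, then plan/execute/verify until verify reports no failures."""
    log: List[str] = []
    work = discover()
    for i in range(1, max_iterations + 1):
        if not work:                      # stop condition: nothing left failing
            return LoopResult(True, i - 1, log)
        steps = plan(work)
        execute(steps)
        failures = verify()               # objective signal, not the agent's opinion
        log.append(f"iteration {i}: {len(failures)} failing")
        work = failures                   # iterate: failures feed the next round
    return LoopResult(not work, max_iterations, log)
```

The explicit `max_iterations` is the part that is easy to leave out; without it an unattended loop has no cost ceiling.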

Open loops vs closed loops

|  | Open loop | Closed loop |
| --- | --- | --- |
| Nature | Exploratory; wide search space | Bounded path you designed |
| Risk | Token burn; “slop machine” without gates | Cheaper; predictable |
| Needs | Large budget + strong evaluators | Clear goal, defined steps, stop condition |
| Start here? | Research spikes, benchmarks | Production coding, triage, migrations |

Closed loops need five ingredients on disk: goal (precise done), context (VISION.md, ARCHITECTURE.md, RULES.md), action (scoped tools), feedback (tests, lint, structured errors), and a stop condition (/goal text, Stop hook, or orchestrator brief). Without a quality gate, AI drifts; with one, it improves.

Single-agent loop vs fleet loop

|  | Single-agent loop | Fleet loop |
| --- | --- | --- |
| Shape | One brain runs discover→verify end-to-end | Orchestrator splits work across specialists |
| Good for | Focused refactors, /goal migrations | Large features, parallel PRs, research→build→QA chains |
| Token profile | ~50K–200K tokens per medium coding task | ~500K–2M+ when orchestrator + 3+ specialists run |
| Example split | Explore → implement → verify sub-agents | Research specialist → engineering specialist → QA specialist, each with its own loop |

What changed in agentic development

For roughly two years, “good AI coding” meant writing strong prompts and feeding enough context each turn. You typed, read, typed again — the agent was a power tool and you held the handle every step.

Loop engineering is the next layer: a recursive goal where you define purpose and done, and the system iterates until a verifiable condition holds. You design once; the loop pokes agents on a schedule or across turns. This sits one floor above agent harness engineering (the environment one agent runs in) and the factory model (the system that builds software) — same family of ideas, but the harness now runs on a timer, spawns helpers, and feeds itself from disk-based memory.

The six primitives every loop needs

Figure: Six building blocks of a coding agent loop. Five action primitives plus persistent state; the shape is the same across major coding-agent products.

| # | Primitive | Job in the loop | Without it |
| --- | --- | --- | --- |
| 1 | Automations | Scheduled discovery and triage | You manually check CI, issues, and diffs |
| 2 | Worktrees | Isolate parallel agent checkouts | Two agents overwrite the same files |
| 3 | Skills | Project knowledge on disk (SKILL.md) | Agent re-guesses conventions every run |
| 4 | Connectors (MCP) | Issues, DB, Slack, staging APIs | Agent only sees the filesystem |
| 5 | Sub-agents | Separate maker and checker roles | One model grades its own homework |
| 6 | State / memory | Markdown, Linear board, AGENTS.md | Model forgets between runs; loop restarts blind |

The agent forgets; the repo does not. Long-running loops depend on external state — not context window — to remember what was tried, what passed, and what is next. Common context files beyond SKILL.md: VISION.md (what success looks like), ARCHITECTURE.md (stack and layout), RULES.md (forbidden actions), GUARDRAILS.md (always-on checklists), and AGENTS.md (repo map for agents).
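
As a rough illustration of external state, the sketch below keeps "tried / passed / open" lists in a JSON file that each run reloads. The file name, the three keys, and the helper names are hypothetical; the article's own examples use Markdown or an issue board, and JSON is used here only because it is easy to update safely from a script.

```python
import json
from pathlib import Path

STATE_KEYS = ("tried", "passed", "open")


def load_state(path: Path) -> dict:
    """Read loop state from disk, or start empty on the first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {key: [] for key in STATE_KEYS}


def record(path: Path, item: str, passed: bool) -> dict:
    """Record one attempt and persist it so the next run can see it."""
    state = load_state(path)
    if item not in state["tried"]:
        state["tried"].append(item)
    target, other = ("passed", "open") if passed else ("open", "passed")
    if item not in state[target]:
        state[target].append(item)
    if item in state[other]:
        state[other].remove(item)
    tmp = path.with_name(path.name + ".tmp")   # write then rename, so a crash
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)                          # never leaves a half-written file
    return state
```

A run that starts by calling `load_state` knows what is still open without any of it living in the context window.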

Codex app vs Claude Code — same shape, different names

| Primitive | Codex app | Claude Code |
| --- | --- | --- |
| Automations | Automations tab: project, prompt, cadence, local or worktree env; Triage inbox; thread vs standalone runs | /loop, Desktop scheduled tasks, Cloud Routines (/schedule), hooks, GitHub Actions |
| Worktrees | Built-in per thread | git worktree, --worktree, isolation: worktree on subagents |
| Skills | SKILL.md, invoke with $name or /skills | Same SKILL.md folder format; bundled /loop, /code-review |
| Connectors | MCP connectors + plugins | MCP servers + plugins; routine connectors on claude.ai |
| Sub-agents | TOML in .codex/agents/ | .claude/agents/, agent teams |
| State | Markdown / Linear via connector; thread memory | AGENTS.md, progress files, prd.json-style task queues |

Once you see the shared shape, the debate shifts from “which tool” to “which loop design still works in either seat.”

1. Automations — the heartbeat

Automations turn a one-off agent run into a loop. In the Codex app you configure project, prompt, schedule, and environment (local checkout or background worktree). Runs with findings land in a Triage inbox; empty runs archive themselves. Internal uses include daily issue triage, CI failure summaries, commit briefings, and regression hunts. Automations can call $skill-name so recurring logic stays maintainable.

Claude Code reaches the same outcome via /loop (interval reruns), cron scheduling, lifecycle hooks, Desktop scheduled tasks (persistent while app is open), Cloud Routines (runs when laptop is closed), or GitHub Actions for headless runs.

Interactive pick: /goal vs /loop vs Stop hooks

| Mechanism | Next turn starts when… | Stops when… | Best for |
| --- | --- | --- | --- |
| /goal (Claude) | Previous turn finishes | Separate evaluator model confirms condition (reads transcript only) | Migrations, refactors, “all tests green” |
| /goal (Codex) | Thread idle after turn | Evidence in thread supports completion; pause/resume/clear/budget | Multi-hour tuning, benchmarks, long refactors |
| /loop | Time interval elapses | You stop it or agent decides done | Polling deploys, periodic summaries, PR babysitting |
| Stop hook | Previous turn finishes | Your script, prompt hook, or agent hook decides | Ralph-style loops, org-wide completion rules |

# Claude Code — run until tests and lint are clean (v2.1.139+)
/goal all tests in test/auth pass and the lint step is clean

# Check spend and evaluator reasoning
/goal

# Stop early
/goal clear

# Headless single invocation
claude -p "/goal CHANGELOG.md has an entry for every PR merged this week"

# Codex — long-running performance goal (cookbook pattern)
/goal Reduce p95 checkout latency below 120 ms, verified by the checkout benchmark,
while keeping the correctness suite green. If blocked, stop with evidence.

/goal on Claude Code starts a turn immediately; after each turn Haiku (by default) judges yes/no from the transcript — it does not run tools. Codex /goal is thread-scoped with explicit budget accounting and pause/resume. Pair either with auto mode so each turn skips per-tool confirmations.

2. Worktrees — parallel without collisions

Two agents editing the same file is the same failure mode as two engineers on one branch without coordination. A git worktree is a separate working directory on its own branch, sharing history but not files. Codex threads use worktrees natively; Claude Code offers --worktree sessions and isolation: worktree on subagents that clean up after themselves.

Worktrees remove mechanical collision; your review bandwidth still caps how many parallel agents you can actually supervise.
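
For a hand-rolled setup, the per-agent checkout comes down to two plain git commands. The sketch below only builds the argument lists and leaves running them to the caller; the sibling-directory layout and the `agent/<name>` branch naming are assumptions for illustration, not a convention either tool prescribes.

```python
from pathlib import Path
from typing import Dict, List


def worktree_commands(repo: Path, agent: str, base: str = "main") -> Dict[str, List[str]]:
    """Return git argv lists that create and later remove an isolated checkout."""
    path = repo.parent / f"{repo.name}-{agent}"   # hypothetical layout: sibling directory
    branch = f"agent/{agent}"                     # hypothetical branch naming
    return {
        # git worktree add -b <new-branch> <path> <commit-ish>
        "create": ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base],
        # git worktree remove <path>
        "cleanup": ["git", "-C", str(repo), "worktree", "remove", str(path)],
    }


if __name__ == "__main__":
    for name, argv in worktree_commands(Path("/work/app"), "fixer").items():
        print(name, " ".join(argv))
```

Each agent gets its own directory and branch while sharing one object store, which is what keeps parallel edits from colliding.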

3. Skills — stop paying intent debt every session

Agents start cold. Every missing convention becomes a confident guess — intent debt. A skill is intent written outside the chat: a folder with SKILL.md, optional scripts, references, and assets. Both Codex and Claude Code load skills when you invoke $name or when the task matches a tight, boring description (clever descriptions match too often).

# Example skill layout
my-project-skill/
  SKILL.md          # conventions, build steps, forbidden patterns
  scripts/
  references/

Skill vs plugin: the skill is the authoring format; a plugin bundles skills and connectors for teammates to install once.

4. Connectors — act in your real environment

MCP connectors let the loop read Linear/Jira, query databases, hit staging APIs, and post to Slack. That is the difference between “here is the fix” and “open the PR, link the ticket, ping the channel when CI is green.” Plugins package connectors with skills so onboarding is one install, not tribal memory.

Feedback signals that keep loops honest

Figure: Hierarchy of agent loop feedback signals, from tests to self-critique. A loop with nothing to push against is just the agent agreeing with itself, so layer deterministic, perceptual, and critic signals.

| Signal type | Examples | Strength |
| --- | --- | --- |
| Deterministic oracles | CI, unit tests, type checks, linters, git diff, scalar metrics (e.g. benchmark p95) | Strongest: pass/fail without model judgment |
| Perceptual / visual | Playwright, browser MCP tools, layout screenshots | Medium: catches UI regressions code tests miss |
| Critic sub-agents | Separate reviewer agent; forces retry or stop | Medium: judgment, but not the worker context |
| Persistent context | GUARDRAILS.md, skills, checklists loaded every run | Always-on oracle |
| LLM self-critique only | “Does this look good?” from same model | Weakest: rationalises its own mistakes |

Strongest systems stack multiple signal types: deterministic for reliability, visual/critic for judgment, human gates on high-stakes merges. Signals must route back automatically — full logs, diffs, scores — without you copy-pasting CI output each turn.

5. Sub-agents — maker vs checker

Figure: Maker agent and checker agent split in a coding loop. The highest-leverage split: implement in one agent, verify in another, including /goal’s separate done-evaluator.

The model that wrote the code is too lenient grading itself. A second agent — different instructions, sometimes a different model — catches rationalised mistakes. Typical trio: explore, implement, verify against spec. In fleet setups, a validator agent reports truth without fixing — failures loop back to the builder.
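
The maker/checker split can be sketched with two callables and a retry cap. The names `maker_checker`, `toy_maker`, and `toy_checker` are hypothetical, and the toy pair stands in for two separately prompted agents: the checker never edits the draft, it only runs it and returns feedback that loops back to the maker.

```python
from typing import Callable, Optional, Tuple


def maker_checker(make: Callable[[Optional[str]], str],
                  check: Callable[[str], Tuple[bool, str]],
                  max_rounds: int = 3) -> Tuple[str, int]:
    """Ask the maker for drafts until the checker accepts one; return (draft, rounds)."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        draft = make(feedback)
        ok, feedback = check(draft)
        if ok:
            return draft, round_no
    raise RuntimeError(f"checker rejected every draft after {max_rounds} rounds: {feedback}")


def toy_maker(feedback: Optional[str]) -> str:
    """Stand-in for the implementing agent: first draft is buggy, the retry is fixed."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def toy_checker(draft: str) -> Tuple[bool, str]:
    """Stand-in for the verifying agent: runs the draft against a spec, never edits it."""
    namespace: dict = {}
    exec(draft, namespace)  # acceptable only because the toy draft is trusted
    result = namespace["add"](2, 3)
    if result == 5:
        return True, ""
    return False, f"add(2, 3) returned {result}, expected 5"
```

Calling `maker_checker(toy_maker, toy_checker)` accepts the second draft; the cap turns "the checker never agrees" into a visible failure rather than an endless loop.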

# Codex — custom subagent (simplified .codex/agents/security-reviewer.toml)
name = "security-reviewer"
description = "Read-only security pass on diffs"
instructions = "Find auth, injection, and secret-leak risks. No edits."
model = "strong"
reasoning_effort = "high"

Sub-agents cost extra tokens (each runs its own model + tools). Spend them where a second opinion unlocks unattended runs — the only reason you can walk away from a loop.

Orchestration — when one loop is not enough

Single-session /goal loops solve “finish this migration without me re-prompting.” Fleet-scale work needs an orchestration layer: deterministic plumbing plus an orchestrator agent for judgment.

| Layer | Job | Examples |
| --- | --- | --- |
| Deterministic plumbing | Route environmental feedback automatically | CI fail → inject logs into worker session; PR conflict → notify right agent; lifecycle state machine (working → ci_failed → review_pending → merged) |
| Orchestrator agent | Decompose goals, write briefs, batch parallel work | Research agent → spec → tracking issue → N workers in isolated worktrees |
| Human gates | Vision, acceptance, high-risk merges | Triage inbox, PR approval: optimise human time, not remove humans |

Open-source reference implementations like Agent Orchestrator (npm install -g @aoagents/ao) ship reactions engines, worktree isolation, and orchestrator prompts out of the box. The pattern: inner agents execute in bounded loops; outer orchestrator coordinates; environmental signals keep loops honest; you stay on vision and judgment.
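
The lifecycle state machine named in the table can be sketched as a transition map. The four state names come from the table; the event names (`ci_passed`, `logs_injected`, `changes_requested`, `approved`) and the exact transitions are assumptions for illustration and do not describe any particular orchestrator's implementation.

```python
from typing import Dict, Iterable, Tuple

# (current state, event) -> next state
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("working", "ci_failed"): "ci_failed",
    ("working", "ci_passed"): "review_pending",
    ("ci_failed", "logs_injected"): "working",        # failure routed back to the worker
    ("review_pending", "changes_requested"): "working",
    ("review_pending", "approved"): "merged",
}


def advance(state: str, event: str) -> str:
    """Apply one event, rejecting anything the lifecycle does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}") from None


def replay(events: Iterable[str], state: str = "working") -> str:
    """Fold a sequence of events into the final lifecycle state."""
    for event in events:
        state = advance(state, event)
    return state
```

Keeping this part deterministic means a CI failure always reaches the right session, whatever the orchestrator agent happens to decide.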

Walkthrough: one morning triage loop

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
sequenceDiagram
  participant Auto as Morning automation
  participant Skill as Triage skill
  participant State as STATE.md
  participant WT as Worktree
  participant Maker as Fix sub-agent
  participant Check as Review sub-agent
  participant MCP as Connectors

  Auto->>Skill: Run on schedule
  Skill->>State: Write CI failures + issues
  loop Each actionable item
    Auto->>WT: Open isolated checkout
    WT->>Maker: Draft fix
    Maker->>Check: Submit diff
    Check-->>Maker: Approve or reject
    Maker->>MCP: Open PR + update ticket
  end
  Auto->>State: Log done / blocked for human inbox

  • 06:00: Automation fires; triage skill reads yesterday’s CI, open issues, recent commits.
  • Findings: Written to STATE.md or a Linear board (memory outside the chat).
  • Per item: New worktree → maker sub-agent drafts fix → checker sub-agent runs against project skills + tests.
  • Ship: Connectors open PR and update tickets; blocked items land in your inbox.
  • Tomorrow: State file tells the loop what was tried, passed, or still open.

You designed this once. You did not prompt each step — that is the whole point.

Prompt engineer vs loop engineer

| Prompt engineer | Loop engineer |
| --- | --- |
| Crafts better instructions per turn | Designs feedback cycles and stop conditions |
| Linguistic skill | Systems / software engineering skill |
| Better single output | Reliable verified outcomes across runs |
| You review manually each time | System self-corrects against oracles |
| You are the feedback loop | The loop is the feedback loop |
| “Write me a function” | “Write → test → fix until green” |

Self-check: is your loop healthy?

| Question | Healthy loop | Leaky loop |
| --- | --- | --- |
| What proves “done”? | Tests, lint, measurable condition in /goal | Agent says “looks good” |
| Where does memory live? | Repo file or issue tracker | Only in chat context |
| Who verifies? | Separate sub-agent or evaluator model | Same agent that wrote code |
| What pushes back? | Layered oracles (CI + critic + human gate) | Self-critique only |
| Parallelism? | One worktree per agent | Shared checkout |
| Token budget? | Turn cap in condition or manual clear | Open-ended overnight /goal |
| Your role? | Review merged outcomes you understand | Press go and hope |

What loops do not remove — three sharper risks

Verification stays human

An unattended loop is also an unattended mistake machine. Even with a verifier sub-agent, “done” is a claim, not proof. Ship code you confirmed works — especially when diff sizes balloon because agents touch more files than necessary.

Comprehension debt accelerates

The faster the loop ships code you did not write, the wider the gap between what exists and what you understand. Read the reasoning, skim the diff, trace the decision log — or the loop makes the debt grow faster, not slower.

Cognitive surrender

When automation feels smooth, it is tempting to stop having opinions. Loop design with judgement keeps you the engineer; loop design to avoid thinking is the same UI with opposite outcomes. Two teams can run identical loops — one moves faster on work they deeply understand; the other outsources understanding entirely. The loop cannot tell the difference. You can.

Parallel pattern: scheduled content factories

The same week loop engineering went mainstream for coding, creators published parallel “factory” playbooks for media. @0x_fokki’s X Article I Built an AI Animation Factory That Runs 24/7 is not a coding-agent harness — Claude is used as a scriptwriter, not a repo editor — but it shows the same design move: stop hand-driving each step, design a pipeline that runs on a schedule with human approval gates.

Figure: Coding loop and content factory share the same scheduled pipeline shape. Same loop instinct in two domains: you design the system and the gates, not every intermediate prompt.

Fokki’s pipeline chains six tools end-to-end:

Claude → Midjourney → Runway → ElevenLabs → Suno → Make
script → frames → motion → voice → music → publish

One Make scenario runs Monday and Thursday at 08:00: pull scripts from Google Drive, batch Midjourney scene prompts, download frames, send dialogue to ElevenLabs, pair images with Runway motion clips, assemble in a CapCut template, upload to YouTube with generated metadata, clip a 30-second X preview, post Patreon early access, and ping Telegram on completion. A separate on-demand webhook turns client briefs into finished explainers in shared Drive — quoted turnaround ~6 hours after a one-time ~5-hour setup.

Four SKUs share the pipeline: animated story series (6–10 min), brand explainers (60–90 sec), motion comics, and children’s bedtime channels. The human job is narrow: pick the story, pick the style, approve the output — roughly four hours of direction for a “24/7” factory, per the author.

| Loop-engineering primitive | Fokki factory analogue | Key difference |
| --- | --- | --- |
| Automations | Make.com schedule + webhook | No /goal or hooks; cron-style triggers only |
| Skills / context on disk | Reusable Midjourney character sheets, CapCut templates, voice cast notes | Creative consistency prompts, not SKILL.md |
| Sub-agent split | Tool specialization per stage (script vs frames vs motion) | No verifier sub-agent; human approves final cut |
| Connectors | Drive, YouTube, Patreon, Telegram APIs | Distribution stack, not MCP issue trackers |
| Feedback signal | Views, RPM, client acceptance | Business metrics, not CI, lint, or test gates |
| State / memory | Organised Drive folders per episode | Asset library, not AGENTS.md |

What transfers to coding loops

  • Scheduled heartbeat — the factory does not wait for you to open a chat; neither should triage or CI-repair loops.
  • Stage-specialised tools — one brain trying to script, illustrate, animate, and score is the creative version of one agent grading its own code.
  • Performance direction in prompts — Fokki writes ElevenLabs stage direction (pauses, volume drops), not raw dialogue paste; coding loops need equally explicit done conditions in /goal text.
  • Human gate on output — “approve the episode” maps to Triage inbox review and PR merge — optimise human time, do not remove judgment.
  • Setup once, run indefinitely — the Make scenario is the media equivalent of wiring automations + skills once, then letting the loop compound.

Treat revenue figures in social factory posts as illustrative, not audited benchmarks. The architectural lesson is stable: factories, whether for code or content, are designed loops with explicit stages, schedules, and gates. Coding loops simply demand harder oracles (tests, type checks, diffs), because code that sounds convincing can still be broken: “shipped” has to be proven by a passing check, not by how plausible the output looks.

Token economics and balance

| Pattern | Approximate token load | Mitigation |
| --- | --- | --- |
| Single-agent medium coding loop | 50K–200K per run | Turn caps in /goal; cheaper model for explore/review |
| Fleet (orchestrator + 3 specialists) | 500K–2M+ per cycle | Batch only parallelisable work; stuck detection |
| Scheduled daily automation | Millions per week if always-on | Archive empty runs; scope skills tightly |
| Sub-agents + /goal evaluator | Multiplicative per child session | Spend sub-agents on high-risk paths only |

Loops are not free — patterns diverge wildly if you are “token rich” vs “token poor.” Direct prompting still matters for ambiguity and architecture. Loops handle repetition; you handle judgement. The leverage point moved — it did not disappear.
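
A back-of-envelope estimate shows how quickly a schedule multiplies the per-run figures in the table. The helper below is a hypothetical sketch: the 50K–200K per-run range is taken from the table, while the runs-per-day values are assumed for illustration and are not measurements.

```python
from typing import Tuple


def weekly_tokens(per_run_low: int, per_run_high: int, runs_per_day: int) -> Tuple[int, int]:
    """Return the (low, high) weekly token load for a loop on a fixed schedule."""
    runs_per_week = runs_per_day * 7
    return per_run_low * runs_per_week, per_run_high * runs_per_week


if __name__ == "__main__":
    # Single-agent range from the table, at two assumed cadences.
    for runs_per_day in (1, 24):
        low, high = weekly_tokens(50_000, 200_000, runs_per_day)
        print(f"{runs_per_day:>2} runs/day: {low:,} to {high:,} tokens/week")
```

At one run a day the single-agent range stays between 350,000 and 1,400,000 tokens a week; an hourly cadence pushes the same loop to between 8.4 and 33.6 million, which is where archiving empty runs starts to matter.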

Performance summary

| Dimension | Prompt era | Loop era |
| --- | --- | --- |
| Your job | Write each turn | Design discover → plan → execute → verify → remember |
| Core cycle | Ask → answer | Five stages until verifiable done |
| Primitives | Context + prompt | 6 shared building blocks (both major tools) |
| Done signal | You decide to stop | /goal evaluator, Stop hook, or environmental oracles |
| Scale | One thread | Worktrees + sub-agents + orchestration layer |
| Feedback | Your eyes | Layered oracles, not self-critique alone |
| Knowledge | Re-explained each session | Skills + VISION.md / AGENTS.md compound |
| Risk profile | Slower, more oversight | Faster, higher verification + comprehension debt |

Bottom line: Build the loop; stay the engineer who reviews what ships.

Research supplement

The following documentation pages from the official Claude Code docs provide additional technical depth beyond the article's reference links:

  • Scheduled Tasks (/loop): The Scheduled Tasks reference details how /loop works alongside cloud Routines and Desktop scheduled tasks, including the full comparison table of scheduling options, jitter behaviour, seven-day expiry, and the loop.md customisation mechanism. Notably, dynamic /loop schedules can use the Monitor tool internally to stream background process output, avoiding polling entirely.
  • Agent Loop Architecture: The Agent SDK: How the agent loop works page documents the full turn-and-message lifecycle, context window management, automatic compaction, and how max_turns / maxBudgetUsd bounds apply. It also explains how subagents start with a fresh conversation context, which has direct implications for keeping loop context efficient over long runs.

Key technical detail not in the primary reference links: The /goal command is implemented as a session-scoped prompt-based Stop hook. This means developers who need evaluation logic beyond a short text condition (for example, running an actual script to verify state) can write a custom Stop hook instead — which gives them the same turn-by-turn evaluation model with full scripting power.
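
A script-backed Stop hook along those lines might look like the sketch below. Several things are assumptions to verify against the current hooks documentation: that a Stop hook exiting with code 2 blocks the stop and feeds stderr back to the agent, that pytest is the project's test runner, and the `decide` / `run_tests` helpers and block budget, which are hypothetical. How the count of earlier blocks is persisted between invocations is left to the caller.

```python
import subprocess
import sys
from typing import Callable, Tuple

BLOCK_EXIT_CODE = 2  # assumption: exit code 2 from a Stop hook blocks stopping


def run_tests() -> Tuple[bool, str]:
    """Run the test suite; assumes pytest is installed in the hook's environment."""
    proc = subprocess.run([sys.executable, "-m", "pytest", "-q"],
                          capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def decide(check: Callable[[], Tuple[bool, str]],
           blocks_so_far: int,
           max_blocks: int = 10) -> Tuple[int, str]:
    """Return (exit code, stderr message) for one Stop-hook invocation."""
    if blocks_so_far >= max_blocks:
        return 0, "block budget exhausted; letting the agent stop"
    passed, output = check()
    if passed:
        return 0, ""
    # Keep only the tail of the output so the feedback stays small.
    return BLOCK_EXIT_CODE, "Condition not met yet:\n" + output[-2000:]


# As a hook script, the entry point would be roughly:
#   code, message = decide(run_tests, blocks_so_far=<count persisted by you>)
#   print(message, file=sys.stderr); raise SystemExit(code)
```

The block budget plays the role of a turn cap: a condition that can never be met should not be able to keep an unattended session running indefinitely.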

---

References