Most developers still prompt coding agents one task at a time — and become the bottleneck. Daniel Moka argues the shift is agentic loops: bounded systems that discover, plan, execute, verify, and iterate until a hard gate passes, while you own the rules on disk. Average developers write prompts for agents; great developers design loops that prompt agents.
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
G[Goal + RULES.md on disk] --> D[Discover]
D --> P[Plan]
P --> E[Execute in worktree]
E --> V[Verifier agent]
V --> Q{Quality gate tests lint CI}
Q -->|pass| S[Ship PR notify]
Q -->|fail| R[Append lesson to RULES.md]
R --> D
classDef agent fill:#8B0000,color:#fff
classDef hook fill:#189AB4,color:#fff
classDef decision fill:#444,color:#fff
class D,P,E,V agent
class G,R hook
class Q decision
Eight tips from the post (expanded)
| # | Tip | Why it matters | Practice |
|---|---|---|---|
| 1 | Use closed loops | Open loops roam and burn 50K–2M+ tokens per run | Define path, steps, checks, and stop condition before agents run |
| 2 | Loop only repeatable work with checkable done | No gate = no loop: the agent “finishes” confidently wrong | Task must repeat often, have an automatic done-check, and a cheap rollback |
| 3 | Parallel agents in separate git worktrees | File collisions kill parallel speed | One branch copy per session (the GitHub Copilot app pattern) |
| 4 | Separate verifier agent | A maker grading its own homework inherits its own blind spots | Independent context; never share the worker's reasoning trail |
| 5 | Quality gates = tests and linters, not LLM output | LLM-on-LLM review shares correlated failure modes | Compiler, types, integration tests, mutation tests, CI |
| 6 | RULES.md for repeated mistakes | Chat memory wipes between runs | Disk-based lessons the loop reads every pass |
| 7 | Humans own RULES.md | An agent rewriting guardrails turns bugs into policy | Agent may draft rules; a human approves permanent entries |
| 8 | Start with Claude /goal | Built-in loop until the completion condition is met | Tie the condition to hard checks, not “agent said it works” |
A closed loop improves each pass: fail → analyze → encode rule → gate enforces on the next run.
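The fail → analyze → encode rule → gate cycle can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Moka's implementation: `run_closed_loop`, `toy_execute`, and `toy_gate` are hypothetical names, the "agent" is a stub function, and lessons are kept in an in-memory list standing in for draft RULES.md entries (which, per tip 7, a human would still approve before they become permanent).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    passed: bool
    passes_used: int
    rules: list[str] = field(default_factory=list)


def run_closed_loop(
    execute: Callable[[list[str]], str],
    gate: Callable[[str], tuple[bool, str]],
    rules: list[str],
    max_passes: int = 3,
) -> LoopResult:
    """Run execute -> gate until the gate passes or the pass budget is spent."""
    rules = list(rules)  # copy so the caller's list is not mutated
    for attempt in range(1, max_passes + 1):
        artifact = execute(rules)    # the maker reads the current rules
        ok, lesson = gate(artifact)  # hard check (stands in for tests, lint, CI)
        if ok:
            return LoopResult(True, attempt, rules)
        rules.append(lesson)         # encode the failure for the next pass
    # Stop condition reached without a pass: escalate to a human.
    return LoopResult(False, max_passes, rules)


# Toy maker: only ends the file with a newline once a rule tells it to.
def toy_execute(rules: list[str]) -> str:
    return "print('hi')\n" if "end files with a newline" in rules else "print('hi')"


# Toy gate: a deterministic check, not an LLM opinion.
def toy_gate(artifact: str) -> tuple[bool, str]:
    if artifact.endswith("\n"):
        return True, ""
    return False, "end files with a newline"


result = run_closed_loop(toy_execute, toy_gate, rules=[])
print(result)
```

The pass budget (`max_passes`) is what makes the loop closed: it either passes the gate or stops and hands the failure back, rather than roaming.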
Open vs closed loops
| Type | Behaviour | Token risk | When to use |
|---|---|---|---|
| Open | Wide exploration, underspecified goals | 50K–200K single agent; 500K–2M+ fleets | Research spikes only, not the production default |
| Closed | Bounded goal, defined steps, gate each pass | Predictable: stops or escalates | Repeatable engineering work that compounds |
Moka’s Loop Engineering 101 article (12 June 2026) frames the closed loop as the one that improves: each pass feeds the next, so the loop you run a month from now is sharper than day one.
Six building blocks of a closed loop
| Block | Role |
|---|---|
| Automations | Heartbeat: a schedule, issue event, or failing build triggers the loop |
| Worktrees | Isolated branches so parallel makers never collide |
| Skills | VISION.md, RULES.md: project knowledge on disk, not in chat |
| Plugins / connectors | PRs, tickets, CI, Slack: the loop reaches real tools |
| Subagents | Maker writes; checker verifies; never the same agent |
| Memory | State outside the conversation, so the loop does not start cold |
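As a rough illustration of the Worktrees block, the helper below builds one `git worktree add` command per agent session. It is a sketch only: the `agent/<session>` branch naming, the sibling-directory layout, and the `/repos/shop` path are assumptions made for the example, and the commands are constructed but not executed.

```python
from pathlib import PurePosixPath


def worktree_command(repo: PurePosixPath, session: str, base: str = "main") -> list[str]:
    """Build the `git worktree add` command for one isolated agent session."""
    branch = f"agent/{session}"                       # hypothetical naming scheme
    path = repo.parent / f"{repo.name}-{session}"     # sibling directory per session
    # git worktree add -b <new-branch> <path> <commit-ish>
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


commands = [
    worktree_command(PurePosixPath("/repos/shop"), session)
    for session in ("maker-1", "maker-2")
]
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` in a real repository; keeping command construction separate from execution makes the isolation scheme easy to review before any branch is created.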
Moka’s closing frame matches the broader loop-engineering movement (Steinberger, Cherny): you replace yourself as the thing that prompts the agent. You step in at decision points — approving rules, merging PRs, escalating unfixable failures — while the loop decomposes, executes, gates, and learns. Pragmatic middle path: loop what repeats and checks; prompt by hand for the rest.
A GitHub repository with one markdown file and zero lines of code now has more stars than most open-source frameworks — because it encodes what every developer already knows but AI coding agents keep ignoring. Claude Skills turn that frustration into a reusable behavioural contract: write once, load automatically, compound across every session.
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph LR
P[Repeated prompts] -->|every session| A[Agent]
S[SKILL.md or CLAUDE.md] -->|startup metadata| A
A -->|task match| L[Load full instructions]
L --> W[Workflow scripts assets]
W --> O[Specialised output]
classDef agent fill:#8B0000,color:#fff
classDef hook fill:#189AB4,color:#fff
classDef decision fill:#444,color:#fff
class A agent
class S,L,W hook
class P decision
What actually went viral
In January 2026, Andrej Karpathy posted publicly about frustrations with AI coding agents — silent assumptions, over-engineering, scope creep. Within 48 hours, developer Forrest Chang distilled those observations into a single CLAUDE.md file and pushed it to GitHub as multica-ai/andrej-karpathy-skills.
| Fact | Detail |
|---|---|
| Stars (live) | ~176,000+ on GitHub API (article cited 144K in early June) |
| Contents | One CLAUDE.md file, ~66 lines, zero executable code |
| Author | Forrest Chang; not written or endorsed by Karpathy |
| Purpose | Persistent behavioural guidelines for Claude Code at project root |
| Forks | ~18,000+ |
The star count signals widespread frustration more than technical sophistication — but the adoption pattern is real. A text file compressed a workflow problem into something installable in one move.
The four rules inside CLAUDE.md
The viral file encodes a collaboration contract — caution over speed, with judgment for trivial tasks.
| Rule | Plain English | Stops |
|---|---|---|
| 1 · Think Before Coding | State assumptions; ask when unclear; surface tradeoffs | Silent interpretation picks and wrong implementations |
| 2 · Simplicity First | Minimum code for the ask; no speculative abstractions | 200-line rewrites that should be 50 lines |
| 3 · Surgical Changes | Touch only what the request requires; match existing style | Drive-by refactors and unrelated “improvements” |
| 4 · Goal-Driven Execution | Define verifiable success criteria; loop until checked | Weak “make it work” without tests or proof |
# Install the viral file into your project root
curl -o CLAUDE.md https://raw.githubusercontent.com/multica-ai/andrej-karpathy-skills/main/CLAUDE.md
# Claude Code plugin path (community)
/plugin install andrej-karpathy-skills@karpathy-skills
CLAUDE.md vs SKILL.md — same idea, different harness
| | CLAUDE.md | SKILL.md (Agent Skills) |
|---|---|---|
| Scope | Project-wide behavioural context for Claude Code | Modular capability package in a skill directory |
| Format | Plain markdown (optional frontmatter) | YAML frontmatter (name, description) + markdown body |
| Loading | Read at session start from project root | Progressive disclosure: metadata at startup, full body on task match |
Anthropic launched Agent Skills on 16 October 2025. The official anthropics/skills repository now holds ~151,000 stars — production examples for PDF, Word, Excel, frontend design, MCP builder, and more. On 18 December 2025, Anthropic published Skills as a cross-platform open standard adopted by GitHub Copilot, VS Code, Cursor, Gemini CLI, OpenAI Codex, and dozens of other clients.
How progressive disclosure keeps context lean
Level 1 — name + description preloaded into system prompt at startup
Level 2 — full SKILL.md body read when Claude judges the skill relevant
Level 3+ — bundled reference.md, scripts, templates loaded only when needed
Agents with filesystem access do not need the entire skill in context upfront — the bundled context can grow effectively unbounded while token use stays scoped to the task.
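The first two levels can be illustrated with a small Python sketch. It is not Anthropic's loader: `parse_skill`, `startup_context`, `is_relevant`, and `context_for_task` are hypothetical helpers, the frontmatter parser handles only flat `key: value` lines, and a naive keyword overlap stands in for the model's own judgment about when a skill is relevant.

```python
from dataclasses import dataclass

SKILL_TEXT = """---
name: my-writing-voice
description: Use whenever I ask you to write emails, posts, or articles. Match my voice below.
---
# My Writing Voice
- Short sentences. One idea at a time.
"""


@dataclass
class Skill:
    name: str
    description: str
    body: str  # level 2: only enters context on a task match


def parse_skill(text: str) -> Skill:
    """Split a SKILL.md string into flat frontmatter fields and the markdown body."""
    _, frontmatter, body = text.split("---", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return Skill(meta["name"], meta["description"], body.strip())


def startup_context(skills: list[Skill]) -> str:
    """Level 1: only name + description are preloaded."""
    return "\n".join(f"{s.name}: {s.description}" for s in skills)


def is_relevant(task: str, skill: Skill) -> bool:
    """Naive stand-in for the model's relevance judgment: keyword overlap."""
    task_words = set(task.lower().split())
    keywords = {w.strip(".,").lower() for w in skill.description.split() if len(w) > 4}
    return bool(task_words & keywords)


def context_for_task(task: str, skills: list[Skill]) -> list[str]:
    """Level 2: load full bodies only for skills that match the task."""
    return [s.body for s in skills if is_relevant(task, s)]


skills = [parse_skill(SKILL_TEXT)]
startup = startup_context(skills)
writing_bodies = context_for_task("write two emails to the client", skills)
coding_bodies = context_for_task("fix the failing test", skills)
print(startup)
print(len(writing_bodies), len(coding_bodies))
```

The point of the split is visible in the sizes: the startup string stays one line per skill no matter how long the body or its bundled files grow.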
Skills worth installing now
| Skill | What it does | Best for |
|---|---|---|
| Karpathy CLAUDE.md | Four behavioural coding rules | Any Claude Code / technical project |
| frontend-design (Anthropic) | Distinctive UI direction: typography, palette, anti-template defaults | Breaking “Inter + purple gradient” convergence |
| Decision-making (community) | Maps assumptions, blind spots, tradeoffs before answering | Product, strategy, career decisions |
| Writing voice (custom) | Sentence rhythm, vocabulary, tone from samples | Emails, articles, client comms; works without Cowork |
| File organizer + Cowork | Autonomous local file sorting, renaming, dedup | Desktop agent workflows (Cowork GA April 2026) |
| Skill Creator (Anthropic) | Scaffolds new SKILL.md files to spec | Teams building internal skill libraries |
The frontend-design skill explicitly targets distributional convergence — the statistical-average UI look (cream backgrounds, terracotta accents, near-black + acid green). It forces deliberate aesthetic choices before any component code ships.
Three ways to install
| Surface | Method |
|---|---|
| Claude.ai / Cowork | Settings → Skills → upload .skill package or paste SKILL.md |
| Claude Code | Drop skill folder in project or ~/.claude/skills/; use /install-skill with repo path |
| GitHub | Clone or download; many repos ship a .skill bundle for drag-and-drop |
---
name: my-writing-voice
description: Use whenever I ask you to write emails, posts, or articles. Match my voice below.
---
# My Writing Voice
- Short sentences. One idea at a time.
- Direct — no filler like "it's worth noting".
- Contractions fine. Sounds human.
- Vocabulary: [your words] | Tone: [yours]
Skills vs MCP — complementary, not competing
Simon Willison and others framed Skills as potentially bigger infrastructure than MCP alone for workflow knowledge — procedural “how we do things here” vs MCP’s tool connectivity layer. Anthropic’s own engineering post positions Skills as teaching complex workflows that involve external tools, not replacing MCP servers. Skills encode behaviour; MCP exposes live systems.
Why the ecosystem is self-sustaining
Low barrier — functioning skills can be 20–100 lines; no code required
Cross-model portability — same SKILL.md format works in Cursor, Gemini CLI, Codex CLI
Agent Experience (AX) is the design discipline for working with autonomous agents across days, tools, and artifacts — not the polish of a chat sidebar. When an agent owns an issue for hours, opens a pull request, and answers review comments while you switch tasks, the question stops being “was the prompt pleasant?” and becomes “can I trust, audit, and resume this collaboration?”
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
U[Traditional UX] --> AU[Agent UX]
AU --> AX[Agent Experience AX]
U -->|reactive screens| Q1["Can I use this product?"]
AU -->|chat + control| Q2["Can I interact with this agent?"]
AX -->|relationship over time| Q3["Can I work with this agent over days?"]
AX --> A1[Artifacts issues PRs sessions]
AX --> A2[Parallel workstreams]
AX --> A3[Audit trail chronicle]
AX --> A4[Accountability merge policy]
classDef agent fill:#8B0000,color:#fff
classDef hook fill:#189AB4,color:#fff
classDef decision fill:#444,color:#fff
class U,AU agent
class AX hook
class Q1,Q2,Q3 decision
UX, Agent UX, and AX — three different design problems
| Layer | Software role | Design question | Optimises |
|---|---|---|---|
| UX | Reactive application | Can I use this product? | Friction on screens and workflows |
| Agent UX | Intelligent collaborator in a chat surface | Can I effectively interact with this agent? | Transparency, control, multimodal triggers, trust in one session |
The shift is subtle but decisive. UX optimises interactions. Agent UX optimises conversations. AX optimises collaboration — relationship design once agents act asynchronously, maintain context, and own portions of work. Microsoft’s Agent UX principles (Space, Time, Core) address individual agents; AX extends that lens to multi-session, multi-artifact, multi-agent coordination.
Why AX matters now
GitHub’s rationale for the Copilot desktop app (June 2026) states the bottleneck plainly: agentic development made coding faster but created disjointed workflows, scattered context, and more time reviewing agent-generated code. Internal enablement language matches — once agents generate code quickly, the hard part is managing work coherently across branches, workspaces, validation, review, CI, and merge.
Not a model problem — coherence is an experience-design problem
Async by default — agents continue when the human looks away
Review compounds — more PRs from agents means more judgment surface
Beyond dev tools — any product embedding autonomous agents faces the same delegation question
GitHub reports commits nearly doubled year over year to 1.4 billion per month, with 2 billion Actions minutes per week — agentic workflows are already scaling platform load, not just individual productivity.
Six AX shifts in the GitHub Copilot app
Valentina Alto’s walkthrough of a loyalty-points expiry feature in an e-commerce codebase illustrates AX through GitHub’s agent-native desktop app (technical preview, Build 2026). The feature is incidental; the experience pattern is the lesson.
| Step | AX shift | What changes |
|---|---|---|
| 1 · Artifact start | Unit = tracked work item | Session opens from an issue in My Work, not a blank prompt; context loads automatically; Plan mode starts by default |
| 2 · Plan before act | Intent is inspectable | Agent proposes an implementation plan; developer reviews and edits before any code changes |
| 3 · Persistent workspace | Supervised workstream | Isolated git worktree or cloud sandbox with its own branch; agent researches, edits, runs tests and linters |
| 4 · Parallel coordination | Human as orchestrator | Multiple isolated sessions; switch tasks while agents run; see agents started from GitHub.com in the same My Work view |
| 5 · Continuous delivery path | No context reconstruction | Preview locally, open PR, inspect CI and review activity, spawn a session from the PR for follow-up changes |
| 6 · Durable trail | Work survives the moment | Session history, saved quick chats, /chronicle summaries across app and CLI; intent, execution, and review remain queryable |
AX acknowledges the human as orchestrator — several agent workstreams run in parallel while context stays in one place.
AX primitives GitHub ships today
| Primitive | Role in AX | Notes |
|---|---|---|
| My Work | Control centre | Active sessions, issues, PRs, background automations in one view |
| Git worktrees | Isolation | Each session gets its own branch copy: parallel agents without collision |
| Canvases | Bidirectional work surfaces | Plan, PR, terminal, browser, deployment state; agents update, humans steer on the same surface |
| Cloud / local sandboxes | Bounded action | Ephemeral Linux in cloud or restricted local environment; enterprise policy enforcement |
| Agent Merge | Accountability to ship | Monitors CI, reviewers, failing checks; configurable automation to green, address feedback, or merge |
| Session modes | Autonomy dial | Interactive · Plan · Autopilot; change mid-session |
| Rubber duck agent | Adversarial critique | Separate model reviews plan, implementation, or tests |
| Memory++ / /chronicle | Temporal continuity | Context across app, CLI, VS Code, and GitHub.com sessions |
| Partner agent apps | Ecosystem surface | LaunchDarkly, PagerDuty, Miro, Sonar, and others assignable from GitHub |
Microsoft Agent UX principles mapped to AX
Microsoft Design’s Agent UX framework (Space · Time · Core) predates the AX label but overlaps where agents persist:
| Category | Principle | AX expression |
|---|---|---|
| Space | Connecting, not collapsing | Agents link people, events, and knowledge; the Copilot app ties issues, PRs, and sessions without replacing the developer |
| Space | Accessible yet occasionally invisible | Background sessions with dashboards when human judgment is needed |
| Time · Past | History beyond states | Session logs, canvases, /chronicle; not just “agent is running” |
| Time · Now | Nudging more than notifying | Plan approval gates, Agent Merge prompts, rubber duck critiques at key points |
| | | Visible reasoning, configurable autopilot once trust is earned |
| Core | Transparency, control, consistency | Sandbox policies, merge conditions, skills and MCP extensions for code review |
Chat vs canvas — where AX lives
GitHub’s Build announcement draws a line that defines modern AX tooling: chat is for instruction and ambiguity; canvases are where intent becomes inspectable work. A long chat scroll of decisions and corrections fails once an agent runs for hours. Canvases — plans, diffs, terminals, browser sessions — let humans edit, reorder, approve, or redirect on the same surface the agent updates.
# Session modes (GitHub Copilot app docs)
Interactive — agent suggests; waits for input
Plan — agent plans first; executes after approval
Autopilot — agent writes, tests, iterates without waiting
# Continuity across surfaces
/chronicle standup # summarise recent app + CLI sessions
AX design checklist
| Question | Pass | Fail |
|---|---|---|
| Can I see what happened while I was away? | Session history, canvases, chronicle | Lost chat thread |
| Are agent decisions auditable? | Plan → diff → CI → review chain | Code lands with no trail |
| Does context persist across sessions? | Artifacts, memory, cross-device pickup | Re-prompt from scratch |
| Is responsibility clear? | Merge policy, sandbox boundaries, autopilot gates | Silent writes to main |
| Can I run parallel workstreams? | Isolated worktrees / sandboxes per session | Branch conflicts and tab chaos |
| Does review scale with agent output? | Agentic code review, rubber duck, tiered models | Human-only review bottleneck |
Open horizon — unified work surfaces
Alto’s closing question: will AX converge into a single work surface across development, architecture, marketing, and operations — where work, agents, and decisions are visible end to end? John Maeda’s 2026 Design in Tech framing pushes the same direction: designers move from shaping screens to shaping behaviours, feedback loops, and trust in agentic systems. The risk in a unified surface is not capability but clarity and accountability as role boundaries blur.
The following claims from the article should be verified against primary sources before formal citation:
GitHub commit volume: The article cites "~1.4 billion commits per month, nearly 2× year over year" and "~2 billion Actions minutes per week," attributed to a GitHub blog post. Readers should locate the specific GitHub Octoverse or engineering blog post to confirm the exact figures, baseline year, and methodology. (No URL available to verify at time of writing.)
GitHub Copilot desktop app launch: Confirmed in public reporting as a technical preview announced at Microsoft Build, June 2026, available to Pro, Pro+, Business, and Enterprise tiers. Primary source is the GitHub Blog and GitHub Docs; readers should check github.blog for the canonical announcement.
Microsoft Agent UX principles (Space · Time · Core): Referenced in the article as "Microsoft Design's Agent UX framework." The primary source is the Microsoft Design blog; the exact publication date and URL were not independently verified in this session.
John Maeda's 2026 Design in Tech report: Cited without a URL. Maeda's annual Design in Tech report is typically published via his personal site or KPCB/Automattic channels; the 2026 edition should be located there for the quoted framing around designers shaping behaviours and feedback loops in agentic systems.
Horvitz CHI '99 (Principles of Mixed-Initiative User Interfaces): This is a verifiable ACM Digital Library paper: Horvitz, E. (1999). Principles of mixed-initiative user interfaces. CHI '99 Proceedings. It predates web-accessible AI agents but is a legitimate intellectual ancestor of AX thinking around human-agent task delegation.
Claude Fable 5 is not a longer chat window — it is a Mythos-class orchestrator built for days-long agent runs, sub-agent delegation, and vision self-checks. Most teams still prompt it for five minutes and close the tab. The fix is a self-improving system: four compound layers, three orchestration primitives, and memory that sharpens every run while the model weights stay fixed.
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph BT
P[Layer 1 Primitives] --> O[Layer 2 Orchestration]
O --> M[Layer 3 Memory]
M --> S[Layer 4 Self-improvement]
S -->|distill rules| M
M -->|read at start| P
O -->|/goal Outcomes routines| P
classDef agent fill:#8B0000,color:#fff
classDef hook fill:#189AB4,color:#fff
class P agent
class O,M,S hook
What Fable 5 actually is
Anthropic launched Fable 5 on 9 June 2026 as the first publicly available Mythos-class model — one tier above Opus, with built-in safety classifiers (Mythos 5 without classifiers remains Glasswing-only). Headline capabilities from launch docs:
Days-long sessions in Claude Code or Claude Managed Agents (CMA)
Pricing — $10/M input, $50/M output (~5× Opus 4.8); 90% prompt-cache discount on input
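To make the pricing line concrete, here is a back-of-envelope cost estimate in Python. It is a sketch under assumptions: the rates are the figures quoted above, the 90% discount is assumed to apply only to the cached share of input tokens, and `run_cost` plus the token counts are illustrative, not measured.

```python
INPUT_PER_M = 10.00    # USD per million input tokens (figure quoted above)
OUTPUT_PER_M = 50.00   # USD per million output tokens (figure quoted above)
CACHE_DISCOUNT = 0.90  # assumed to apply only to cached input tokens


def run_cost(input_tokens: int, output_tokens: int, cached_fraction: float = 0.0) -> float:
    """Estimate the USD cost of one run, rounded to cents."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    input_cost = (fresh * INPUT_PER_M + cached * INPUT_PER_M * (1 - CACHE_DISCOUNT)) / 1_000_000
    output_cost = output_tokens * OUTPUT_PER_M / 1_000_000
    return round(input_cost + output_cost, 2)


# Hypothetical long session: 2M input tokens, 200K output tokens.
uncached = run_cost(2_000_000, 200_000)
mostly_cached = run_cost(2_000_000, 200_000, cached_fraction=0.8)
print(uncached, mostly_cached)
```

With these illustrative numbers, caching 80% of the input roughly halves the bill, which is why long orchestration runs that re-read the same STATE and Skills files benefit most from the cache.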
Critical distinction: self-improving ≠ self-learning. No production public model updates its own weights from your sessions. Self-improving means the system compounds — STATE files, Skills, eval loops — while Fable 5 stays the same orchestrator.
The four-layer compound stack
| Layer | Components | Without it |
|---|---|---|
| 1 · Primitives | Fable 5, sub-agents, worktrees, tools | Raw capability, no workflow |
| 2 · Orchestration | /goal, Outcomes, Dynamic Workflows, Routines | One-shot prompts, no loops |
| 3 · Memory | STATE.md, Skills, knowledge bases | Every session restarts blind |
| 4 · Self-improvement | Vision checks, eval loops, rule distillation | Output never sharpens Skills |
14 steps in three tiers
| Tier | Steps | Focus |
|---|---|---|
| Part 1: Unlock | 01–04 | Mythos class, self-improving vs self-learning, compound stack, model routing |
| Part 2: Primitives | 05–09 | /goal vs Outcomes, verifier, Dynamic Workflows, worktrees, Routines |
| Model | Role | Typical work |
|---|---|---|
| | | Architecture, complex debug; auto-fallback when Fable classifiers block |
| Sonnet 4.6 | Workers | Lint, refactors, test scaffolding, doc updates (bulk fan-out) |
| Haiku 4.5 | Graders | Independent verifier / cheap classifier context |
Production pattern: Fable orchestrator + Sonnet workers + Haiku graders + Opus fallback. Reserve Mythos-tier pricing for orchestration — not lint fixes.
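The routing pattern can be expressed as a small lookup, sketched here in Python. The tier labels (`"fable"`, `"sonnet"`, `"haiku"`, `"opus"`), the task-kind names, and the `route` function are hypothetical placeholders rather than real model IDs or an Anthropic API; the mapping simply mirrors the roles described above.

```python
# Task kind -> model tier, mirroring the orchestrator / worker / grader split.
ROUTES = {
    "orchestrate": "fable",     # planning and delegation
    "lint": "sonnet",           # bulk worker tasks
    "refactor": "sonnet",
    "test-scaffold": "sonnet",
    "docs": "sonnet",
    "grade": "haiku",           # independent verifier
}
FALLBACK = "opus"  # used when the orchestrator tier's classifiers block a task


def route(task_kind: str, classifier_blocked: bool = False) -> str:
    """Pick a tier for a task; raises KeyError for task kinds with no route."""
    tier = ROUTES[task_kind]
    if tier == "fable" and classifier_blocked:
        return FALLBACK  # surface the fallback explicitly instead of failing silently
    return tier


for kind in ("orchestrate", "lint", "grade"):
    print(kind, "->", route(kind))
print("orchestrate (blocked) ->", route("orchestrate", classifier_blocked=True))
```

Raising on unknown task kinds is a deliberate choice in this sketch: an unrouted task is a design gap to fix, not something to send to the most expensive tier by default.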
/goal vs Outcomes — same shape, different harness
| | /goal (Claude Code) | Outcomes (CMA) |
|---|---|---|
| Harness | Local session | Cloud Managed Agents |
| Duration | Minutes–hours in-terminal | Hours–days on hosted sandbox/GPUs |
| Goal format | Plain-text condition | File rubric + gradable criteria |
| Grader | Fast model (Haiku default) | Sub-agent grader |
| Best for | Flaky tests, single-file refactors | ML training, long migrations, Parameter Golf-style runs |
# Claude Code — v2.1.139+
/goal all tests in test/auth pass and the lint step is clean
# Non-interactive
claude -p "/goal CHANGELOG.md has an entry for every PR merged this week"
Step 6: verifier sub-agent beats self-critique
The maker sees its own reasoning trail; the verifier sees only the artifact and rubric — Anthropic measured this on Fable 5 in Parameter Golf.
Anthropic engineers report: “We’ve found that a verifier sub-agent tends to outperform self-critique with Fable 5.” In the Parameter Golf experiment (8×H100, up to 8 hours), Fable 5 with an independent verifier achieved roughly 6× more pipeline improvement than Opus 4.7 — making structural architecture bets and pushing through quantization regressions instead of repeating scalar tweaks.
Dynamic Workflows, worktrees, and Routines
Dynamic Workflows (Claude Code, 28 May 2026) let the model write a custom JS harness with agent(), parallel(), and pipeline(). Three patterns matter for self-improving systems:
Fan-out-and-synthesize — parallel agents, clean context per piece
Adversarial verification — independent verifier per maker
Loop until done — pair with /goal for hard stop conditions
Worktrees are mandatory when Fable 5 spawns parallel sub-agents — maker in worktree A, verifier read-only in B, or one worktree per structural experiment.
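The fan-out-and-synthesize pattern can be approximated in Python with `asyncio`. This is an analogue for illustration, not the Dynamic Workflows JS harness: the `agent` coroutine here is a hypothetical stub that returns a canned string where a real harness would call a model, and `fan_out_and_synthesize` is an invented name.

```python
import asyncio


async def agent(prompt: str) -> str:
    """Hypothetical stand-in for a sub-agent call; a real harness would call a model."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"summary of {prompt}"


async def fan_out_and_synthesize(pieces: list[str]) -> str:
    # Fan out: each piece gets its own call, so no piece sees another's context.
    partials = await asyncio.gather(*(agent(piece) for piece in pieces))
    # Synthesize: one final call sees only the partial results, in input order.
    return await agent(" | ".join(partials))


report = asyncio.run(fan_out_and_synthesize(["auth module", "billing module"]))
print(report)
```

`asyncio.gather` returns results in the order the calls were submitted, so the synthesis step stays deterministic even when the sub-agents finish out of order.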
/schedule daily at 7am, use Fable 5 in CMA
Goal: Re-run yesterday's eval suite against the latest skills.
Any test that newly passes → distill the pattern into the skill.
Any test that newly fails → investigate, document in STATE.md.
Post the digest to #engineering. /goal don't stop until digest is
posted and STATE.md is updated.
Routines (research preview since 14 April 2026) run saved configs on Anthropic cloud — schedule, API, or GitHub event triggers — so laptop-off compounding is possible. Parameter Golf-class runs need CMA, not a closed laptop.
# STATE.md — five sections matching the progression
## Verified facts # stage 3
## General rules # stage 4
## Open failures # stages 1–2
## Lessons learned # stage 4 distillations
## Last session # stage 5 resume pointer
Operational rules: write before walking away (every session ends with a STATE update) and read at session start (without this, Fable 5 degrades to Sonnet-class memory behaviour). Skills in ~/.claude/skills/ carry procedural memory across projects — every confirmed lesson goes into the Skill, not just chat.
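The two operational rules (read at session start, write before walking away) can be sketched as a pair of Python helpers. This is a minimal illustration assuming the five section headings shown above; `load_state` and `save_state` are hypothetical names, the parser is deliberately simple, and the example entries are invented.

```python
import tempfile
from pathlib import Path

SECTIONS = ["Verified facts", "General rules", "Open failures", "Lessons learned", "Last session"]


def load_state(path: Path) -> dict[str, list[str]]:
    """Parse STATE.md into {section: lines}; a missing file yields empty sections."""
    state = {name: [] for name in SECTIONS}
    if not path.exists():
        return state
    current = None
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.startswith("## "):
            heading = line[3:].split("#")[0].strip()  # drop a trailing "# stage N" note
            current = heading if heading in state else None
        elif current and line.strip():
            state[current].append(line.strip())
    return state


def save_state(path: Path, state: dict[str, list[str]]) -> None:
    """Write all sections back in a fixed order so diffs stay readable."""
    parts = ["# STATE.md"]
    for name in SECTIONS:
        parts.append(f"## {name}")
        parts.extend(state[name])
    path.write_text("\n".join(parts) + "\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "STATE.md"
    state = load_state(state_file)                 # session start: read first
    state["Lessons learned"].append("- Pin the linter version in CI")
    state["Last session"] = ["- Stopped after fixing test/auth"]
    save_state(state_file, state)                  # session end: write before leaving
    reloaded = load_state(state_file)              # next session picks up where this one stopped

print(reloaded["Last session"])
```

Keeping the file as plain markdown with fixed headings means both the agent and a human reviewer can read and edit the same memory without tooling.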
Vision verify and the Mythos safety boundary
For UI work: maker renders screenshot → verifier (vision) compares against goal, design tokens, and prior screenshot in STATE.md → loop on mismatch. Same pattern as Parameter Golf reading training charts visually.
Fable 5 classifiers decline in cybersecurity vulnerability research, biology, chemistry, and model distillation — then fall back to Opus 4.8. Design Skills to surface this explicitly; silent classifier blocks look like real errors until you debug them.
Common mistakes that waste Fable 5
| Mistake | Why it hurts |
|---|---|
| 5-minute prompt-and-close | Burns Mythos pricing with zero compound effect |
| Self-critique only | Maker grades its own homework; measured worse than a verifier |
| No STATE.md | 70%+ of memory advantage disappears |
| Static Skills | Lessons die in chat instead of compounding |
| Fable on Sonnet tasks | 5× cost for lint and doc edits |
| Long runs on laptop only | Days-long capability needs CMA/Routines |
| No vision-verify on UI | Text-only graders miss the failure that matters |
| Skipping /goal or Outcomes | Loops stop at “handled enough”, not done |
Performance summary
| Metric | Value | Source context |
|---|---|---|
| Fable 5 pricing | $10/M in · $50/M out | ~5× Opus 4.8 |
| Prompt cache | 90% input discount | Anthropic pricing |
| Parameter Golf vs Opus 4.7 | ~6× more improvement | 8×H100, up to 8h, independent verifier |
| Memory verification coverage | Fable 5: 73% · Opus 4.7: ~17% median | Continual Learning Bench 1.0 |
| /goal min version | Claude Code v2.1.139+ | code.claude.com docs |
| Dynamic Workflows ship date | 28 May 2026 | Claude Code |
| Routines preview | 14 April 2026 | Cloud triggers |
Bottom line
Self-improvement is a property of the system, not the model — build the system.
Research supplement
Note on sourcing: The following supplements are based on publicly documented Anthropic model information and established agentic AI engineering literature as of mid-2026. All claims should be verified before citation.
Claude Fable 5 model ID: claude-fable-5, confirmed in the Anthropic model registry as the most recent Claude flagship as of June 2026. See the official Anthropic documentation for current pricing and context window specifications.
Self-reflection in LLMs: The academic foundation for self-improving loops traces to the "Reflexion" paper (Shinn et al., 2023) and "Self-Refine" (Madaan et al., 2023), both of which demonstrated that iterative verbal feedback improves task performance across coding, reasoning, and generation benchmarks. These are worth citing as prior art if the article doesn't already.
Agentic safety: Anthropic's published model card and responsible scaling policy for Fable 5 (if available) would be the authoritative source for how prompt injection and loop amplification risks are addressed at the model level.
LangGraph is a low-level Python orchestration runtime for long-running, stateful agents—compile graphs with StateGraph, persist runs through checkpointers, and ship the same graph locally or via the CLI.
The repo README positions LangGraph as orchestration—not prompt design. Official docs live on docs.langchain.com; the GitHub docs/ folder only holds redirects.
Monorepo Packages Under libs/
| Package | Version | Purpose |
|---|---|---|
| langgraph | 1.2.4 | StateGraph, compile, invoke/stream |
| langgraph-prebuilt | 1.1.0 | create_react_agent, ToolNode |
| langgraph-checkpoint | 4.1.x | BaseCheckpointSaver, serde |
| langgraph-checkpoint-sqlite / postgres | (not listed) | Dev and production persistence |
| langgraph-cli | 0.4.29 | dev, up, build, dockerfile |
| langgraph-sdk | 0.4.x | HTTP client for remote graphs |
Source: libs/langgraph/pyproject.toml, manifest tree (324 files, no apps/). The examples/ directory is archival—current tutorials are on the docs site.
Pattern from root README and docs overview: define nodes, wire edges, compile(), then invoke or stream. Optional checkpointer= on compile enables durable threads.
From libs/checkpoint/README.md: checkpointing requires thread_id in config["configurable"]. Optional checkpoint_id selects a resume point. Set LANGGRAPH_STRICT_MSGPACK=true for safer deserialization in new apps.
Prebuilt ReAct Agent
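from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent
def search(query: str) -> str:
    """Stub tool for illustration; replace with a real search call."""
    return "It's sunny in San Francisco."
# create_react_agent accepts plain callables as tools.
app = create_react_agent(ChatAnthropic(model="claude-3-7-sonnet-latest"), tools=[search])
app.invoke({"messages": [{"role": "user", "content": "weather in sf"}]})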
Source: libs/prebuilt/README.md. Prebuilt ships with langgraph—do not install langgraph-prebuilt alone.
Minimal config from libs/cli/README.md. CLI 0.4.29 adds HTTPS dev via certfile/key flags. Install inmem server: pip install "langgraph-cli[inmem]" (Python ≥3.11).
Release Snapshot
Each graph step can snapshot state so a paused or failed run resumes from the last checkpoint instead of restarting from scratch.
LangGraph Studio (from the CLI js-examples static assets) visualises compiled graphs during local development.
| Tag | Date | Headline |
|---|---|---|
| cli==0.4.29 | 2026-06-11 | HTTPS dev server cert support |
| cli==0.4.28 | 2026-06-10 | ty type checker, TS 6 tooling |
| langgraph==1.2.4 | 2026-06-02 | _on_started compat fix |
Core is post-1.0 (Production classifier). SDK v3 thread streaming is additive beta; v2 runs.stream() unchanged per libs/sdk-py/CHANGELOG.md.
Summary
| Item | Value |
|---|---|
| License | MIT |
| Python | ≥3.10 |
| Install | pip install -U langgraph |
| Stars | 34,458 |
| Docs | docs.langchain.com/oss/python/langgraph/ |
At WWDC26 on 8 June 2026, Apple previewed Siri AI and the next generation of Apple Intelligence on iOS 27, iPadOS 27, and macOS 27—powered by Apple Foundation Models built with Google Gemini and split across on-device Apple silicon and Private Cloud Compute.
| Field | Detail |
|---|---|
| Date | Announced 8 June 2026 (WWDC26) |
| Vendor | Apple |
| Products | Siri AI; Apple Intelligence across iOS/iPadOS/macOS/watchOS/visionOS 27 |
| Model stack | Apple Foundation Models (Gemini collaboration); on-device + Private Cloud Compute |
| Developer frameworks | Foundation Models framework (Swift, on-device + PCC + third-party LLMs); Core AI (custom PyTorch on Apple silicon); App Intents for Siri actions |
| Availability | Developer Program beta 8 June 2026 (iOS/iPadOS/macOS/visionOS); watchOS Siri beta later; public beta next month; user Siri beta English-first later in 2026; GA fall 2026 |
| | iPhone Air, iPhone 17 Pro/Max, iPad (M4)+ with ≥12GB RAM, Mac (M3)+ with ≥12GB, Vision Pro (M5): expressive voices, advanced dictation |
| Pricing / limits | Server-model features (e.g. photorealistic Image Playground) carry daily usage caps; expanded access on most iCloud+ plans (numeric quotas not published); compatible Home cameras included on qualifying iCloud+ tiers |
| Regional gates | EU: Siri AI on Mac and Vision Pro initially, not iOS/iPadOS/watchOS; China: unavailable pending regulatory work; Apple Intelligence supports 17 languages |
What changed
Siri AI replaces the legacy assistant with personal-context search across Messages, Mail, and Photos; on-screen and Camera-mode awareness; expanded systemwide app actions; web-grounded answers; and a dedicated Siri app with iCloud-private conversation sync across iPhone, iPad, Mac, Watch, and Vision Pro.
Invocation surfaces expand beyond “Hey Siri” to Dynamic Island swipe (iPhone), Spotlight (iPad/Mac), control-click context menus, and Vision Pro look-to-speak with 3D visualisation.
On-device plumbing includes a system orchestrator, Spotlight index, and App Toolbox that keep personal-context processing local before escalating frontier workloads.
Apple Foundation Models are custom-built in collaboration with Google Gemini for deeply integrated experiences—not exposed as a raw Gemini API to consumers per Apple’s Intelligence announcement.
Hybrid execution runs models on device and on Private Cloud Compute; PCC retains Apple’s no-storage privacy promise with ongoing external verification.
Image Playground adds photorealistic generation on PCC with hidden SynthID watermarks; Photos gains Spatial Reframing and other on-device intelligence features.
Developer betas for Siri AI ship 8 June 2026 on iOS, iPadOS, macOS, and visionOS; watchOS follows in a future beta.
Developer integration surface
Foundation Models framework (Swift) is the primary LLM integration path: on-device sessions, Private Cloud Compute for frontier tasks, tool calling, Dynamic Profiles for multi-model routing, and third-party models via the Language Model protocol (Gemini, Claude, and others). Apple plans to open-source the framework core later in summer 2026. Use it when you want Apple-hosted intelligence inside your app without managing API keys or PCC authentication.
Core AI is a separate stack for deploying custom PyTorch models on Apple silicon—Python conversion tools, ahead-of-time compilation in Xcode, Swift inference APIs, and Core AI debugging instruments. Use Core AI when you bring your own weights; use Foundation Models when you consume Apple’s Foundation Models or attach approved third-party LLM providers.
App Intents and Spotlight integrations extend Siri AI personal context to third-party apps. View Annotations and on-screen-awareness APIs let apps participate in Siri’s screen-context flows without exposing raw screenshots to external model vendors.
Why it matters for engineers
Apple’s WWDC26 stack is a platform inference architecture, not a single model API. Builders should plan for dual execution paths: on-device Foundation Models for latency- and privacy-sensitive personal context, and PCC for frontier workloads (photorealistic image generation, broad world knowledge) with quota limits. This article covers the consumer Siri AI and developer framework launch; it is distinct from Apple’s PCC infrastructure expansion on Google Cloud NVIDIA hardware, which focused on attestation, fleet ledgers, and confidential-GPU hosting rather than Siri UX and App Intents.
Feature-detect against two hardware tiers before shipping voice or dictation features: the base Apple Intelligence list (iPhone 16+, M1+ Mac/iPad) differs from the advanced on-device model tier (M4+/M3+ with ≥12GB unified memory, iPhone 17 Pro family) required for expressive voices and advanced dictation.
Server-model daily caps and iCloud+ entitlements mean client apps must degrade gracefully when users exhaust allotments—Apple has not published numeric quotas, but photorealistic Image Playground and similar PCC-backed features are explicitly rate-limited. Enterprise Mac teams should plan fall GA as a coordinated OS 27 rollout with regional gates: EU iOS/iPadOS Siri AI is deferred whilst Mac and Vision Pro proceed.
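The graceful-degradation requirement above can be sketched in a language-neutral way. The Python below is only an illustration of the control flow, not Apple's API: `QuotaExhausted`, `ImageResult`, and the two generator callables are hypothetical names, and a real app would use the Swift frameworks described earlier.

```python
from dataclasses import dataclass
from typing import Callable


class QuotaExhausted(Exception):
    """Hypothetical error raised when a server-side daily cap is reached."""


@dataclass
class ImageResult:
    source: str      # "server" or "on_device"
    payload: str
    degraded: bool   # True when the capped server path was unavailable


def generate_with_fallback(
    prompt: str,
    server_generate: Callable[[str], str],
    local_generate: Callable[[str], str],
) -> ImageResult:
    """Try the capped server path first; fall back to the local path on quota errors."""
    try:
        return ImageResult("server", server_generate(prompt), degraded=False)
    except QuotaExhausted:
        # Numeric quotas are unpublished, so exhaustion is treated as a normal event.
        return ImageResult("on_device", local_generate(prompt), degraded=True)
```

The `degraded` flag lets the UI explain why a lower-fidelity result appeared instead of failing silently.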
For teams comparing hyperscaler assistants: Apple exposes no raw Gemini or Claude endpoint. Capabilities arrive through Foundation Models framework sessions and Siri AI system channels—simplifying privacy review but limiting custom prompt engineering relative to direct API integrations.
Personal-context Siri workloads stay on Apple silicon; frontier models run in Private Cloud Compute without storing user prompts.
Intelligence routing at WWDC26
flowchart TB
USER["User or app request"]
LOCAL["On-device Foundation Models"]
PCC["Private Cloud Compute"]
ANS["Response to user"]
USER --> LOCAL
LOCAL -->|"personal context"| ANS
LOCAL -->|"frontier workload"| PCC
PCC --> ANS
Research supplement
The following notes flag external sources worth checking to deepen specific claims in the article; all URLs listed come from the author's own reference set.
PCC architecture and security model: Apple first published technical documentation on Private Cloud Compute at WWDC24 and via its security research blog. Readers seeking the external verification mechanism referenced in this article should consult Apple's current security documentation for any updates since the original 2024 PCC white paper.
SynthID watermarking: SynthID is Google DeepMind's AI content watermarking standard. Its appearance in Apple's Image Playground outputs is a direct consequence of the Gemini collaboration. DeepMind's public SynthID documentation would clarify the detection and verification process for watermarked outputs.
App Intents and Core AI framework evolution: The Core AI framework reference at developer.apple.com/documentation/coreai (author reference #3) is the authoritative current source for developer integration details; readers building for iOS 27 should treat this as primary documentation over any third-party summary.
Anthropic shipped Claude Fable 5 on 9 June 2026—a Mythos-class frontier model for general use with classifier fallbacks to Claude Opus 4.8 on sensitive cyber, biology, and distillation queries—alongside restricted Claude Mythos 5 access for Project Glasswing defenders and separate biology trusted-access programmes.
Short video walkthrough
Engineering walkthrough — ElevenLabs narration, HeyGen bookends, API vs claude.ai defaults, and official Anthropic B-roll (~6 min).
| Field | Detail |
| --- | --- |
| Date | General availability 9 June 2026 |
| Vendor | Anthropic |
| Products | Claude Fable 5 (GA); Claude Mythos 5 (Glasswing cyber partners only) |
| API model ID | claude-fable-5 (Mythos 5 has no general API ID) |
| Availability | API and consumption-based Enterprise: full access from launch; claude.ai and third-party surfaces; subscription plans staged through 22 June 2026 |
| | Included on Pro, Max, Team, and seat-based Enterprise through 22 June 2026; usage credits from 23 June until capacity allows reinclusion |
| Safeguards | Cyber, bio/chem, and distillation classifiers route to Opus 4.8 with user notification; triggers in <5% of sessions on average (>95% run Fable with Mythos-equivalent performance) |
| Data retention | 30-day retention on Mythos-class business traffic (first- and third-party surfaces); not used for training; human access logged |
What changed
Claude Fable 5 is Anthropic’s first generally available Mythos-class model, with state-of-the-art scores on software engineering, knowledge work, vision, and long-horizon agent benchmarks; per the launch post, its lead grows as tasks become longer and more complex.
New safety classifiers extend constitutional-classifier work: cyber (exploitation plus offensive agentic hacking), biology/chemistry (broad fallback during launch), and distillation (large-scale capability extraction) all route flagged prompts to Claude Opus 4.8 instead of refusals.
Claude Mythos 5 shares Fable 5 weights with cyber safeguards lifted for existing Project Glasswing partners upgrading from Mythos Preview; comparable or stronger performance at substantially lower cost.
Biology trusted access (separate from Mythos 5) will offer Fable 5 with bio/chem classifiers removed but cyber classifiers still active to a small life-sciences cohort—broader enrolment planned as safeguards narrow.
Pricing halved versus Mythos Preview on API and consumption-based Enterprise plans.
30-day retention is required for Mythos-class business traffic to detect novel jailbreaks; data deleted after 30 days with logged human access (Anthropic support article).
Red-team validation: external bug bounty reported no universal jailbreak in 1,000+ hours; zero compliance on harmful single-turn cyber requests across 30 public jailbreak techniques in partner testing.
Subscription rollout is demand-sensitive: included at no extra cost on paid Claude plans through 22 June 2026, then usage credits until capacity stabilises.
Capability evidence for builders
Software engineering: Stripe reported a 50-million-line Ruby migration in one day (versus an estimated two-plus months manually); Cognition’s FrontierCode ranks Fable 5 highest among frontier models at medium effort with improved token efficiency.
Knowledge work: highest score on Hebbia’s Finance Benchmark; IMC reported near-perfect trading-analysis results across factual lookup, root-cause analysis, and expected-value reasoning.
Vision: state-of-the-art on vision tasks; completed Pokémon FireRed vision-only without navigation harnesses that prior Claude models required.
Memory: on Slay the Spire agent runs, file-based memory produced a threefold improvement versus Opus 4.8 and threefold higher final-act completion rates.
Alignment: automated assessments place Mythos 5's misaligned behaviour at a level similar to Opus 4.8, per the system card.
Why it matters for engineers
Teams wiring production agents must treat Fable 5 as a two-model endpoint: more than 95% of sessions never trigger fallback, but cyber-hardening, bioinformatics, or suspicious bulk-extraction patterns transparently downgrade to Opus 4.8 with user notification. Log response metadata and surface fallback events to operators—latency and capability profiles differ, and conservative classifier tuning means benign security research queries can still trip safeguards during the launch window.
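One way to surface fallback events is to compare the model you asked for with the model the response reports. This is a minimal sketch under two assumptions: that the response exposes the serving model under a `model` key, and that a fallback shows up as a different ID there. The Opus ID used in the usage note is a placeholder, not a documented value.

```python
import logging
from typing import Mapping

logger = logging.getLogger("fable_fallback")

REQUESTED_MODEL = "claude-fable-5"  # model ID quoted in this article


def served_by_fallback(
    response: Mapping[str, object], requested: str = REQUESTED_MODEL
) -> bool:
    """Return True when the response reports a different model than requested.

    Assumption: the serving model is available under response["model"].
    A missing field is treated as "unknown", not as a fallback.
    """
    served = str(response.get("model", ""))
    is_fallback = bool(served) and not served.startswith(requested)
    if is_fallback:
        # Operators need this signal: latency and capability profiles differ.
        logger.warning("fallback: requested=%s served=%s", requested, served)
    return is_fallback
```

Calling `served_by_fallback({"model": "claude-opus-4-8"})` would log a warning and return `True`; a prefix match is used so that a dated suffix on the requested ID does not count as a fallback.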
The API and consumption-based Enterprise path is the reliable integration surface from day one. Subscription inclusion is time-boxed and demand-sensitive; capacity planning for long autonomous coding runs should prefer metered API tiers. Mythos 5 remains outside general API access—cyber defenders need Glasswing or a future trusted-access application; biology researchers follow the separate Fable-without-bio-classifiers programme.
Long-context and file-backed memory improvements matter for multi-hour agent loops: Fable 5 sustains focus across millions of tokens and benefits disproportionately from persistent notes versus Opus 4.8. Vision-only harnesses now complete screenshot-to-code and scientific-figure extraction tasks that previously required scaffolding.
Regulated workloads must account for 30-day Mythos-class retention on business traffic, logged human access to stored prompts, and explicit prohibition on training use. Benchmark harnesses that resemble distillation attacks may trigger classifiers—design eval pipelines to tolerate Opus 4.8 fallbacks or isolate test traffic from production API keys.
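An eval pipeline can tolerate fallbacks by scoring each serving model separately, so downgraded samples do not skew the headline number. A small sketch, assuming each sample record carries hypothetical `model` and `passed` fields:

```python
from typing import Iterable, Mapping


def score_by_model(samples: Iterable[Mapping[str, object]]) -> dict:
    """Average the 'passed' flag per serving model.

    Each sample is assumed to look like {"model": "<id>", "passed": bool}.
    """
    buckets: dict = {}
    for sample in samples:
        model = str(sample["model"])
        buckets.setdefault(model, []).append(1 if sample["passed"] else 0)
    return {model: sum(flags) / len(flags) for model, flags in buckets.items()}
```

Reporting the per-model sample counts alongside these averages makes it obvious when a harness is tripping classifiers more often than expected.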
Most Fable 5 sessions run at full frontier capability; cyber, biology, and distillation classifiers route sensitive prompts to Opus 4.8 instead of blocking.
Classifier fallback in production
flowchart LR
REQ["Agent or app request"]
CLS["Safety classifiers"]
FABLE["Fable 5 response"]
OPUS["Opus 4.8 fallback"]
OUT["Answer delivered"]
REQ --> CLS
CLS -->|"typical workload"| FABLE
CLS -->|"cyber bio distillation"| OPUS
FABLE --> OUT
OPUS --> OUT
Research supplement
The classifier-fallback approach described for Fable 5 relates to broader AI safety literature on output filtering versus refusal. Anthropic's published safety work (ASL-3 and higher commitments) has flagged cyber and CBRN (chemical, biological, radiological, nuclear) as priority dual-use categories. Two of the three Fable 5 classifier domains (cyber and bio/chem) map directly onto these commitments; the distillation classifier targets large-scale capability extraction instead. The system card cited in the article (claude-fable-5-mythos-5-system-card) is the primary source for evaluating classifier accuracy claims independently.
Project Glasswing is described at anthropic.com/glasswing as a defenders-focused initiative; the article does not reproduce its full scope. Engineers evaluating Mythos 5 access should consult that page directly for enrollment criteria.
The API model ID (claude-fable-5) and current pricing are listed in Anthropic's models overview at platform.claude.com/docs/en/about-claude/models/overview, which is the authoritative source for integration and should be checked against the article's stated rates before capacity planning.
Google Colab CLI turns Colab from a browser-only notebook into a programmable remote runtime you drive from your terminal — provision a T4 or A100, pipe a local .py file to a Jupyter kernel in the cloud, pull checkpoints back, and tear the VM down, without opening a tab. Google shipped it in June 2026 as an agent-ready bridge between local dev machines and Colab compute.
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph LR
T[Local terminal] -->|colab new / exec| API[Colab assign API]
API -->|runtime proxy token| VM[Remote Colab VM]
VM --> K[Jupyter kernel]
K --> GPU[GPU or TPU]
VM -->|colab download| A[Local artifacts]
API -->|keep-alive 60s| VM
classDef agent fill:#8B0000,color:#fff
classDef hook fill:#189AB4,color:#fff
class T,A agent
class API,VM,K,GPU hook
What problem it solves
Before the CLI, Colab meant: open a notebook in Chrome, click Connect, upload files manually, and babysit the runtime. That breaks down for shell pipelines, CI-style jobs, and coding agents that only speak bash. The CLI exposes the same rented VMs through commands like colab new --gpu T4, colab exec -f train.py, and colab run --gpu T4 train.py — a one-shot provision → execute → teardown path.
Google’s launch post positions it for both humans and agents: any tool with terminal access (Claude Code, Codex, Antigravity, etc.) can provision accelerators, install packages with uv, run local scripts remotely, export replayable .ipynb logs, and download weights — without writing cloud provisioning code yourself.
How the architecture works
| Layer | What it does | Where it lives |
| --- | --- | --- |
| CLI (Typer) | Commands, session names, auth | Your Mac or Linux machine |
| Assign API | Allocate VM, return endpoint + proxy token | colab.research.google.com/tun/m/assign |
| Keep-alive daemon | Ping every 60s; 24h cap | Detached local process per session |
| Jupyter kernel | Execute Python via WebSocket | Remote VM (/content cwd) |
| Contents API | Upload/download/list files | Same VM via Jupyter HTTP |
| Local state | Session metadata, kernel id | ~/.config/colab-cli/sessions.json |
Important detail: colab exec -f script.py reads the file locally and sends source to the kernel — you do not need a separate upload step for execution. Use colab upload / colab download for datasets, checkpoints, and zips.
Install and authenticate
# Recommended
uv tool install google-colab-cli
# Or pip (requires Python 3.13+)
pip install google-colab-cli
# Quick smoke test
colab new
echo "print('Hello from Colab')" | colab exec
colab stop
Two auth layers matter:
CLI → Colab control plane — --auth oauth2 (browser flow, token in ~/.config/colab-cli/token.json) or --auth adc (Application Default Credentials — preferred for agents).
VM → GCP services — colab auth inside a session for BigQuery/GCS; separate from CLI login.
Accelerator access is subscription- and quota-gated. HTTP 400 on colab new --gpu X usually means no entitlement — fall back to T4 or CPU. Unrecognized --gpu values silently map to A100 in the client; spell GPU names exactly.
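Because a misspelled GPU name can silently become an A100 request, a wrapper script may want to validate names before calling the CLI. A minimal sketch: the allowlist contains only the two names this article mentions and is an assumption, not the CLI's full list.

```python
from typing import Optional

# Only the accelerator names mentioned in this article; extend as needed.
KNOWN_GPUS = ("T4", "A100")


def new_session_args(gpu: Optional[str]) -> list:
    """Build `colab new` arguments, rejecting GPU names not on the allowlist."""
    if gpu is None:
        return ["colab", "new"]  # no accelerator requested
    if gpu not in KNOWN_GPUS:
        # Fail locally instead of letting the client pick a different GPU.
        raise ValueError(f"unknown GPU name {gpu!r}; expected one of {KNOWN_GPUS}")
    return ["colab", "new", "--gpu", gpu]
```

The comparison is deliberately case-sensitive, matching the advice to spell GPU names exactly.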
Built for coding agents
The CLI ships COLAB_SKILL.md via colab skill — agents get session rules, safe commands, and ADC auth without scraping the README.
Google’s Gemma fine-tuning demo is the canonical agent pattern:
For parallel jobs, isolate state: colab --config /tmp/job-a.json new -s trainer-a. Always name sessions and call colab stop — idle VMs burn compute units even with keep-alive.
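The "always call colab stop" rule is easy to enforce from a script with `try`/`finally`. The sketch below uses only the subcommands shown in this article (`colab new --gpu`, `colab exec -f`, `colab stop`); the function names and the injectable `run` parameter are illustrative choices, not part of the CLI.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_cmd(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def train_remotely(script: str, gpu: str = "T4", run: Runner = run_cmd) -> int:
    """Provision a runtime, execute a local script on it, and always stop it."""
    if run(["colab", "new", "--gpu", gpu]) != 0:
        return 1  # nothing was provisioned, so there is nothing to stop
    try:
        return run(["colab", "exec", "-f", script])
    finally:
        # Runs even if exec fails or raises: idle VMs keep burning compute units.
        run(["colab", "stop"])
```

Passing the runner as a parameter keeps the lifecycle logic testable without a Colab account; in real use the default `run_cmd` shells out to the installed CLI.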
chmod +x script.py && ./script.py provisions a fresh VM, runs the script with forwarded sys.argv, propagates exit codes, and tears down unless --keep is set. CLI status messages go to stderr; script stdout stays clean for piping.
Loop engineering means you stop being the person who types every prompt to a coding agent — and start designing a small system that discovers work, delegates it, checks it, remembers progress, and repeats. The leverage moves from prompt craft to loop design: six primitives that now ship inside tools like Claude Code and the Codex app instead of bespoke bash you maintain forever.
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
subgraph Stack["Three layers"]
H[Harness engineering] --> L[Loop engineering]
L --> O[Orchestration layer]
end
H -->|one agent runtime| T[Tools memory sandbox]
L -->|schedule + verify| P[Six primitives]
O -->|fleet + PR lifecycle| R[Reactions state machine]
classDef agent fill:#8B0000,color:#fff
classDef hook fill:#189AB4,color:#fff
classDef decision fill:#444,color:#fff
class H,L,O agent
class T,P,R hook
Where the conversation landed in 2026
The shift is no longer niche. Boris Cherny, who leads Claude Code at Anthropic, described it on the Acquired podcast as: “I don’t prompt Claude anymore. I have loops running that prompt Claude and figure out what to do. My job is to write loops.” Peter Steinberger put the same idea on X: “You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.” Both are saying the human job moved up one floor — from typing each turn to designing feedback systems.
That floor has three names in practice. Harness engineering is the runtime around one agent (tools, memory, permissions). Loop engineering is the harness that runs on a schedule, spawns helpers, and feeds itself from disk. Orchestration is the layer above when you need fleets of agents across worktrees, PRs, and CI — with automatic routing of failures back to the right session.
The universal five-stage cycle
Every serious loop — single agent or fleet — runs the same cycle until a verifiable stop condition holds.
| Stage | What happens | Typical tooling |
| --- | --- | --- |
| Discover | Find work: CI failures, issues, diffs, inbox | Automations, /loop, triage skills |
| Plan | Break goal into steps with constraints | Skills, VISION.md, spec sub-agent |
| Execute | Edit code, run tools, open PRs | Worktrees, MCP connectors |
| Verify | Push against objective signals, not model opinion | Tests, lint, /goal evaluator, critic sub-agent |
| Iterate | Fix gaps and loop again | Stop hooks, reactions, state file |
A prompt gives instructions for one turn. A loop gives a job: discover → plan → execute → verify → iterate until done. You set the goal; the loop runs itself.
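The cycle can be written down as a small driver. This is a sketch of the shape only: the four callables stand in for whatever agent calls and test commands a real loop would use, and the iteration cap is an added assumption so the loop stays bounded.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    done: bool
    iterations: int
    log: list = field(default_factory=list)


def run_loop(
    discover: Callable[[], list],
    plan: Callable[[str], list],
    execute: Callable[[str], None],
    verify: Callable[[], bool],
    max_iterations: int = 5,
) -> LoopResult:
    """Repeat discover, plan, execute, verify until the gate passes or the cap is hit."""
    result = LoopResult(done=False, iterations=0)
    for i in range(1, max_iterations + 1):
        result.iterations = i
        for item in discover():
            for step in plan(item):
                execute(step)
                result.log.append(step)
        if verify():  # objective signal, e.g. the test suite's exit code
            result.done = True
            break
    return result
```

Note that `verify` takes no input from the executing agent: the stop condition is read from the environment, not from the agent's own report.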
Open loops vs closed loops
| | Open loop | Closed loop |
| --- | --- | --- |
| Nature | Exploratory; wide search space | Bounded path you designed |
| Risk | Token burn; “slop machine” without gates | Cheaper; predictable |
| Needs | Large budget + strong evaluators | Clear goal, defined steps, stop condition |
| Start here? | Research spikes, benchmarks | Production coding, triage, migrations |
Closed loops need five ingredients on disk: goal (precise done), context (VISION.md, ARCHITECTURE.md, RULES.md), action (scoped tools), feedback (tests, lint, structured errors), and a stop condition (/goal text, Stop hook, or orchestrator brief). Without a quality gate, AI drifts; with one, it improves.
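A loop launcher can refuse to start when ingredients are missing. The pre-flight check below is a sketch: it covers only the context files, goal, and stop condition named above, and treating all three context files as mandatory is an assumption rather than a rule from any tool.

```python
from pathlib import Path

# Context files named in this article; treating all three as required is an assumption.
REQUIRED_CONTEXT = ("VISION.md", "ARCHITECTURE.md", "RULES.md")


def missing_ingredients(repo: Path, goal: str, stop_condition: str) -> list:
    """List what a closed loop still lacks before it should be allowed to start."""
    missing = [name for name in REQUIRED_CONTEXT if not (repo / name).is_file()]
    if not goal.strip():
        missing.append("goal")
    if not stop_condition.strip():
        missing.append("stop condition")
    return missing
```

An empty list means the loop may start; anything else is a message for the human who owns the rules.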
Single-agent loop vs fleet loop
| | Single-agent loop | Fleet loop |
| --- | --- | --- |
| Shape | One brain runs discover→verify end-to-end | Orchestrator splits work across specialists |
| Good for | Focused refactors, /goal migrations | Large features, parallel PRs, research→build→QA chains |
| Token profile | ~50K–200K tokens per medium coding task | ~500K–2M+ when orchestrator + 3+ specialists run |
| Example split | Explore → implement → verify sub-agents | Research specialist → engineering specialist → QA specialist, each with its own loop |
What changed in agentic development
For roughly two years, “good AI coding” meant writing strong prompts and feeding enough context each turn. You typed, read, typed again — the agent was a power tool and you held the handle every step.
Loop engineering is the next layer: a recursive goal where you define purpose and done, and the system iterates until a verifiable condition holds. You design once; the loop pokes agents on a schedule or across turns. This sits one floor above agent harness engineering (the environment one agent runs in) and the factory model (the system that builds software) — same family of ideas, but the harness now runs on a timer, spawns helpers, and feeds itself from disk-based memory.
The six primitives every loop needs
Five action primitives plus persistent state — the shape is the same across major coding-agent products.
| # | Primitive | Job in the loop | Without it |
| --- | --- | --- | --- |
| 1 | Automations | Scheduled discovery and triage | You manually check CI, issues, and diffs |
| 2 | Worktrees | Isolate parallel agent checkouts | Two agents overwrite the same files |
| 3 | Skills | Project knowledge on disk (SKILL.md) | Agent re-guesses conventions every run |
| 4 | Connectors (MCP) | Issues, DB, Slack, staging APIs | Agent only sees the filesystem |
| 5 | Sub-agents | Separate maker and checker roles | One model grades its own homework |
| 6 | State / memory | Markdown, Linear board, AGENTS.md | Model forgets between runs; loop restarts blind |
The agent forgets; the repo does not. Long-running loops depend on external state — not context window — to remember what was tried, what passed, and what is next. Common context files beyond SKILL.md: VISION.md (what success looks like), ARCHITECTURE.md (stack and layout), RULES.md (forbidden actions), GUARDRAILS.md (always-on checklists), and AGENTS.md (repo map for agents).
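External state can be as simple as an append-only Markdown file. The sketch below assumes a hypothetical one-line-per-attempt format (`- date | task | outcome`); real loops may prefer an issue tracker or a richer schema.

```python
from datetime import date
from pathlib import Path


def record_attempt(state_file: Path, task: str, outcome: str) -> None:
    """Append one dated line so the next run can see what was already tried."""
    with state_file.open("a", encoding="utf-8") as fh:
        fh.write(f"- {date.today().isoformat()} | {task} | {outcome}\n")


def open_tasks(state_file: Path) -> list:
    """Return tasks whose most recent recorded outcome is not 'passed'."""
    latest = {}
    if state_file.exists():
        for line in state_file.read_text(encoding="utf-8").splitlines():
            parts = [p.strip() for p in line.lstrip("- ").split("|")]
            if len(parts) == 3:  # skip headings and free-form notes
                latest[parts[1]] = parts[2]
    return [task for task, outcome in latest.items() if outcome != "passed"]
```

Because later lines override earlier ones for the same task, the file doubles as a history of what was tried; task names containing `|` would need escaping in this format.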
Codex app vs Claude Code — same shape, different names
| Primitive | Codex app | Claude Code |
| --- | --- | --- |
| Automations | Automations tab: project, prompt, cadence, local or worktree env; Triage inbox; thread vs standalone runs | |
Once you see the shared shape, the debate shifts from “which tool” to “which loop design still works in either seat.”
1. Automations — the heartbeat
Automations turn a one-off agent run into a loop. In the Codex app you configure project, prompt, schedule, and environment (local checkout or background worktree). Runs with findings land in a Triage inbox; empty runs archive themselves. Internal uses include daily issue triage, CI failure summaries, commit briefings, and regression hunts. Automations can call $skill-name so recurring logic stays maintainable.
Claude Code reaches the same outcome via /loop (interval reruns), cron scheduling, lifecycle hooks, Desktop scheduled tasks (persistent while app is open), Cloud Routines (runs when laptop is closed), or GitHub Actions for headless runs.
Interactive pick: /goal vs /loop vs Stop hooks
| Mechanism | Next turn starts when… | Stops when… | Best for |
| --- | --- | --- | --- |
| /goal (Claude) | Previous turn finishes | Separate evaluator model confirms condition (reads transcript only) | Migrations, refactors, “all tests green” |
| /goal (Codex) | Thread idle after turn | Evidence in thread supports completion; pause/resume/clear/budget | |
# Claude Code — run until tests and lint are clean (v2.1.139+)
/goal all tests in test/auth pass and the lint step is clean
# Check spend and evaluator reasoning
/goal
# Stop early
/goal clear
# Headless single invocation
claude -p "/goal CHANGELOG.md has an entry for every PR merged this week"
# Codex — long-running performance goal (cookbook pattern)
/goal Reduce p95 checkout latency below 120 ms, verified by the checkout benchmark,
while keeping the correctness suite green. If blocked, stop with evidence.
/goal on Claude Code starts a turn immediately; after each turn Haiku (by default) judges yes/no from the transcript — it does not run tools. Codex /goal is thread-scoped with explicit budget accounting and pause/resume. Pair either with auto mode so each turn skips per-tool confirmations.
2. Worktrees — parallel without collisions
Two agents editing the same file is the same failure mode as two engineers on one branch without coordination. A git worktree is a separate working directory on its own branch, sharing history but not files. Codex threads use worktrees natively; Claude Code offers --worktree sessions and isolation: worktree on subagents that clean up after themselves.
Worktrees remove mechanical collision; your review bandwidth still caps how many parallel agents you can actually supervise.
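Outside the built-in tool support, the same isolation is available from plain git. A minimal sketch: the `agent/<name>` branch prefix and the sibling-directory layout are naming assumptions, while `git worktree add -b` is standard git.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, agent: str) -> list:
    """Build the git command that gives one agent its own branch and directory."""
    branch = f"agent/{agent}"                    # hypothetical naming scheme
    path = repo.parent / f"{repo.name}-{agent}"  # sibling directory per agent
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]


def create_worktree(repo: Path, agent: str) -> None:
    """Create the worktree; raises CalledProcessError if git refuses."""
    subprocess.run(worktree_command(repo, agent), check=True)
```

When the agent's branch is merged or abandoned, `git worktree remove <path>` cleans up the checkout.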
3. Skills — stop paying intent debt every session
Agents start cold. Every missing convention becomes a confident guess — intent debt. A skill is intent written outside the chat: a folder with SKILL.md, optional scripts, references, and assets. Both Codex and Claude Code load skills when you invoke $name or when the task matches a tight, boring description (clever descriptions match too often).
Skill vs plugin: the skill is the authoring format; a plugin bundles skills and connectors for teammates to install once.
4. Connectors — act in your real environment
MCP connectors let the loop read Linear/Jira, query databases, hit staging APIs, and post to Slack. That is the difference between “here is the fix” and “open the PR, link the ticket, ping the channel when CI is green.” Plugins package connectors with skills so onboarding is one install, not tribal memory.
Feedback signals that keep loops honest
A loop with nothing to push against is just the agent agreeing with itself — layer deterministic, perceptual, and critic signals.
| Signal type | Examples | Strength |
| --- | --- | --- |
| Deterministic oracles | CI, unit tests, type checks, linters, git diff, scalar metrics (e.g. benchmark p95) | Strongest: pass/fail without model judgment |
| Perceptual / visual | Playwright, browser MCP tools, layout screenshots | Medium: catches UI regressions code tests miss |
| Critic sub-agents | Separate reviewer agent; forces retry or stop | Medium: judgment, but not the worker context |
| Persistent context | GUARDRAILS.md, skills, checklists loaded every run | Always-on oracle |
| LLM self-critique only | “Does this look good?” from same model | Weakest: rationalises its own mistakes |
Strongest systems stack multiple signal types: deterministic for reliability, visual/critic for judgment, human gates on high-stakes merges. Signals must route back automatically — full logs, diffs, scores — without you copy-pasting CI output each turn.
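Routing signals back automatically starts with capturing them in a structured form. The sketch below runs a set of deterministic gate commands and keeps each one's full output; the gate names and commands are supplied by the caller, and nothing here is specific to any agent product.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str
    passed: bool
    output: str  # full stdout + stderr, ready to hand back to the agent


def run_gates(gates: dict) -> list:
    """Run each gate command and record pass/fail from its exit code."""
    results = []
    for name, cmd in gates.items():
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results.append(
            GateResult(name, proc.returncode == 0, proc.stdout + proc.stderr)
        )
    return results


def all_green(results: list) -> bool:
    """The loop's stop condition: every deterministic gate passed."""
    return all(r.passed for r in results)
```

A caller might pass something like `{"tests": ["pytest", "-q"], "lint": ["ruff", "check", "."]}`, assuming those tools are installed; failed gates' `output` fields are what gets injected into the next turn.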
5. Sub-agents — maker vs checker
The highest-leverage split: implement in one agent, verify in another — including /goal’s separate done-evaluator.
The model that wrote the code is too lenient grading itself. A second agent — different instructions, sometimes a different model — catches rationalised mistakes. Typical trio: explore, implement, verify against spec. In fleet setups, a validator agent reports truth without fixing — failures loop back to the builder.
# Codex — custom subagent (simplified .codex/agents/security-reviewer.toml)
name = "security-reviewer"
description = "Read-only security pass on diffs"
instructions = "Find auth, injection, and secret-leak risks. No edits."
model = "strong"
reasoning_effort = "high"
Sub-agents cost extra tokens (each runs its own model + tools). Spend them where a second opinion unlocks unattended runs — the only reason you can walk away from a loop.
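The maker/checker split reduces to a short retry loop. In this sketch both roles are plain callables standing in for agent invocations; the key property is that the checker receives only the draft, never the maker's reasoning trail.

```python
from typing import Callable, Optional


def make_and_check(
    maker: Callable[[str, Optional[str]], str],
    checker: Callable[[str], Optional[str]],
    task: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Return an approved draft, or None if the checker still objects after max_rounds.

    The checker returns None to approve, or a short objection that is fed
    back to the maker on the next round.
    """
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        draft = maker(task, feedback)
        feedback = checker(draft)
        if feedback is None:
            return draft
    return None  # escalate to a human instead of shipping an unapproved draft
```

The round cap bounds the extra token spend, and a `None` result is the natural hook for a human inbox.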
Orchestration — when one loop is not enough
Single-session /goal loops solve “finish this migration without me re-prompting.” Fleet-scale work needs an orchestration layer: deterministic plumbing plus an orchestrator agent for judgment.
| Layer | Job | Examples |
| --- | --- | --- |
| Deterministic plumbing | Route environmental feedback automatically | CI fail → inject logs into worker session; PR conflict → notify right agent; lifecycle state machine (working → ci_failed → review_pending → merged) |
| Orchestrator agent | Decompose goals, write briefs, batch parallel work | Research agent → spec → tracking issue → N workers in isolated worktrees |
| Human gates | Vision, acceptance, high-risk merges | Triage inbox, PR approval; optimise human time, not remove humans |
Open-source reference implementations like Agent Orchestrator (npm install -g @aoagents/ao) ship reactions engines, worktree isolation, and orchestrator prompts out of the box. The pattern: inner agents execute in bounded loops; outer orchestrator coordinates; environmental signals keep loops honest; you stay on vision and judgment.
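The lifecycle state machine mentioned under deterministic plumbing can be sketched as a transition table. The four state names come from this article; which transitions are legal between them is an assumption made for illustration, not taken from any orchestrator's source.

```python
# Allowed next states per current state (assumed for illustration).
ALLOWED = {
    "working": {"ci_failed", "review_pending"},
    "ci_failed": {"working"},                  # logs injected, worker retries
    "review_pending": {"working", "merged"},   # changes requested, or approved
    "merged": set(),                           # terminal
}


def advance(state: str, next_state: str) -> str:
    """Move to next_state, refusing transitions the table does not list."""
    if next_state not in ALLOWED[state]:
        raise ValueError(f"illegal transition {state} -> {next_state}")
    return next_state
```

Keeping this table in deterministic code, not in a prompt, means an agent cannot talk its way from `ci_failed` straight to `merged`.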
Walkthrough: one morning triage loop
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
sequenceDiagram
participant Auto as Morning automation
participant Skill as Triage skill
participant State as STATE.md
participant WT as Worktree
participant Maker as Fix sub-agent
participant Check as Review sub-agent
participant MCP as Connectors
Auto->>Skill: Run on schedule
Skill->>State: Write CI failures + issues
loop Each actionable item
Auto->>WT: Open isolated checkout
WT->>Maker: Draft fix
Maker->>Check: Submit diff
Check-->>Maker: Approve or reject
Maker->>MCP: Open PR + update ticket
end
Auto->>State: Log done / blocked for human inbox
Findings — Written to STATE.md or a Linear board (memory outside the chat).
Per item — New worktree → maker sub-agent drafts fix → checker sub-agent runs against project skills + tests.
Ship — Connectors open PR and update tickets; blocked items land in your inbox.
Tomorrow — State file tells the loop what was tried, passed, or still open.
You designed this once. You did not prompt each step — that is the whole point.
Prompt engineer vs loop engineer
| Prompt engineer | Loop engineer |
| --- | --- |
| Crafts better instructions per turn | Designs feedback cycles and stop conditions |
| Linguistic skill | Systems / software engineering skill |
| Better single output | Reliable verified outcomes across runs |
| You review manually each time | System self-corrects against oracles |
| You are the feedback loop | The loop is the feedback loop |
| “Write me a function” | “Write → test → fix until green” |
Self-check: is your loop healthy?
| Question | Healthy loop | Leaky loop |
| --- | --- | --- |
| What proves “done”? | Tests, lint, measurable condition in /goal | Agent says “looks good” |
| Where does memory live? | Repo file or issue tracker | Only in chat context |
| Who verifies? | Separate sub-agent or evaluator model | Same agent that wrote code |
| What pushes back? | Layered oracles (CI + critic + human gate) | Self-critique only |
| Parallelism? | One worktree per agent | Shared checkout |
| Token budget? | Turn cap in condition or manual clear | Open-ended overnight /goal |
| Your role? | Review merged outcomes you understand | Press go and hope |
What loops do not remove — three sharper risks
Verification stays human
An unattended loop is also an unattended mistake machine. Even with a verifier sub-agent, “done” is a claim, not proof. Ship code you confirmed works — especially when diff sizes balloon because agents touch more files than necessary.
Comprehension debt accelerates
The faster the loop ships code you did not write, the wider the gap between what exists and what you understand. Read the reasoning, skim the diff, trace the decision log — or the loop makes the debt grow faster, not slower.
Cognitive surrender
When automation feels smooth, it is tempting to stop having opinions. Loop design with judgement keeps you the engineer; loop design to avoid thinking is the same UI with opposite outcomes. Two teams can run identical loops — one moves faster on work they deeply understand; the other outsources understanding entirely. The loop cannot tell the difference. You can.
Parallel pattern: scheduled content factories
The same week loop engineering went mainstream for coding, creators published parallel “factory” playbooks for media. @0x_fokki’s X Article I Built an AI Animation Factory That Runs 24/7 is not a coding-agent harness — Claude is used as a scriptwriter, not a repo editor — but it shows the same design move: stop hand-driving each step, design a pipeline that runs on a schedule with human approval gates.
Same loop instinct in two domains — you design the system and the gates, not every intermediate prompt.
Fokki’s pipeline chains six tools end-to-end:
Claude → Midjourney → Runway → ElevenLabs → Suno → Make
script → frames → motion → voice → music → publish
One Make scenario runs Monday and Thursday at 08:00: pull scripts from Google Drive, batch Midjourney scene prompts, download frames, send dialogue to ElevenLabs, pair images with Runway motion clips, assemble in a CapCut template, upload to YouTube with generated metadata, clip a 30-second X preview, post Patreon early access, and ping Telegram on completion. A separate on-demand webhook turns client briefs into finished explainers in shared Drive — quoted turnaround ~6 hours after a one-time ~5-hour setup.
Four SKUs share the pipeline: animated story series (6–10 min), brand explainers (60–90 sec), motion comics, and children’s bedtime channels. The human job is narrow: pick the story, pick the style, approve the output — roughly four hours of direction for a “24/7” factory, per the author.
| Loop-engineering primitive | Fokki factory analogue | Key difference |
| --- | --- | --- |
| Automations | Make.com schedule + webhook | No /goal or hooks; cron-style triggers only |
| Skills / context on disk | Reusable Midjourney character sheets, CapCut templates, voice cast notes | Creative consistency prompts, not SKILL.md |
| Sub-agent split | Tool specialization per stage (script vs frames vs motion) | No verifier sub-agent; human approves final cut |
| Connectors | Drive, YouTube, Patreon, Telegram APIs | Distribution stack, not MCP issue trackers |
| Feedback signal | Views, RPM, client acceptance | Business metrics, not CI, lint, or test gates |
| State / memory | Organised Drive folders per episode | Asset library, not AGENTS.md |
What transfers to coding loops
Scheduled heartbeat — the factory does not wait for you to open a chat; neither should triage or CI-repair loops.
Stage-specialised tools — one brain trying to script, illustrate, animate, and score is the creative version of one agent grading its own code.
Performance direction in prompts — Fokki writes ElevenLabs stage direction (pauses, volume drops), not raw dialogue paste; coding loops need equally explicit done conditions in /goal text.
Human gate on output — “approve the episode” maps to Triage inbox review and PR merge — optimise human time, do not remove judgment.
Setup once, run indefinitely — the Make scenario is the media equivalent of wiring automations + skills once, then letting the loop compound.
Treat revenue figures in social factory posts as illustrative, not audited benchmarks. The architectural lesson is stable: factories — code or content — are designed loops with explicit stages, schedules, and gates. Coding loop engineering just demands harder oracles (tests, type checks, diffs) because “shipped” is easier to fake than “sounds convincing.”
Token economics and balance
| Pattern | Approximate token load | Mitigation |
| --- | --- | --- |
| Single-agent medium coding loop | 50K–200K per run | Turn caps in /goal; cheaper model for explore/review |
| Fleet (orchestrator + 3 specialists) | 500K–2M+ per cycle | Batch only parallelisable work; stuck detection |
| Scheduled daily automation | Millions per week if always-on | Archive empty runs; scope skills tightly |
| Sub-agents + /goal evaluator | Multiplicative per child session | Spend sub-agents on high-risk paths only |
Loops are not free — patterns diverge wildly if you are “token rich” vs “token poor.” Direct prompting still matters for ambiguity and architecture. Loops handle repetition; you handle judgement. The leverage point moved — it did not disappear.
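To make the “token rich” versus “token poor” split concrete, the per-run ranges quoted for the single-agent loop and the fleet can be turned into a rough weekly estimate. This is a back-of-envelope sketch: the run counts and the weekly budget are hypothetical inputs, and the fleet's upper bound is open-ended in the source (“2M+”), so the real worst case may be higher.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LoopPattern:
    name: str
    low_tokens: int   # lower bound per run, from the figures above
    high_tokens: int  # upper bound per run, from the figures above


SINGLE_AGENT = LoopPattern("single-agent medium coding loop", 50_000, 200_000)
FLEET = LoopPattern("fleet (orchestrator + 3 specialists)", 500_000, 2_000_000)


def weekly_range(pattern: LoopPattern, runs_per_week: int) -> Tuple[int, int]:
    """Scale the per-run token range to a weekly (low, high) range."""
    return (pattern.low_tokens * runs_per_week,
            pattern.high_tokens * runs_per_week)


def fits_budget(pattern: LoopPattern, runs_per_week: int,
                weekly_budget: int) -> bool:
    """Conservative check: the worst case must fit inside the budget."""
    return weekly_range(pattern, runs_per_week)[1] <= weekly_budget
```

Checking the worst case rather than the average is a deliberate choice here: a loop that only fits the budget on a good week will stall on a bad one.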
Performance summary
| Dimension | Prompt era | Loop era |
| --- | --- | --- |
| Your job | Write each turn | Design discover → plan → execute → verify → remember |
| Core cycle | Ask → answer | Five stages until verifiable done |
| Primitives | Context + prompt | 6 shared building blocks (both major tools) |
| Done signal | You decide to stop | /goal evaluator, Stop hook, or environmental oracles |
| Scale | One thread | Worktrees + sub-agents + orchestration layer |
| Feedback | Your eyes | Layered oracles, not self-critique alone |
| Knowledge | Re-explained each session | Skills + VISION.md / AGENTS.md compound |
| Risk profile | Slower, more oversight | Faster, higher verification + comprehension debt |
| Bottom line | - | Build the loop; stay the engineer who reviews what ships |
Research supplement
The following documentation pages from the official Claude Code docs provide additional technical depth beyond the article's reference links:
Scheduled Tasks (/loop): The Scheduled Tasks reference details how /loop works alongside cloud Routines and Desktop scheduled tasks, including the full comparison table of scheduling options, jitter behaviour, seven-day expiry, and the loop.md customisation mechanism. Notably, dynamic /loop schedules can use the Monitor tool internally to stream background process output, avoiding polling entirely.
Agent Loop Architecture: The Agent SDK page “How the agent loop works” documents the full turn-and-message lifecycle, context window management, automatic compaction, and how the max_turns / maxBudgetUsd bounds apply. It also explains that subagents start with a fresh conversation context, which has direct implications for keeping loop context efficient over long runs.
Key technical detail not in the primary reference links: The /goal command is implemented as a session-scoped prompt-based Stop hook. This means developers who need evaluation logic beyond a short text condition (for example, running an actual script to verify state) can write a custom Stop hook instead — which gives them the same turn-by-turn evaluation model with full scripting power.
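A custom Stop hook of that kind might look like the sketch below. It assumes the contract described in the Claude Code hooks documentation: a JSON payload on stdin that includes a `stop_hook_active` flag, and a JSON reply containing `"decision": "block"` plus a `reason` when the agent should keep working. Check those field names against the current docs before relying on them. `CHECK_CMD` and the `--hook` flag are hypothetical choices for this example.

```python
import json
import subprocess
import sys
from typing import List, Optional

# Hypothetical hard gate; replace with the project's real check command.
CHECK_CMD = ["pytest", "-q"]


def decide(payload: dict, check_cmd: List[str]) -> Optional[dict]:
    """Return a blocking decision if the gate fails, else None (allow stop)."""
    # Simple guard against endless loops: if this stop attempt already follows
    # a hook-forced continuation, let the session end. A stricter loop would
    # track attempts on disk instead of allowing only one retry.
    if payload.get("stop_hook_active"):
        return None
    result = subprocess.run(check_cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return None  # gate passed, stopping is allowed
    tail = (result.stdout + result.stderr)[-2000:]  # keep the reason short
    return {"decision": "block", "reason": "Gate failed; keep working.\n" + tail}


def main() -> int:
    payload = json.load(sys.stdin)
    decision = decide(payload, CHECK_CMD)
    if decision is not None:
        print(json.dumps(decision))
    return 0


# The --hook flag is an assumed convention so that importing or testing this
# file never blocks on stdin; the hook command would pass it explicitly.
if __name__ == "__main__" and sys.argv[1:] == ["--hook"]:
    sys.exit(main())
```

The point of the design is that the done signal comes from a process exit code, not from the agent's own claim that the work is finished.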
Anthropic doubled Claude Cowork’s five-hour session rate limits for Pro, Max, and Team subscribers from 5 June through 5 July 2026, leaving weekly caps and the shared quota across Claude products unchanged.
| Field | Detail |
| --- | --- |
| Date | Announced 5 June 2026; promotion through 5 July 2026 |
| Vendor | Anthropic |
| Product | Claude Cowork (desktop knowledge-work agent) |
| Availability | Claude Pro, Max, and Team paid plans; Cowork only, not Claude Code or chat-specific boosts |
| Pricing / limits | 2× five-hour rolling session allowance; weekly usage cap static; quota shared with Claude.ai and Claude Code |
What changed
Boris Cherny, who leads Claude Code at Anthropic, announced the promotion on 5 June 2026 via social post—no dedicated article appeared on the Anthropic newsroom index by 9 June 2026.
Claude Cowork five-hour rolling session limits are doubled for approximately one month, ending 5 July 2026.
Eligible plans: Claude Pro, Claude Max, and Claude Team.
The change applies to five-hour rate-limit windows only—Anthropic’s weekly usage cap is unchanged.
Claude Code and Claude.ai retain standard session limits; the promotion is Cowork-specific.
Subscription quota remains a shared pool across Claude surfaces—heavier Cowork bursts can still exhaust the weekly budget faster.
Why it matters for engineers
Anthropic meters paid plans with two leaky buckets: a five-hour rolling session window for burst fairness and a weekly cap for cost control. Doubling only the first bucket optimises long desktop agent runs—folder reorganisation, batch report generation, scheduled digests—without raising Anthropic’s weekly compute exposure. Teams scheduling Cowork jobs should treat the promotion as session headroom, not unlimited capacity.
Cowork is not the Claude API. It runs in the desktop app with filesystem and Office integration, autonomous loops, and user approval gates, which makes it ideal for knowledge-worker delegation but unsuitable for production services. Engineers should keep CI and production agents on API metering and reserve Cowork pilots inside the promo window for the deferred “messy folder” projects Cherny highlighted.
Unified quota across Cowork, Claude Code, and web chat means platform leads need allocation policy. A seat running heavy Code sessions the same week as a doubled Cowork migration may hit the unchanged weekly ceiling before the session window resets. Monitor Settings → Usage for both progress bars before kicking off multi-hour agent tasks.
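A toy model shows why doubling only the session bucket moves the bottleneck to the weekly cap. This sketch is a deliberate simplification: it treats the five-hour window as a fixed window that resets, uses arbitrary units, and every number is hypothetical. Anthropic does not publish the accounting in this form.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    session_limit: int  # allowance inside one five-hour window (arbitrary units)
    weekly_limit: int   # weekly cap shared across Cowork, Code, and chat
    session_used: int = 0
    weekly_used: int = 0

    def try_spend(self, units: int) -> bool:
        """Spend only if both the session and the weekly bucket have room."""
        if self.session_used + units > self.session_limit:
            return False
        if self.weekly_used + units > self.weekly_limit:
            return False
        self.session_used += units
        self.weekly_used += units
        return True

    def reset_session(self) -> None:
        """Start a new five-hour window; the weekly total is untouched."""
        self.session_used = 0


def full_windows_before_cap(session_limit: int, weekly_limit: int) -> int:
    """Count the windows a seat can max out before the weekly cap binds."""
    quota = Quota(session_limit, weekly_limit)
    windows = 0
    while quota.try_spend(session_limit):
        windows += 1
        quota.reset_session()
    return windows
```

With a hypothetical weekly cap of 50 units, a seat that maxes out a 10-unit session gets five full windows, while the same seat on a doubled 20-unit session gets only two before the weekly ceiling stops it.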
Enterprise admins already manage Cowork feature access and org spend caps separately from consumer tiers. Communicate the 5 July revert date so programme managers do not assume permanent 2× session limits in capacity plans.
Anthropic doubled the five-hour Cowork usage bucket for eligible paid plans from 5 June through 5 July 2026 whilst leaving weekly caps unchanged.
Limit windows over the promotion
flowchart TB
START["5 Jun 2026 promo starts"]
SESSION["Five-hour rolling window resets continuously"]
DOUBLE["Cowork session allowance 2x"]
WEEKLY["Weekly cap unchanged"]
SHARED["Shared pool: Cowork chat and Code"]
END["5 Jul 2026 promo ends"]
START --> DOUBLE
DOUBLE --> SESSION
SESSION --> SHARED
SHARED --> WEEKLY
WEEKLY --> END
classDef agent fill:#8B0000,color:#fff
classDef tool fill:#189AB4,color:#fff
class DOUBLE agent
class WEEKLY tool
Timeline view: session windows roll continuously and temporarily widen for Cowork; the weekly ceiling and cross-product pool stay fixed.
Research supplement
The sections above draw exclusively on the article text and the three reference URLs supplied (claude.com/product/cowork, support.anthropic.com/en/articles/9797557-usage-limit-best-practices, claude.com/pricing).