At WWDC26 on 8 June 2026, Apple previewed Siri AI and the next generation of Apple Intelligence on iOS 27, iPadOS 27, and macOS 27—powered by Apple Foundation Models built with Google Gemini and split across on-device Apple silicon and Private Cloud Compute.
| Field | Detail |
| --- | --- |
| Date | Announced 8 June 2026 (WWDC26) |
| Vendor | Apple |
| Products | Siri AI; Apple Intelligence across iOS/iPadOS/macOS/watchOS/visionOS 27 |
| Model stack | Apple Foundation Models (Gemini collaboration); on-device + Private Cloud Compute |
| Developer frameworks | Foundation Models framework (Swift, on-device + PCC + third-party LLMs); Core AI (custom PyTorch on Apple silicon); App Intents for Siri actions |
| Availability | Developer Program beta 8 June 2026 (iOS/iPadOS/macOS/visionOS); watchOS Siri beta later; public beta the following month; user Siri beta English-first later in 2026; GA fall 2026 |
| Advanced on-device tier | iPhone Air, iPhone 17 Pro/Max, iPad (M4)+ with ≥12GB RAM, Mac (M3)+ with ≥12GB, Vision Pro (M5); required for expressive voices and advanced dictation |
| Pricing / limits | Server-model features (e.g. photorealistic Image Playground) carry daily usage caps; expanded access on most iCloud+ plans (numeric quotas not published); compatible Home cameras included on qualifying iCloud+ tiers |
| Regional gates | EU: Siri AI on Mac and Vision Pro initially, not iOS/iPadOS/watchOS; China: unavailable pending regulatory work; Apple Intelligence supports 17 languages |
What changed
Siri AI replaces the legacy assistant with personal-context search across Messages, Mail, and Photos; on-screen and Camera-mode awareness; expanded systemwide app actions; web-grounded answers; and a dedicated Siri app with iCloud-private conversation sync across iPhone, iPad, Mac, Watch, and Vision Pro.
Invocation surfaces expand beyond “Hey Siri” to Dynamic Island swipe (iPhone), Spotlight (iPad/Mac), control-click context menus, and Vision Pro look-to-speak with 3D visualisation.
On-device plumbing includes a system orchestrator, Spotlight index, and App Toolbox that keep personal-context processing local before escalating frontier workloads.
Apple Foundation Models are custom-built in collaboration with Google Gemini for deeply integrated experiences; according to Apple’s Apple Intelligence announcement, they are not exposed to consumers as a raw Gemini API.
Hybrid execution runs models on device and on Private Cloud Compute; PCC retains Apple’s no-storage privacy promise with ongoing external verification.
Image Playground adds photorealistic generation on PCC with hidden SynthID watermarks; Photos gains Spatial Reframing and other on-device intelligence features.
Developer betas for Siri AI ship 8 June 2026 on iOS, iPadOS, macOS, and visionOS; watchOS follows in a future beta.
Developer integration surface
Foundation Models framework (Swift) is the primary LLM integration path: on-device sessions, Private Cloud Compute for frontier tasks, tool calling, Dynamic Profiles for multi-model routing, and third-party models via the Language Model protocol (Gemini, Claude, and others). Apple plans to open-source the framework core later in summer 2026. Use it when you want Apple-hosted intelligence inside your app without managing API keys or PCC authentication.
Core AI is a separate stack for deploying custom PyTorch models on Apple silicon—Python conversion tools, ahead-of-time compilation in Xcode, Swift inference APIs, and Core AI debugging instruments. Use Core AI when you bring your own weights; use Foundation Models when you consume Apple’s Foundation Models or attach approved third-party LLM providers.
App Intents and Spotlight integrations extend Siri AI personal context to third-party apps. View Annotations and on-screen-awareness APIs let apps participate in Siri’s screen-context flows without exposing raw screenshots to external model vendors.
Why it matters for engineers
Apple’s WWDC26 stack is a platform inference architecture, not a single model API. Builders should plan for dual execution paths: on-device Foundation Models for latency- and privacy-sensitive personal context, and PCC for frontier workloads (photorealistic image generation, broad world knowledge) with quota limits. This article covers the consumer Siri AI and developer framework launch; it is distinct from Apple’s PCC infrastructure expansion on Google Cloud NVIDIA hardware, which focused on attestation, fleet ledgers, and confidential-GPU hosting rather than Siri UX and App Intents.
Feature-detect against two hardware tiers before shipping voice or dictation features: the base Apple Intelligence list (iPhone 16+, M1+ Mac/iPad) differs from the advanced on-device model tier required for expressive voices and advanced dictation (iPhone Air and the iPhone 17 Pro family, iPad with M4 or later and Mac with M3 or later with ≥12GB of unified memory, and Vision Pro with M5).
Server-model daily caps and iCloud+ entitlements mean client apps must degrade gracefully when users exhaust allotments—Apple has not published numeric quotas, but photorealistic Image Playground and similar PCC-backed features are explicitly rate-limited. Enterprise Mac teams should plan fall GA as a coordinated OS 27 rollout with regional gates: EU iOS/iPadOS Siri AI is deferred whilst Mac and Vision Pro proceed.
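The two hardware tiers and the quota-driven fallback described above can be sketched as a small routing decision. This is a planning aid only, written in Python rather than Swift, and it does not use any Apple API: `Capabilities`, `Route`, and `choose_route` are hypothetical names, and how an app actually learns the device tier or remaining server allotment is left as an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ON_DEVICE = "on-device"
    PRIVATE_CLOUD = "private-cloud-compute"
    DEGRADED = "degraded"  # hide the feature or offer a simpler fallback


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical snapshot of what the app has detected at runtime."""
    base_tier: bool          # device is on the base Apple Intelligence list
    advanced_tier: bool      # device meets the advanced on-device model tier
    server_quota_left: bool  # user still has server-model allotment today


def choose_route(caps: Capabilities, *, needs_advanced_voice: bool,
                 needs_frontier_model: bool) -> Route:
    """Pick an execution path, degrading instead of failing."""
    if not caps.base_tier:
        return Route.DEGRADED
    if needs_advanced_voice and not caps.advanced_tier:
        return Route.DEGRADED
    if needs_frontier_model:
        # Server-backed features are rate-limited, so check the allotment first.
        return Route.PRIVATE_CLOUD if caps.server_quota_left else Route.DEGRADED
    return Route.ON_DEVICE
```

Returning an explicit `DEGRADED` value, rather than raising, keeps the fallback path visible at every call site.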
For teams comparing hyperscaler assistants: Apple exposes no raw Gemini or Claude endpoint. Capabilities arrive through Foundation Models framework sessions and Siri AI system channels—simplifying privacy review but limiting custom prompt engineering relative to direct API integrations.
Personal-context Siri workloads stay on Apple silicon; frontier models run in Private Cloud Compute without storing user prompts.
Intelligence routing at WWDC26
flowchart TB
USER["User or app request"]
LOCAL["On-device Foundation Models"]
PCC["Private Cloud Compute"]
ANS["Response to user"]
USER --> LOCAL
LOCAL -->|"personal context"| ANS
LOCAL -->|"frontier workload"| PCC
PCC --> ANS
Research supplement
PCC architecture and security model: Apple first published technical documentation on Private Cloud Compute at WWDC24 and via its security research blog. Readers seeking the external verification mechanism referenced in this article should consult Apple's current security documentation for any updates since the original 2024 PCC white paper.
SynthID watermarking: SynthID is Google DeepMind's watermarking technology for AI-generated content. Its appearance in Apple's Image Playground outputs is consistent with the Gemini collaboration. DeepMind's public SynthID documentation would clarify the detection and verification process for watermarked outputs.
App Intents and Core AI framework evolution: The Core AI framework reference at developer.apple.com/documentation/coreai is the authoritative current source for developer integration details; readers building for iOS 27 should treat it as primary documentation over any third-party summary.
Anthropic shipped Claude Fable 5 on 9 June 2026—a Mythos-class frontier model for general use with classifier fallbacks to Claude Opus 4.8 on sensitive cyber, biology, and distillation queries—alongside restricted Claude Mythos 5 access for Project Glasswing defenders and separate biology trusted-access programmes.
| Field | Detail |
| --- | --- |
| Date | General availability 9 June 2026 |
| Vendor | Anthropic |
| Products | Claude Fable 5 (GA); Claude Mythos 5 (Glasswing cyber partners only) |
| API model ID | claude-fable-5 (Mythos 5 has no general API ID) |
| Availability | API and consumption-based Enterprise: full access from launch; claude.ai and third-party surfaces; subscription plans staged through 22 June 2026 |
| Subscription inclusion | Included on Pro, Max, Team, and seat-based Enterprise through 22 June 2026; usage credits from 23 June until capacity allows reinclusion |
| Safeguards | Cyber, bio/chem, and distillation classifiers route to Opus 4.8 with user notification; triggers in <5% of sessions on average (>95% run Fable with Mythos-equivalent performance) |
| Data retention | 30-day retention on Mythos-class business traffic (first- and third-party surfaces); not used for training; human access logged |
What changed
Claude Fable 5 is Anthropic’s first generally available Mythos-class model, with state-of-the-art scores on software engineering, knowledge work, vision, and long-horizon agent benchmarks; per the launch post, its lead grows as tasks become longer and more complex.
New safety classifiers extend constitutional-classifier work: cyber (exploitation plus offensive agentic hacking), biology/chemistry (broad fallback during launch), and distillation (large-scale capability extraction) all route flagged prompts to Claude Opus 4.8 instead of refusals.
Claude Mythos 5 shares Fable 5 weights with cyber safeguards lifted for existing Project Glasswing partners upgrading from Mythos Preview; the upgrade brings comparable or stronger performance at substantially lower cost.
Biology trusted access (separate from Mythos 5) will offer Fable 5 with bio/chem classifiers removed but cyber classifiers still active to a small life-sciences cohort—broader enrolment planned as safeguards narrow.
Pricing halved versus Mythos Preview on API and consumption-based Enterprise plans.
30-day retention is required for Mythos-class business traffic to detect novel jailbreaks; data deleted after 30 days with logged human access (Anthropic support article).
Red-team validation: external bug bounty reported no universal jailbreak in 1,000+ hours; zero compliance on harmful single-turn cyber requests across 30 public jailbreak techniques in partner testing.
Subscription rollout is demand-sensitive: included at no extra cost on paid Claude plans through 22 June 2026, then usage credits until capacity stabilises.
Capability evidence for builders
Software engineering: Stripe reported a 50-million-line Ruby migration in one day (versus an estimated two-plus months manually); Cognition’s FrontierCode ranks Fable 5 highest among frontier models at medium effort with improved token efficiency.
Knowledge work: highest score on Hebbia’s Finance Benchmark; IMC reported near-perfect trading-analysis results across factual lookup, root-cause analysis, and expected-value reasoning.
Vision: state-of-the-art on vision tasks; completed Pokémon FireRed vision-only without navigation harnesses that prior Claude models required.
Memory: on Slay the Spire agent runs, file-based memory produced a threefold improvement versus Opus 4.8 and threefold higher final-act completion rates.
Alignment: per the system card, automated assessments place Mythos 5's misaligned behaviour at a level similar to Opus 4.8.
Why it matters for engineers
Teams wiring production agents must treat Fable 5 as a two-model endpoint: more than 95% of sessions never trigger fallback, but cyber-hardening, bioinformatics, or suspicious bulk-extraction patterns transparently downgrade to Opus 4.8 with user notification. Log response metadata and surface fallback events to operators—latency and capability profiles differ, and conservative classifier tuning means benign security research queries can still trip safeguards during the launch window.
The API and consumption-based Enterprise path is the reliable integration surface from day one. Subscription inclusion is time-boxed and demand-sensitive; capacity planning for long autonomous coding runs should prefer metered API tiers. Mythos 5 remains outside general API access—cyber defenders need Glasswing or a future trusted-access application; biology researchers follow the separate Fable-without-bio-classifiers programme.
Long-context and file-backed memory improvements matter for multi-hour agent loops: Fable 5 sustains focus across millions of tokens and benefits disproportionately from persistent notes versus Opus 4.8. Vision-only harnesses now complete screenshot-to-code and scientific-figure extraction tasks that previously required scaffolding.
Regulated workloads must account for 30-day Mythos-class retention on business traffic, logged human access to stored prompts, and explicit prohibition on training use. Benchmark harnesses that resemble distillation attacks may trigger classifiers—design eval pipelines to tolerate Opus 4.8 fallbacks or isolate test traffic from production API keys.
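A minimal sketch of the "log response metadata and surface fallback events" advice follows. It assumes the response payload names the model that actually served the turn under a `model` key, and that a fallback shows up there; neither is confirmed by the article. The ID `claude-opus-4-8` in the usage note is a hypothetical placeholder, and `record_served_model` is an illustrative helper, not part of any SDK.

```python
import logging
from collections import Counter

logger = logging.getLogger("fable_fallback")

# Running tally that operators can export to a dashboard.
fallback_counts: Counter = Counter()

REQUESTED_MODEL = "claude-fable-5"  # API model ID stated in the article


def record_served_model(response: dict, requested: str = REQUESTED_MODEL) -> bool:
    """Return True when a model other than the requested one served the turn.

    Assumes `response["model"]` names the serving model; if the key is
    missing, the turn is counted as served by the requested model.
    """
    served = response.get("model") or requested
    fell_back = served != requested
    fallback_counts["fallback" if fell_back else "primary"] += 1
    if fell_back:
        logger.warning("fallback: requested=%s served=%s", requested, served)
    return fell_back
```

Calling `record_served_model({"model": "claude-opus-4-8"})` would log a warning and increment the `fallback` counter, which gives eval pipelines a simple way to tolerate or filter downgraded turns.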
Most Fable 5 sessions run at full frontier capability; cyber, biology, and distillation classifiers route sensitive prompts to Opus 4.8 instead of blocking.
Classifier fallback in production
flowchart LR
REQ["Agent or app request"]
CLS["Safety classifiers"]
FABLE["Fable 5 response"]
OPUS["Opus 4.8 fallback"]
OUT["Answer delivered"]
REQ --> CLS
CLS -->|"typical workload"| FABLE
CLS -->|"cyber bio distillation"| OPUS
FABLE --> OUT
OPUS --> OUT
Research supplement
The classifier-fallback approach described for Fable 5 relates to the broader AI safety literature on output filtering versus refusal. Anthropic's published safety work (ASL-3 and higher commitments) has flagged cyber and CBRN (chemical, biological, radiological, nuclear) as priority dual-use categories; of the three Fable 5 classifier domains, cyber and bio/chem map directly onto these commitments, while distillation targets large-scale capability extraction. The system card cited in the article (claude-fable-5-mythos-5-system-card) is the primary source for evaluating classifier accuracy claims independently.
Project Glasswing is described at anthropic.com/glasswing as a defenders-focused initiative; the article does not reproduce its full scope. Engineers evaluating Mythos 5 access should consult that page directly for enrolment criteria.
The API model ID (claude-fable-5) and current pricing are listed in Anthropic's models overview at platform.claude.com/docs/en/about-claude/models/overview, which is the authoritative source for integration and should be checked against the article's stated rates before capacity planning.
Google Colab CLI turns Colab from a browser-only notebook into a programmable remote runtime you drive from your terminal — provision a T4 or A100, pipe a local .py file to a Jupyter kernel in the cloud, pull checkpoints back, and tear the VM down, without opening a tab. Google shipped it in June 2026 as an agent-ready bridge between local dev machines and Colab compute.
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph LR
T[Local terminal] -->|colab new / exec| API[Colab assign API]
API -->|runtime proxy token| VM[Remote Colab VM]
VM --> K[Jupyter kernel]
K --> GPU[GPU or TPU]
VM -->|colab download| A[Local artifacts]
API -->|keep-alive 60s| VM
classDef agent fill:#8B0000,color:#fff
classDef hook fill:#189AB4,color:#fff
class T,A agent
class API,VM,K,GPU hook
What problem it solves
Before the CLI, Colab meant: open a notebook in Chrome, click Connect, upload files manually, and babysit the runtime. That breaks down for shell pipelines, CI-style jobs, and coding agents that only speak bash. The CLI exposes the same rented VMs through commands like colab new --gpu T4, colab exec -f train.py, and colab run --gpu T4 train.py — a one-shot provision → execute → teardown path.
Google’s launch post positions it for both humans and agents: any tool with terminal access (Claude Code, Codex, Antigravity, etc.) can provision accelerators, install packages with uv, run local scripts remotely, export replayable .ipynb logs, and download weights — without writing cloud provisioning code yourself.
How the architecture works
| Layer | What it does | Where it lives |
| --- | --- | --- |
| CLI (Typer) | Commands, session names, auth | Your Mac or Linux machine |
| Assign API | Allocate VM, return endpoint + proxy token | colab.research.google.com/tun/m/assign |
| Keep-alive daemon | Ping every 60s; 24h cap | Detached local process per session |
| Jupyter kernel | Execute Python via WebSocket | Remote VM (/content cwd) |
| Contents API | Upload/download/list files | Same VM via Jupyter HTTP |
| Local state | Session metadata, kernel id | ~/.config/colab-cli/sessions.json |
Important detail: colab exec -f script.py reads the file locally and sends source to the kernel — you do not need a separate upload step for execution. Use colab upload / colab download for datasets, checkpoints, and zips.
Install and authenticate
# Recommended
uv tool install google-colab-cli
# Or pip (requires Python 3.13+)
pip install google-colab-cli
# Quick smoke test
colab new
echo "print('Hello from Colab')" | colab exec
colab stop
Two auth layers matter:
CLI → Colab control plane — --auth oauth2 (browser flow, token in ~/.config/colab-cli/token.json) or --auth adc (Application Default Credentials — preferred for agents).
VM → GCP services — colab auth inside a session for BigQuery/GCS; separate from CLI login.
Accelerator access is subscription- and quota-gated. HTTP 400 on colab new --gpu X usually means no entitlement — fall back to T4 or CPU. Unrecognized --gpu values silently map to A100 in the client; spell GPU names exactly.
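Because a misspelled `--gpu` value can silently become an A100, a wrapper script might validate the name before it ever reaches the CLI. The sketch below is an assumption-laden guard: `KNOWN_GPUS` lists only the two names that appear in this article, and `normalise_gpu` is a hypothetical helper, not something the CLI ships.

```python
# Only the accelerator names mentioned in this article; extend as needed.
KNOWN_GPUS = ("T4", "A100")


def normalise_gpu(name: str) -> str:
    """Return the exact spelling of a known GPU name or raise ValueError."""
    candidate = name.strip().upper()
    if candidate not in KNOWN_GPUS:
        raise ValueError(f"unknown GPU {name!r}; expected one of {KNOWN_GPUS}")
    return candidate
```

Failing locally on a typo is cheaper than discovering afterwards that a job ran on a more expensive accelerator than intended.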
Built for coding agents
The CLI ships COLAB_SKILL.md via colab skill — agents get session rules, safe commands, and ADC auth without scraping the README.
Google’s Gemma fine-tuning demo is the canonical agent pattern:
For parallel jobs, isolate state: colab --config /tmp/job-a.json new -s trainer-a. Always name sessions and call colab stop — idle VMs burn compute units even with keep-alive.
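One way to make sure `colab stop` always runs is to wrap the provision, execute, and teardown steps in a `try`/`finally`. The sketch below uses only commands quoted in this article (`colab new --gpu`, `colab exec -f`, `colab stop`); it assumes the CLI signals failure through a non-zero exit code, which is not confirmed here. `run_on_colab` and the injectable `runner` are illustrative names.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def _default_runner(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code without raising."""
    return subprocess.run(list(cmd), check=False).returncode


def run_on_colab(script: str, gpu: str = "T4",
                 runner: Runner = _default_runner) -> int:
    """Provision a session, execute a local script, and always tear down."""
    if runner(["colab", "new", "--gpu", gpu]) != 0:
        raise RuntimeError(f"could not provision a {gpu} session")
    try:
        # The CLI reads the file locally and sends the source to the kernel.
        return runner(["colab", "exec", "-f", script])
    finally:
        # Runs even if exec fails, so an idle VM does not keep burning units.
        runner(["colab", "stop"])
```

Passing the runner as a parameter lets an agent or a test substitute a fake and inspect the exact commands without renting a VM.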
chmod +x script.py && ./script.py provisions a fresh VM, runs the script with forwarded sys.argv, propagates exit codes, and tears down unless --keep is set. CLI status messages go to stderr; script stdout stays clean for piping.
Loop engineering means you stop being the person who types every prompt to a coding agent — and start designing a small system that discovers work, delegates it, checks it, remembers progress, and repeats. The leverage moves from prompt craft to loop design: six primitives that now ship inside tools like Claude Code and the Codex app instead of bespoke bash you maintain forever.
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
subgraph Stack["Three layers"]
H[Harness engineering] --> L[Loop engineering]
L --> O[Orchestration layer]
end
H -->|one agent runtime| T[Tools memory sandbox]
L -->|schedule + verify| P[Six primitives]
O -->|fleet + PR lifecycle| R[Reactions state machine]
classDef agent fill:#8B0000,color:#fff
classDef hook fill:#189AB4,color:#fff
classDef decision fill:#444,color:#fff
class H,L,O agent
class T,P,R hook
Where the conversation landed in 2026
The shift is no longer niche. Boris Cherny, who leads Claude Code at Anthropic, described it on the Acquired podcast as: “I don’t prompt Claude anymore. I have loops running that prompt Claude and figure out what to do. My job is to write loops.” Peter Steinberger put the same idea on X: “You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.” Both are saying the human job moved up one floor — from typing each turn to designing feedback systems.
That floor has three names in practice. Harness engineering is the runtime around one agent (tools, memory, permissions). Loop engineering is the harness that runs on a schedule, spawns helpers, and feeds itself from disk. Orchestration is the layer above when you need fleets of agents across worktrees, PRs, and CI — with automatic routing of failures back to the right session.
The universal five-stage cycle
Every serious loop — single agent or fleet — runs the same cycle until a verifiable stop condition holds.
| Stage | What happens | Typical tooling |
| --- | --- | --- |
| Discover | Find work: CI failures, issues, diffs, inbox | Automations, /loop, triage skills |
| Plan | Break goal into steps with constraints | Skills, VISION.md, spec sub-agent |
| Execute | Edit code, run tools, open PRs | Worktrees, MCP connectors |
| Verify | Push against objective signals, not model opinion | Tests, lint, /goal evaluator, critic sub-agent |
| Iterate | Fix gaps and loop again | Stop hooks, reactions, state file |
A prompt gives instructions for one turn. A loop gives a job: discover → plan → execute → verify → iterate until done. You set the goal; the loop runs itself.
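The cycle can be sketched as a plain function that takes each stage as a callable and stops on either a passing verification or an iteration cap. This is a toy skeleton under stated assumptions, not how Claude Code or Codex implement their loops: `run_loop`, `LoopReport`, and the stage signatures are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class LoopReport:
    done: bool
    iterations: int
    feedback: List[str] = field(default_factory=list)


def run_loop(discover: Callable[[], Any],
             plan: Callable[[Any, List[str]], Any],
             execute: Callable[[Any], Any],
             verify: Callable[[Any], Tuple[bool, str]],
             max_iterations: int = 5) -> LoopReport:
    """Discover, plan, execute, verify; iterate until verified or capped."""
    report = LoopReport(done=False, iterations=0)
    for _ in range(max_iterations):
        report.iterations += 1
        work = discover()
        steps = plan(work, report.feedback)  # earlier feedback informs the plan
        result = execute(steps)
        passed, note = verify(result)        # objective signal, not self-opinion
        report.feedback.append(note)
        if passed:
            report.done = True
            break
    return report
```

The iteration cap is the simplest stop condition that bounds token spend; a real loop would usually combine it with a verifiable goal such as a green test suite.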
Open loops vs closed loops
| | Open loop | Closed loop |
| --- | --- | --- |
| Nature | Exploratory; wide search space | Bounded path you designed |
| Risk | Token burn; “slop machine” without gates | Cheaper; predictable |
| Needs | Large budget + strong evaluators | Clear goal, defined steps, stop condition |
| Start here? | Research spikes, benchmarks | Production coding, triage, migrations |
Closed loops need five ingredients on disk: goal (precise done), context (VISION.md, ARCHITECTURE.md, RULES.md), action (scoped tools), feedback (tests, lint, structured errors), and a stop condition (/goal text, Stop hook, or orchestrator brief). Without a quality gate, AI drifts; with one, it improves.
Single-agent loop vs fleet loop
| | Single-agent loop | Fleet loop |
| --- | --- | --- |
| Shape | One brain runs discover→verify end-to-end | Orchestrator splits work across specialists |
| Good for | Focused refactors, /goal migrations | Large features, parallel PRs, research→build→QA chains |
| Token profile | ~50K–200K tokens per medium coding task | ~500K–2M+ when orchestrator + 3+ specialists run |
| Example split | Explore → implement → verify sub-agents | Research specialist → engineering specialist → QA specialist, each with its own loop |
What changed in agentic development
For roughly two years, “good AI coding” meant writing strong prompts and feeding enough context each turn. You typed, read, typed again — the agent was a power tool and you held the handle every step.
Loop engineering is the next layer: a recursive goal where you define purpose and done, and the system iterates until a verifiable condition holds. You design once; the loop pokes agents on a schedule or across turns. This sits one floor above agent harness engineering (the environment one agent runs in) and the factory model (the system that builds software) — same family of ideas, but the harness now runs on a timer, spawns helpers, and feeds itself from disk-based memory.
The six primitives every loop needs
Five action primitives plus persistent state — the shape is the same across major coding-agent products.
| # | Primitive | Job in the loop | Without it |
| --- | --- | --- | --- |
| 1 | Automations | Scheduled discovery and triage | You manually check CI, issues, and diffs |
| 2 | Worktrees | Isolate parallel agent checkouts | Two agents overwrite the same files |
| 3 | Skills | Project knowledge on disk (SKILL.md) | Agent re-guesses conventions every run |
| 4 | Connectors (MCP) | Issues, DB, Slack, staging APIs | Agent only sees the filesystem |
| 5 | Sub-agents | Separate maker and checker roles | One model grades its own homework |
| 6 | State / memory | Markdown, Linear board, AGENTS.md | Model forgets between runs; loop restarts blind |
The agent forgets; the repo does not. Long-running loops depend on external state — not context window — to remember what was tried, what passed, and what is next. Common context files beyond SKILL.md: VISION.md (what success looks like), ARCHITECTURE.md (stack and layout), RULES.md (forbidden actions), GUARDRAILS.md (always-on checklists), and AGENTS.md (repo map for agents).
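To illustrate state that lives outside the context window, here is a minimal sketch that records what was tried per work item in a file on disk. It swaps the Markdown or Linear board mentioned above for a JSON file purely to keep the example short; `load_state`, `record_outcome`, and the file layout are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read the loop's memory, or start empty on the first run."""
    if not path.exists():
        return {"items": {}}
    return json.loads(path.read_text(encoding="utf-8"))


def record_outcome(path: Path, item: str, outcome: str) -> dict:
    """Append an outcome for a work item and persist it atomically."""
    state = load_state(path)
    state["items"].setdefault(item, []).append(outcome)
    # Write to a temp file and rename, so a crash mid-write cannot leave
    # a half-written state file for the next run to trip over.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_name, path)
    return state
```

On the next scheduled run the loop calls `load_state` first, so it can skip items already marked as passed and retry only what is still open.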
Codex app vs Claude Code — same shape, different names
| Primitive | Codex app | Claude Code |
| --- | --- | --- |
| Automations | Automations tab: project, prompt, cadence, local or worktree env; Triage inbox; thread vs standalone runs | |
Once you see the shared shape, the debate shifts from “which tool” to “which loop design still works in either seat.”
1. Automations — the heartbeat
Automations turn a one-off agent run into a loop. In the Codex app you configure project, prompt, schedule, and environment (local checkout or background worktree). Runs with findings land in a Triage inbox; empty runs archive themselves. Internal uses include daily issue triage, CI failure summaries, commit briefings, and regression hunts. Automations can call $skill-name so recurring logic stays maintainable.
Claude Code reaches the same outcome via /loop (interval reruns), cron scheduling, lifecycle hooks, Desktop scheduled tasks (persistent while app is open), Cloud Routines (runs when laptop is closed), or GitHub Actions for headless runs.
Interactive pick: /goal vs /loop vs Stop hooks
| Mechanism | Next turn starts when… | Stops when… | Best for |
| --- | --- | --- | --- |
| /goal (Claude) | Previous turn finishes | Separate evaluator model confirms condition (reads transcript only) | Migrations, refactors, “all tests green” |
| /goal (Codex) | Thread idle after turn | Evidence in thread supports completion; pause/resume/clear/budget | |
# Claude Code — run until tests and lint are clean (v2.1.139+)
/goal all tests in test/auth pass and the lint step is clean
# Check spend and evaluator reasoning
/goal
# Stop early
/goal clear
# Headless single invocation
claude -p "/goal CHANGELOG.md has an entry for every PR merged this week"
# Codex — long-running performance goal (cookbook pattern)
/goal Reduce p95 checkout latency below 120 ms, verified by the checkout benchmark,
while keeping the correctness suite green. If blocked, stop with evidence.
/goal on Claude Code starts a turn immediately; after each turn Haiku (by default) judges yes/no from the transcript — it does not run tools. Codex /goal is thread-scoped with explicit budget accounting and pause/resume. Pair either with auto mode so each turn skips per-tool confirmations.
2. Worktrees — parallel without collisions
Two agents editing the same file is the same failure mode as two engineers on one branch without coordination. A git worktree is a separate working directory on its own branch, sharing history but not files. Codex threads use worktrees natively; Claude Code offers --worktree sessions and isolation: worktree on subagents that clean up after themselves.
Worktrees remove mechanical collision; your review bandwidth still caps how many parallel agents you can actually supervise.
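For loops that manage their own checkouts, the isolation step can be sketched with the standard `git worktree add` command. The branch and directory naming scheme below (`agent/<name>` and a sibling directory) is an assumption made for illustration, as are the helper names; neither Codex nor Claude Code is claimed to use this layout.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_command(repo: Path, agent: str) -> List[str]:
    """Build a `git worktree add` command giving one agent its own checkout."""
    branch = f"agent/{agent}"                       # hypothetical naming scheme
    checkout = repo.parent / f"{repo.name}-{agent}"  # sibling directory
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(checkout)]


def create_worktree(repo: Path, agent: str) -> None:
    """Create the worktree, raising CalledProcessError if git refuses."""
    subprocess.run(worktree_command(repo, agent), check=True)
```

Keeping command construction separate from execution makes the naming scheme easy to review and test without touching a real repository.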
3. Skills — stop paying intent debt every session
Agents start cold. Every missing convention becomes a confident guess — intent debt. A skill is intent written outside the chat: a folder with SKILL.md, optional scripts, references, and assets. Both Codex and Claude Code load skills when you invoke $name or when the task matches a tight, boring description (clever descriptions match too often).
Skill vs plugin: the skill is the authoring format; a plugin bundles skills and connectors for teammates to install once.
4. Connectors — act in your real environment
MCP connectors let the loop read Linear/Jira, query databases, hit staging APIs, and post to Slack. That is the difference between “here is the fix” and “open the PR, link the ticket, ping the channel when CI is green.” Plugins package connectors with skills so onboarding is one install, not tribal memory.
Feedback signals that keep loops honest
A loop with nothing to push against is just the agent agreeing with itself — layer deterministic, perceptual, and critic signals.
| Signal type | Examples | Strength |
| --- | --- | --- |
| Deterministic oracles | CI, unit tests, type checks, linters, git diff, scalar metrics (e.g. benchmark p95) | Strongest: pass/fail without model judgment |
| Perceptual / visual | Playwright, browser MCP tools, layout screenshots | Medium: catches UI regressions code tests miss |
| Critic sub-agents | Separate reviewer agent; forces retry or stop | Medium: judgment, but not from the worker context |
| Persistent context | GUARDRAILS.md, skills, checklists loaded every run | Always-on oracle |
| LLM self-critique only | “Does this look good?” from same model | Weakest: rationalises its own mistakes |
Strongest systems stack multiple signal types: deterministic for reliability, visual/critic for judgment, human gates on high-stakes merges. Signals must route back automatically — full logs, diffs, scores — without you copy-pasting CI output each turn.
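A rough sketch of stacking signals might combine them into a single verdict. The policy encoded here is an assumption, not a rule from any tool: a loop with no deterministic signal is treated as blocked, any failed signal forces a retry, and high-stakes changes always go to a human. `Signal` and `verdict` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Signal:
    name: str
    kind: str     # "deterministic", "perceptual", or "critic"
    passed: bool


def verdict(signals: Iterable[Signal], high_stakes: bool = False) -> str:
    """Combine layered signals into blocked, retry, human_review, or accept."""
    collected = list(signals)
    if not any(s.kind == "deterministic" for s in collected):
        return "blocked"       # nothing objective for the loop to push against
    if any(not s.passed for s in collected):
        return "retry"         # route the failure back to the worker
    return "human_review" if high_stakes else "accept"
```

Treating a missing deterministic oracle as a hard stop reflects the point made above: without one, the loop is only the agent agreeing with itself.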
5. Sub-agents — maker vs checker
The highest-leverage split: implement in one agent, verify in another — including /goal’s separate done-evaluator.
The model that wrote the code is too lenient grading itself. A second agent — different instructions, sometimes a different model — catches rationalised mistakes. Typical trio: explore, implement, verify against spec. In fleet setups, a validator agent reports truth without fixing — failures loop back to the builder.
# Codex — custom subagent (simplified .codex/agents/security-reviewer.toml)
name = "security-reviewer"
description = "Read-only security pass on diffs"
instructions = "Find auth, injection, and secret-leak risks. No edits."
model = "strong"
reasoning_effort = "high"
Sub-agents cost extra tokens (each runs its own model + tools). Spend them where a second opinion unlocks unattended runs — the only reason you can walk away from a loop.
Orchestration — when one loop is not enough
Single-session /goal loops solve “finish this migration without me re-prompting.” Fleet-scale work needs an orchestration layer: deterministic plumbing plus an orchestrator agent for judgment.
| Layer | Job | Examples |
| --- | --- | --- |
| Deterministic plumbing | Route environmental feedback automatically | CI fail → inject logs into worker session; PR conflict → notify right agent; lifecycle state machine (working → ci_failed → review_pending → merged) |
| Orchestrator agent | Decompose goals, write briefs, batch parallel work | Research agent → spec → tracking issue → N workers in isolated worktrees |
| Human gates | Vision, acceptance, high-risk merges | Triage inbox, PR approval; optimise human time, not remove humans |
Open-source reference implementations like Agent Orchestrator (npm install -g @aoagents/ao) ship reactions engines, worktree isolation, and orchestrator prompts out of the box. The pattern: inner agents execute in bounded loops; outer orchestrator coordinates; environmental signals keep loops honest; you stay on vision and judgment.
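The lifecycle state machine named in the table can be sketched as a lookup of allowed transitions. The four states come from the article; the event names and the exact transition set are assumptions made for illustration and are not taken from Agent Orchestrator or any other implementation.

```python
# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    ("working", "ci_failed"): "ci_failed",
    ("working", "ci_passed"): "review_pending",
    ("ci_failed", "ci_passed"): "review_pending",
    ("review_pending", "changes_requested"): "working",
    ("review_pending", "approved"): "merged",
}


def advance(state: str, event: str) -> str:
    """Return the next lifecycle state, rejecting transitions not listed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}") from None
```

Keeping the routing in a deterministic table means the orchestrator agent only has to exercise judgement on decomposition and briefs, not on where a CI failure should go.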
Walkthrough: one morning triage loop
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
sequenceDiagram
participant Auto as Morning automation
participant Skill as Triage skill
participant State as STATE.md
participant WT as Worktree
participant Maker as Fix sub-agent
participant Check as Review sub-agent
participant MCP as Connectors
Auto->>Skill: Run on schedule
Skill->>State: Write CI failures + issues
loop Each actionable item
Auto->>WT: Open isolated checkout
WT->>Maker: Draft fix
Maker->>Check: Submit diff
Check-->>Maker: Approve or reject
Maker->>MCP: Open PR + update ticket
end
Auto->>State: Log done / blocked for human inbox
Findings — Written to STATE.md or a Linear board (memory outside the chat).
Per item — New worktree → maker sub-agent drafts fix → checker sub-agent runs against project skills + tests.
Ship — Connectors open PR and update tickets; blocked items land in your inbox.
Tomorrow — State file tells the loop what was tried, passed, or still open.
You designed this once. You did not prompt each step — that is the whole point.
Prompt engineer vs loop engineer
| Prompt engineer | Loop engineer |
| --- | --- |
| Crafts better instructions per turn | Designs feedback cycles and stop conditions |
| Linguistic skill | Systems / software engineering skill |
| Better single output | Reliable verified outcomes across runs |
| You review manually each time | System self-corrects against oracles |
| You are the feedback loop | The loop is the feedback loop |
| “Write me a function” | “Write → test → fix until green” |
Self-check: is your loop healthy?
| Question | Healthy loop | Leaky loop |
| --- | --- | --- |
| What proves “done”? | Tests, lint, measurable condition in /goal | Agent says “looks good” |
| Where does memory live? | Repo file or issue tracker | Only in chat context |
| Who verifies? | Separate sub-agent or evaluator model | Same agent that wrote code |
| What pushes back? | Layered oracles (CI + critic + human gate) | Self-critique only |
| Parallelism? | One worktree per agent | Shared checkout |
| Token budget? | Turn cap in condition or manual clear | Open-ended overnight /goal |
| Your role? | Review merged outcomes you understand | Press go and hope |
What loops do not remove — three sharper risks
Verification stays human
An unattended loop is also an unattended mistake machine. Even with a verifier sub-agent, “done” is a claim, not proof. Ship code you confirmed works — especially when diff sizes balloon because agents touch more files than necessary.
Comprehension debt accelerates
The faster the loop ships code you did not write, the wider the gap between what exists and what you understand. Read the reasoning, skim the diff, trace the decision log — or the loop makes the debt grow faster, not slower.
Cognitive surrender
When automation feels smooth, it is tempting to stop having opinions. Loop design with judgement keeps you the engineer; loop design to avoid thinking is the same UI with opposite outcomes. Two teams can run identical loops — one moves faster on work they deeply understand; the other outsources understanding entirely. The loop cannot tell the difference. You can.
Parallel pattern: scheduled content factories
The same week loop engineering went mainstream for coding, creators published parallel “factory” playbooks for media. @0x_fokki’s X Article I Built an AI Animation Factory That Runs 24/7 is not a coding-agent harness — Claude is used as a scriptwriter, not a repo editor — but it shows the same design move: stop hand-driving each step, design a pipeline that runs on a schedule with human approval gates.
Same loop instinct in two domains — you design the system and the gates, not every intermediate prompt.
Fokki’s pipeline chains six tools end-to-end:
Claude → Midjourney → Runway → ElevenLabs → Suno → Make
script → frames → motion → voice → music → publish
One Make scenario runs Monday and Thursday at 08:00: pull scripts from Google Drive, batch Midjourney scene prompts, download frames, send dialogue to ElevenLabs, pair images with Runway motion clips, assemble in a CapCut template, upload to YouTube with generated metadata, clip a 30-second X preview, post Patreon early access, and ping Telegram on completion. A separate on-demand webhook turns client briefs into finished explainers in shared Drive — quoted turnaround ~6 hours after a one-time ~5-hour setup.
Four SKUs share the pipeline: animated story series (6–10 min), brand explainers (60–90 sec), motion comics, and children’s bedtime channels. The human job is narrow: pick the story, pick the style, approve the output — roughly four hours of direction for a “24/7” factory, per the author.
| Loop-engineering primitive | Fokki factory analogue | Key difference |
| --- | --- | --- |
| Automations | Make.com schedule + webhook | No /goal or hooks; cron-style triggers only |
| Skills / context on disk | Reusable Midjourney character sheets, CapCut templates, voice cast notes | Creative consistency prompts, not SKILL.md |
| Sub-agent split | Tool specialization per stage (script vs frames vs motion) | No verifier sub-agent; human approves final cut |
| Connectors | Drive, YouTube, Patreon, Telegram APIs | Distribution stack, not MCP issue trackers |
| Feedback signal | Views, RPM, client acceptance | Business metrics, not CI, lint, or test gates |
| State / memory | Organised Drive folders per episode | Asset library, not AGENTS.md |
What transfers to coding loops
Scheduled heartbeat — the factory does not wait for you to open a chat; neither should triage or CI-repair loops.
Stage-specialised tools — one brain trying to script, illustrate, animate, and score is the creative version of one agent grading its own code.
Performance direction in prompts — Fokki writes ElevenLabs stage direction (pauses, volume drops), not raw dialogue paste; coding loops need equally explicit done conditions in /goal text.
Human gate on output — “approve the episode” maps to Triage inbox review and PR merge — optimise human time, do not remove judgment.
Setup once, run indefinitely — the Make scenario is the media equivalent of wiring automations + skills once, then letting the loop compound.
Treat revenue figures in social factory posts as illustrative, not audited benchmarks. The architectural lesson is stable: factories, whether code or content, are designed loops with explicit stages, schedules, and gates. Coding loop engineering just demands harder oracles (tests, type checks, diffs), because a false “shipped” is easier to pass off in code than a false “sounds convincing” is in media.
Token economics and balance
| Pattern | Approximate token load | Mitigation |
| --- | --- | --- |
| Single-agent medium coding loop | 50K–200K per run | Turn caps in /goal; cheaper model for explore/review |
| Fleet (orchestrator + 3 specialists) | 500K–2M+ per cycle | Batch only parallelisable work; stuck detection |
| Scheduled daily automation | Millions per week if always-on | Archive empty runs; scope skills tightly |
| Sub-agents + /goal evaluator | Multiplicative per child session | Spend sub-agents on high-risk paths only |
Loops are not free — patterns diverge wildly if you are “token rich” vs “token poor.” Direct prompting still matters for ambiguity and architecture. Loops handle repetition; you handle judgement. The leverage point moved — it did not disappear.
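As a rough worked example of the table above: a single-agent loop at 50K–200K tokens per run, scheduled once a day, already costs 350K–1.4M tokens a week, which is how an always-on automation reaches "millions per week". The helper names below are hypothetical, and the per-run figures are the table's approximations, not measurements.

```python
def weekly_tokens(per_run_low, per_run_high, runs_per_day, days=7):
    """Return the (low, high) weekly token range for a scheduled loop."""
    runs = runs_per_day * days
    return per_run_low * runs, per_run_high * runs


def with_sub_agents(parent_tokens, children, tokens_per_child):
    """Each child session carries its own context, so cost grows per child."""
    return parent_tokens + children * tokens_per_child


# One daily run at the single-agent range from the table.
low, high = weekly_tokens(50_000, 200_000, runs_per_day=1)
print(f"weekly range: {low:,} to {high:,} tokens")
```

Running the same loop hourly multiplies both ends by 24, which is why empty runs are worth archiving rather than repeating.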
Performance summary
| Dimension | Prompt era | Loop era |
| --- | --- | --- |
| Your job | Write each turn | Design discover → plan → execute → verify → remember |
| Core cycle | Ask → answer | Five stages until verifiable done |
| Primitives | Context + prompt | 6 shared building blocks (both major tools) |
| Done signal | You decide to stop | /goal evaluator, Stop hook, or environmental oracles |
| Scale | One thread | Worktrees + sub-agents + orchestration layer |
| Feedback | Your eyes | Layered oracles, not self-critique alone |
| Knowledge | Re-explained each session | Skills + VISION.md / AGENTS.md compound |
| Risk profile | Slower, more oversight | Faster, higher verification + comprehension debt |
| Bottom line | n/a | Build the loop; stay the engineer who reviews what ships |
Research supplement
The following documentation pages from the official Claude Code docs provide additional technical depth beyond the article's reference links:
Scheduled Tasks (/loop): The Scheduled Tasks reference details how /loop works alongside cloud Routines and Desktop scheduled tasks, including the full comparison table of scheduling options, jitter behaviour, seven-day expiry, and the loop.md customisation mechanism. Notably, dynamic /loop schedules can use the Monitor tool internally to stream background process output, avoiding polling entirely.
Agent Loop Architecture: The Agent SDK: How the agent loop works page documents the full turn-and-message lifecycle, context window management, automatic compaction, and how max_turns / maxBudgetUsd bounds apply. It also explains how subagents start with a fresh conversation context, which has direct implications for keeping loop context efficient over long runs.
Key technical detail not in the primary reference links: The /goal command is implemented as a session-scoped prompt-based Stop hook. This means developers who need evaluation logic beyond a short text condition (for example, running an actual script to verify state) can write a custom Stop hook instead — which gives them the same turn-by-turn evaluation model with full scripting power.
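A custom Stop hook of that kind might look like the sketch below. It rests on assumptions that should be checked against the current hooks reference: that the hook receives a JSON payload on stdin, that the payload carries a `stop_hook_active` flag, and that exit code 2 blocks the stop and returns stderr to the agent. The check command `pytest -q` is only a placeholder oracle.

```python
import json
import subprocess
import sys


def decide(payload, check_cmd):
    """Return (exit_code, message) for one Stop hook invocation."""
    # Assumed field: set when the agent is already continuing because an
    # earlier Stop hook blocked it. Allowing the stop here avoids endless loops.
    if payload.get("stop_hook_active"):
        return 0, ""
    result = subprocess.run(check_cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return 0, ""  # oracle is green: allow the stop
    tail = (result.stdout + result.stderr)[-2000:]
    return 2, "Verification failed, keep working:\n" + tail


def main(check_cmd=("pytest", "-q")):
    """Entry point for the hook script: read the hook payload from stdin."""
    code, message = decide(json.load(sys.stdin), list(check_cmd))
    if message:
        print(message, file=sys.stderr)
    return code

# In the hook script itself, finish with: sys.exit(main())
```

The `stop_hook_active` guard is deliberately conservative and allows only one forced continuation; a longer-running loop would more likely swap it for an explicit attempt cap kept in a file.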
Anthropic doubled Claude Cowork’s five-hour session rate limits for Pro, Max, and Team subscribers from 5 June through 5 July 2026, leaving weekly caps and the shared quota across Claude products unchanged.
| Field | Detail |
| --- | --- |
| Date | Announced 5 June 2026; promotion through 5 July 2026 |
| Vendor | Anthropic |
| Product | Claude Cowork (desktop knowledge-work agent) |
| Availability | Claude Pro, Max, and Team paid plans; Cowork only, not Claude Code or chat-specific boosts |
| Pricing / limits | 2× five-hour rolling session allowance; weekly usage cap static; quota shared with Claude.ai and Claude Code |
What changed
Boris Cherny, who leads Claude Code at Anthropic, announced the promotion on 5 June 2026 via social post—no dedicated article appeared on the Anthropic newsroom index by 9 June 2026.
Claude Cowork five-hour rolling session limits are doubled for approximately one month, ending 5 July 2026.
Eligible plans: Claude Pro, Claude Max, and Claude Team.
The change applies to five-hour rate-limit windows only—Anthropic’s weekly usage cap is unchanged.
Claude Code and Claude.ai retain standard session limits; the promotion is Cowork-specific.
Subscription quota remains a shared pool across Claude surfaces—heavier Cowork bursts can still exhaust the weekly budget faster.
Why it matters for engineers
Anthropic meters paid plans with two leaky buckets: a five-hour rolling session window for burst fairness and a weekly cap for cost control. Doubling only the first bucket optimises long desktop agent runs—folder reorganisation, batch report generation, scheduled digests—without raising Anthropic’s weekly compute exposure. Teams scheduling Cowork jobs should treat the promotion as session headroom, not unlimited capacity.
Cowork is not the Claude API. It runs in the desktop app with filesystem and Office integration, autonomous loops, and user approval gates—ideal for knowledge-worker delegation, unsuitable for production services. Engineers should keep CI and production agents on API metering while pilots use Cowork inside the promo window for deferred “messy folder” projects Cherny highlighted.
Unified quota across Cowork, Claude Code, and web chat means platform leads need allocation policy. A seat running heavy Code sessions the same week as a doubled Cowork migration may hit the unchanged weekly ceiling before the session window resets. Monitor Settings → Usage for both progress bars before kicking off multi-hour agent tasks.
Enterprise admins already manage Cowork feature access and org spend caps separately from consumer tiers. Communicate the 5 July revert date so programme managers do not assume permanent 2× session limits in capacity plans.
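The interaction between the two limits can be sketched with a small simulation. This is a simplification that uses sliding-window sums rather than true leaky buckets, and every number in it is made up for illustration, since the real quotas are not given in the article; `DualLimit` and its parameters are hypothetical names.

```python
from collections import deque


class DualLimit:
    """Sliding-window stand-in for a five-hour session limit plus a weekly cap."""

    def __init__(self, session_cap, weekly_cap, session_hours=5, week_hours=168):
        self.session_cap = session_cap
        self.weekly_cap = weekly_cap
        self.session_hours = session_hours
        self.week_hours = week_hours
        self.events = deque()  # (hour, units) in time order

    def _used(self, now, window):
        return sum(units for hour, units in self.events if now - hour < window)

    def try_spend(self, now, units):
        """Record the spend and return 'ok', or name the limit that blocked it."""
        while self.events and now - self.events[0][0] >= self.week_hours:
            self.events.popleft()
        if self._used(now, self.session_hours) + units > self.session_cap:
            return "session"
        if self._used(now, self.week_hours) + units > self.weekly_cap:
            return "weekly"
        self.events.append((now, units))
        return "ok"
```

With a doubled `session_cap` and an unchanged `weekly_cap`, a large job that was previously blocked by the session window now fits, but a few such jobs in a row are stopped by the weekly limit instead.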
Anthropic doubled the five-hour Cowork usage bucket for eligible paid plans from 5 June through 5 July 2026 whilst leaving weekly caps unchanged.
Limit windows over the promotion
flowchart TB
START["5 Jun 2026 promo starts"]
SESSION["Five-hour rolling window resets continuously"]
DOUBLE["Cowork session allowance 2x"]
WEEKLY["Weekly cap unchanged"]
SHARED["Shared pool: Cowork chat and Code"]
END["5 Jul 2026 promo ends"]
START --> DOUBLE
DOUBLE --> SESSION
SESSION --> SHARED
SHARED --> WEEKLY
WEEKLY --> END
classDef agent fill:#8B0000,color:#fff
classDef tool fill:#189AB4,color:#fff
class DOUBLE agent
class WEEKLY tool
Timeline view: session windows roll continuously and temporarily widen for Cowork; the weekly ceiling and cross-product pool stay fixed.
Research supplement
No sources beyond those provided by the author were verified. The sections above draw exclusively on the article text and the three reference URLs supplied (claude.com/product/cowork, support.anthropic.com/en/articles/9797557-usage-limit-best-practices, claude.com/pricing).
Microsoft’s 2026 Work Trend Index gives engineering leaders a vocabulary for human–agent collaboration and ships Copilot Cowork mobile, plugins, and Agent 365 so Frontier Firms can orchestrate work across Microsoft and third-party systems.
| Field | Detail |
| --- | --- |
| Date | 5 May 2026 (report and product wave); third-party Cowork plugins from 12 May 2026 |
| Vendor | Microsoft |
| Product | 2026 Work Trend Index; Microsoft 365 Copilot; Copilot Cowork; Microsoft Agent 365 |
| Availability | WTI report on WorkLab; Cowork on iOS and Android; native Fabric and Dynamics 365 plugins GA; federated connectors GA (HubSpot, LSEG, Moody’s, Notion) |
| Pricing / limits | Report is free; Copilot stack via existing M365 Copilot and E7 SKUs; no new price point in this release |
What changed
Microsoft named four collaboration patterns—Author, Editor, Director, and Orchestrator—and argued leaders must match workstreams to the right pattern rather than defaulting every process to multi-agent orchestration.
The 2026 Work Trend Index analysed trillions of anonymised Microsoft 365 signals and surveyed 20,000 AI-using knowledge workers across ten countries (February–April 2026).
49% of sampled Copilot chats support cognitive work; 58% of AI users produce work they could not a year ago, rising to 80% among Frontier Professionals.
Microsoft described a Transformation Paradox: 65% fear falling behind without AI, yet 45% prefer current goals over redesigning work, and only 13% feel rewarded for AI-driven reinvention.
Organisational factors—culture, manager support, talent practices—account for more than twice the reported AI impact of individual mindset (67% vs 32%).
Respondents map to five readiness zones: Frontier (19%), Blocked Agency (10%), Unclaimed Capacity (5%), Stalled (16%), and Emergent (50%).
Copilot Cowork Mobile launched on iOS and Android; native plugins for Dynamics 365 and Fabric are GA, with partner plugins (LSEG, Miro, monday.com, S&P Global Energy) rolling out.
Custom plugins let organisations codify internal workflows; federated Copilot connectors are GA in Researcher and Microsoft 365 Copilot Chat.
Microsoft Agent 365 is the control plane for governing, observing, and securing agents at scale, including visibility into local agents.
Why it matters for engineers
Platform teams often ship agents without changing incentives. The WTI data suggests most adoption friction is organisational, not model quality—skilled builders frequently land in Blocked Agency zones where legacy metrics punish workflow redesign. Pair agent rollouts with evaluation criteria that reward reinvention, not only throughput.
The four-pattern ladder is a practical safety taxonomy. Author and Editor modes suit low blast-radius tasks with human review on every artefact. Director mode needs job isolation, rollback, and audit trails. Orchestrator mode demands a control plane—Agent 365 in Microsoft’s stack—for connector scopes, identity, and exception routing. The same framing applies whether you build on Copilot or run Claude Code beside it.
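One way to make that taxonomy checkable is a small lookup from pattern to required controls. The control names below are hypothetical labels lifted from the paragraph above, and treating Orchestrator as a superset of Director is an assumption about the "ladder", not something the report states.

```python
DIRECTOR_CONTROLS = {"job_isolation", "rollback", "audit_trail"}

REQUIRED_CONTROLS = {
    "author": {"human_review"},
    "editor": {"human_review"},
    "director": DIRECTOR_CONTROLS,
    # Assumption: orchestration inherits everything Director mode needs.
    "orchestrator": DIRECTOR_CONTROLS
    | {"control_plane", "connector_scopes", "identity", "exception_routing"},
}


def missing_controls(pattern, in_place):
    """List the controls a workstream still lacks for its chosen pattern."""
    return sorted(REQUIRED_CONTROLS[pattern] - set(in_place))


print(missing_controls("director", ["rollback"]))
```

A rollout checklist could refuse to promote a workstream to the next pattern until this list is empty.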
Cowork’s plugin and connector model is the integration surface to design for: native first-party data (Fabric, Dynamics), packaged partner actions, and custom plugins for proprietary expertise. Federated connectors let agents read external knowledge without migrating data. That graph-of-connectors pattern is portable beyond M365.
Frontier Professionals—multi-step agent users who redesign workflows and publish team standards—are a benchmark for internal playbooks. They pause to allocate human versus AI work, deliberately practise skills without AI, and treat model output as draft material. Telemetry showing 49% of Copilot use in cognitive tasks suggests backlog priority belongs in analysis and synthesis features, not generic chat wrappers.
Frontier Firms redesign work around human–agent teams: people set goals and own accountability whilst agents execute repeatable analysis and orchestration.
Readiness zones at a glance
flowchart LR
subgraph lowOrg["Low organisational readiness"]
ST["Stalled 16%"]
EM["Emergent 50%"]
end
subgraph highOrg["High organisational readiness"]
UC["Unclaimed capacity 5%"]
FR["Frontier 19%"]
end
subgraph indiv["Individual capability"]
LO["Low"]
HI["High"]
end
BA["Blocked agency 10%"]
HI --> BA
BA --> lowOrg
FR --> highOrg
HI --> FR
LO --> ST
classDef agent fill:#8B0000,color:#fff
classDef tool fill:#189AB4,color:#fff
class FR agent
class BA tool
Matrix view: Frontier sits where individual skill and organisational support reinforce each other; Blocked Agency is the engineering-heavy zone where talent outruns incentives.
Research supplement
No additional sources were verified for this supplement. The following factual claims from the article would benefit from independent corroboration if this supplement is expanded in a future pass:
The 67% vs 32% organisational/individual split — the WTI methodology appendix (available at aka.ms/2026WorkTrendIndexAnnualReport) should be consulted to confirm how these figures were derived from the survey data.
Agent 365 GA and Microsoft 365 E7 SKU details — pricing and availability can be verified against the Tech Community announcement at the reference URL provided by the author.
Federated connector GA status — HubSpot, LSEG, Moody's, and Notion connector availability can be confirmed via the Microsoft 365 Copilot release notes.
Apple is extending Private Cloud Compute to Google Cloud NVIDIA GPU clusters so the heaviest Apple Intelligence workloads can run on third-party infrastructure without abandoning stateless, attestable privacy guarantees.
| Field | Detail |
| --- | --- |
| Date | 9 June 2026 (Apple Security Research blog) |
| Vendor | Apple (hosted on Google Cloud with NVIDIA and Intel silicon) |
| Product | Private Cloud Compute (PCC) on Google Cloud for Apple Intelligence cloud inference |
| Availability | Summer 2026 preview with gradual ramp to full protection set; further detail at Confidential Computing Summit and in an updated PCC Security Guide |
| Pricing / limits | Consumer Apple Intelligence feature (no public API); security researchers gain binary inspection and bounty-programme access to research-mode nodes |
What changed
PCC leaves Apple-only data centres. For the first time, Apple Intelligence cloud inference runs on Google Cloud systems, whilst Apple retains cryptographic control over which PCC software builds devices will trust.
New hardware trust stack. The implementation combines NVIDIA Confidential Computing GPUs, Intel CPUs with Trust Domain Extensions (TDX), and Google’s Titan security chip — replacing the Apple-silicon-only hosts used since PCC launched in 2024.
Foundation model collaboration. Apple worked with Google to apply Gemini-family techniques when building next-generation Apple Foundation Models; on-device tiers still handle lighter tasks, but agentic tool-use and complex reasoning target the cloud tier on NVIDIA hardware.
Supply-chain and attestation hardening. Apple maintains a cryptographically verifiable, append-only ledger of every Google Cloud machine in the PCC fleet. Components that could exfiltrate data if compromised are attested with at least two independent vendor roots of trust.
Architectural patterns carry over. Initial request parsing runs in a dedicated namespaced process; shared inference processes recycle on a short time-to-live; attested keys live in a separate confidential VM isolated from external inputs.
Transparency programme unchanged. PCC binaries remain published for public inspection, with research tooling and live research-mode nodes offered through the Apple Security Bounty Programme.
Why it matters for engineers
Confidential VMs and GPU encryption are now commodity cloud options. Apple’s claim is different: those primitives have not, until now, been composed into an end-to-end confidential inference pipeline that also ships public binaries and bounty-grade verification at global scale. PCC on Google Cloud is a reference for treating the entire stack — firmware through application code — as the trusted computing base, rather than trusting the guest VM boundary alone.
Platform teams building multi-tenant AI should study the operational patterns, not only the silicon. Stateless computation is enforced through short-lived inference workers and isolated parsers, reducing the blast radius if a host is misconfigured. Hardware inventory ledgers matter when you neither manufacture servers nor operate the facility: they convert supply-chain risk into auditable state. Dual roots of trust make it harder for a single vendor compromise to forge the entire attestation story.
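Two of those patterns can be illustrated generically. The sketch below uses a plain SHA-256 hash chain as a stand-in for an append-only inventory ledger and a distinct-vendor count as a stand-in for the dual-root rule. It is not Apple's format or protocol, which the announcement does not describe; all function names and record fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64


def entry_hash(prev_hash, record):
    """Hash a record together with the previous entry's hash."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()


def append(ledger, record):
    """Add a record whose hash commits to the entire history before it."""
    prev = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"record": record, "prev": prev, "hash": entry_hash(prev, record)})


def verify(ledger):
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev = GENESIS
    for entry in ledger:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


def trusted(attestations, min_roots=2):
    """attestations: (vendor_root, verified) pairs; need distinct passing vendors."""
    return len({root for root, ok in attestations if ok}) >= min_roots
```

The point of the chain is that an auditor only needs the latest hash to detect a rewritten inventory, and the point of the vendor count is that one compromised root cannot vouch for a machine on its own.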
For Apple Intelligence client engineers, the device-side contract is stable: only Apple-cryptographically-approved PCC releases execute, regardless of whether inference lands on Apple metal or a Google Cloud A3-class confidential GPU node. Preview ramp during summer 2026 means protection depth may converge over weeks — plan feature flags and telemetry accordingly until Apple declares parity with Apple-data-centre PCC.
Security researchers should watch the Confidential Computing Summit session and the forthcoming PCC Security Guide update for attestation quote formats, research-node access mechanics, and fleet geography. Until then, treat this announcement as architectural intent with preview availability, not a finished open inference API.
Apple Private Cloud Compute extends its privacy envelope to Google Cloud nodes using NVIDIA confidential GPUs, Intel TDX, and Titan-backed attestation.
flowchart LR
DEV["Apple device"]
TRUST["Apple-approved PCC client"]
NODE["Confidential cloud node"]
GPU["Stateless GPU inference"]
RESP["Encrypted response"]
DEV --> TRUST
TRUST --> NODE
NODE --> GPU
GPU --> RESP
RESP --> DEV
classDef agent fill:#8B0000,color:#fff
classDef tool fill:#189AB4,color:#fff
class NODE,GPU tool
class DEV,RESP agent
Amazon Bedrock now documents EU geographic cross-region inference profiles so teams in Europe can pool model capacity across Union Regions whilst keeping prompts and outputs inside a fixed EU routing boundary.
| Field | Detail |
| --- | --- |
| Date | 26 May 2026 (AWS Machine Learning blog) |
| Vendor | Amazon Web Services |
| Product | Amazon Bedrock: Cross-Region Inference (CRIS), EU system-defined inference profiles |
| Availability | Commercial Bedrock Regions; EU profiles route only to EU destination Regions (with London and Zurich source exceptions per AWS rules) |
| Pricing / limits | No separate routing fee; billed from source Region; global profiles offer ~10% savings on some models; inference profiles do not support Provisioned Throughput |
What changed
Inference profile IDs replace plain model IDs. Applications opt into CRIS by passing system-defined profile strings such as eu.amazon.nova-2-lite-v1:0 (EU geographic) or global.amazon.nova-2-lite-v1:0 (global commercial) to Converse, InvokeModel, streaming APIs, batch jobs, Agents, and knowledge-base generation.
EU geographic profiles constrain destination Regions. All destinations in EU CRIS lie within the European Union. Requests from EU sources cannot be routed to non-EU commercial Regions whilst using an eu.* profile.
London and Zurich are special-cased. Sources in eu-west-2 may route among EU Regions plus London; eu-central-2 sources among EU Regions plus Zurich. Non-EU sources using EU profiles are optimised across the source Region and EU destinations only.
Geographic profile Region lists are static. AWS will publish a new inference profile ID rather than silently expanding an existing EU geography definition.
Audit fields ship in CloudTrail. Invocation metadata is logged in the customer source Region; additionalEventData.inferenceRegion records where Bedrock actually processed the request. Optional Model Invocation Logging keeps full payloads in the source Region only.
Compliance framing is explicit. The post ties CRIS to GDPR records-of-processing expectations, IAM least privilege, and Amazon Bedrock’s inclusion in the CISPE Data Protection Code of Conduct.
Why it matters for engineers
EU SaaS teams no longer choose between single-Region throttling and unaudited multi-Region sprawl. EU CRIS is a deliberate contract: your SDK client stays in a familiar source Region, but Bedrock may execute inference in another EU Region selected for capacity. Inter-Region traffic remains on the AWS private backbone with encryption in transit — a detail that matters when security reviewers ask whether prompts leave controlled networks.
The integration surface is small; the governance surface is not. IAM policies for geographic CRIS must grant bedrock:InvokeModel on the inference profile and on foundation-model ARNs in every destination Region listed for that profile, often conditioned on bedrock:InferenceProfileArn. Service Control Policies that block any destination Region in the profile will fail requests even when the source Region is allowed. Cross-Region inference can also target Regions you have not manually enabled — SCP design must allow the full destination set.
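The two-statement shape described above can be sketched as a policy builder. The account ID and the destination Region list in the usage note are placeholders: the real destination set must be read from the profile's documentation, and `build_geo_cris_policy` is a hypothetical helper, not an AWS API.

```python
import json


def build_geo_cris_policy(account_id, source_region, profile_id, model_id,
                          destination_regions):
    """Build an IAM policy document for invoking one geographic CRIS profile."""
    profile_arn = (
        f"arn:aws:bedrock:{source_region}:{account_id}:inference-profile/{profile_id}"
    )
    model_arns = [
        f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
        for region in destination_regions
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Allow calling the inference profile from the source Region.
                "Sid": "InvokeViaProfile",
                "Effect": "Allow",
                "Action": ["bedrock:InvokeModel"],
                "Resource": [profile_arn],
            },
            {   # Allow the underlying models, but only when reached via the profile.
                "Sid": "ModelsOnlyViaProfile",
                "Effect": "Allow",
                "Action": ["bedrock:InvokeModel"],
                "Resource": model_arns,
                "Condition": {
                    "StringLike": {"bedrock:InferenceProfileArn": profile_arn}
                },
            },
        ],
    }
```

For example, `json.dumps(build_geo_cris_policy("111122223333", "eu-west-1", "eu.amazon.nova-2-lite-v1:0", "amazon.nova-2-lite-v1:0", ["eu-west-1", "eu-central-1"]), indent=2)` renders a document to review; streaming calls would need a matching streaming action added, and any SCP must permit the same destination Regions.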
Operational teams should dashboard inferenceRegion alongside application metrics. That field supports data-protection impact assessments without enabling payload logging. When maximum throughput or ~10% cost savings outweigh residency constraints, global.* profiles remain available — but that is an explicit product decision, not a framework default.
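A dashboard feed for that field could start from something like the sketch below, which tallies `additionalEventData.inferenceRegion` across already-parsed CloudTrail records and flags anything outside an allow-list. The allow-list is passed in explicitly because the profile's destination set should come from AWS documentation; the function name and the sample Regions are illustrative assumptions.

```python
from collections import Counter


def tally_inference_regions(events, allowed_regions):
    """Count invocations per processing Region and collect out-of-policy event IDs."""
    counts = Counter()
    outside = []
    for event in events:
        region = (event.get("additionalEventData") or {}).get("inferenceRegion")
        if region is None:
            continue  # record carries no cross-Region routing detail
        counts[region] += 1
        if region not in allowed_regions:
            outside.append(event.get("eventID"))
    return counts, outside
```

Because it reads only event metadata, this kind of check works without turning on Model Invocation Logging.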
Discover profiles via the Bedrock console cross-Region inference page, per-model Regional availability tables in the user guide, or list_inference_profiles(typeEquals='SYSTEM_DEFINED') from your source Region. Treat profile choice as architecture documentation: EU geographic for GDPR-aligned processing, global for performance-first workloads with accepted cross-border inference risk.
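The programmatic route can be sketched as a paginated listing that keeps only IDs with a given prefix. The client is passed in so the function can be exercised without AWS credentials; the response keys used here (`inferenceProfileSummaries`, `inferenceProfileId`, `nextToken`) follow the boto3 Bedrock client as understood at the time of writing and should be confirmed against the SDK reference.

```python
def list_profile_ids(client, prefix="eu."):
    """Return system-defined inference profile IDs that start with `prefix`."""
    ids, token = [], None
    while True:
        kwargs = {"typeEquals": "SYSTEM_DEFINED"}
        if token:
            kwargs["nextToken"] = token
        page = client.list_inference_profiles(**kwargs)
        for summary in page.get("inferenceProfileSummaries", []):
            if summary["inferenceProfileId"].startswith(prefix):
                ids.append(summary["inferenceProfileId"])
        token = page.get("nextToken")
        if not token:
            return ids

# Usage (needs boto3 and credentials; the Region name is only an example):
#   import boto3
#   bedrock = boto3.client("bedrock", region_name="eu-west-1")
#   print(list_profile_ids(bedrock, prefix="eu."))
```

The chosen ID is then passed as `modelId` to the runtime APIs from the same source Region.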
EU geographic Bedrock inference profiles keep prompts and outputs inside Union Regions whilst pooling capacity across EU destination Regions.
flowchart LR
APP["App in source Region"]
API["Bedrock runtime API"]
ROUTER{"CRIS profile router"}
DEST["Destination Region inference"]
RET["Response to source Region"]
APP --> API
API --> ROUTER
ROUTER --> DEST
DEST --> RET
RET --> APP
classDef agent fill:#8B0000,color:#fff
classDef tool fill:#189AB4,color:#fff
class ROUTER tool
class APP,RET agent
Research supplement
No additional external sources were independently verified for this article. The CISPE Data Protection Code of Conduct certification status for Amazon Bedrock, referenced in the article, should be confirmed directly via the CISPE public register at cispe.cloud. The adequacy decision status for the UK and Switzerland under GDPR Article 45 (relevant to the London and Zurich source-Region edge cases) should be confirmed against current European Commission adequacy decisions, as adequacy status can be revoked or amended.
Xiaomi MiMo and TileRT shipped MiMo-V2.5-Pro-UltraSpeed, a trillion-parameter API tier that sustains roughly 1000 tokens per second decode on a single eight-GPU commodity node—aimed at agent builders who need frontier-scale models inside realtime loops.
| Field | Detail |
| --- | --- |
| Date | 8 June 2026 |
| Vendor | Xiaomi MiMo + TileRT |
| Product | MiMo-V2.5-Pro-UltraSpeed API and trial chat |
| Availability | Application window 9–23 June 2026 (Beijing time); API at platform.xiaomimimo.com/ultraspeed |
| Pricing / limits | ~3× MiMo-V2.5-Pro API price; ~10× decode speed; Token Plan not supported; chat trial capped (10 queues/day, 30 min/session) |
What changed
1000+ tps on 1T MoE. Xiaomi claims the first public trillion-parameter decode above 1000 tokens per second using one standard eight-GPU server, via model–system co-design rather than custom wafer or SRAM-only silicon.
Selective FP4 on experts. MoE expert matrices quantise to FP4 (MXFP4) with quantisation-aware training; routers and attention stay higher precision to protect reasoning and code quality versus naive full-model FP4.
DFlash speculative decoding. Block-level masked parallel drafting replaces serial draft-token generation; reported acceptance lengths reach ~6.3 (coding), ~5.6 (maths/reasoning), and ~4.3 (agent) tokens per verification round with block size eight.
TileRT ultra-low-latency stack. Persistent engine kernels and warp-specialised pipelines cut microsecond execution gaps that dominate at kilohertz decode rates.
Open weights. Hugging Face release XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash ships FP4 weights plus DFlash draft parameters for offline study.
Gated trial. Approved users get free chat at ultraspeed.xiaomimimo.com during the promotion; enterprise partnerships via business-mimo@xiaomi.com.
Why it matters for engineers
Latency redefines what a trillion-parameter model can do. Below roughly ten tokens per second, 1T MoE models sit behind batch jobs and human-tolerated waits. Near 1000 tps, the same weights can participate in parallel Best-of-N search, sub-minute codegen sessions, or millisecond think–act loops in trading, fraud, and clinical triage—without downsizing to a 70B shortcut model.
The architectural lesson is co-design: FP4 shrinks the bandwidth-bound expert matmuls, DFlash lets each verification round emit several accepted tokens instead of one, and TileRT removes the per-operator launch tax. Teams self-hosting the open weights can benchmark the Hugging Face checkpoint on vLLM or SGLang; teams buying API capacity should measure cost per successful agent task during the June trial, not headline tokens per dollar alone.
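The reported acceptance lengths translate into a latency budget with simple arithmetic, assuming decode throughput is roughly accepted tokens per round divided by round latency, and that a "round" covers both drafting and verification. Under that assumption, 1000 tokens per second allows about 6.3 ms per round on coding traces, 5.6 ms on maths, and 4.3 ms on agent traces, against 1 ms per token with no speculation. The function names are hypothetical.

```python
def round_budget_ms(target_tps, accepted_per_round):
    """Milliseconds available per draft-and-verify round at a target throughput."""
    return 1000.0 * accepted_per_round / target_tps


def tokens_per_second(accepted_per_round, round_ms):
    """Throughput implied by an acceptance length and a fixed round latency."""
    return 1000.0 * accepted_per_round / round_ms


for workload, accepted in [("coding", 6.3), ("maths", 5.6), ("agent", 4.3)]:
    print(f"{workload}: {round_budget_ms(1000, accepted):.1f} ms per round")
```

The same relation shows why throughput is workload-dependent: at a fixed round latency tuned for coding traces, the lower agent acceptance length would yield proportionally fewer tokens per second, which is a reason to benchmark on your own traces.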
Treat UltraSpeed as a latency SKU on MiMo-V2.5-Pro, not a new foundation family. Trial pricing and slots end 23 June 2026 unless extended; plan production fallbacks if FP4 quality drifts on your longest agent traces.
MiMo UltraSpeed stacks FP4 expert quantisation, DFlash speculative decoding, and TileRT persistent GPU pipelines to deliver roughly 1000 tokens per second from a one-trillion-parameter MoE on commodity hardware.
flowchart LR
A[Agent request] --> B[MiMo-V2.5-Pro 1T MoE]
B --> C[FP4 expert matmuls]
B --> D[DFlash draft block]
C --> E[TileRT persistent kernels]
D --> E
E --> F[~1000 tps token stream]
Research supplement
No external sources were verified for this supplement. Recommend checking the following primary sources directly for corroboration: the TileRT technical post at tilert.ai detailing the kernel architecture and benchmark methodology; the Hugging Face model card for XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash for QAT methodology and reported eval scores; and the OCP Microscaling Formats specification for MXFP4 format details. Any third-party reproduction benchmarks on vLLM or SGLang that emerge after 9 June 2026 would materially strengthen or challenge the throughput claims.
Early June 2026 delivered one of the densest open-weight release windows on record — spanning chat models, image generation, speech, music, vision, video, and 3D. The roundup below maps 25+ notable drops across modalities, with specs drawn from official model cards and repos rather than hype alone.
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
W[Open-weight release week] --> L[LLMs and MoE chat]
W --> I[Image DiT checkpoints]
W --> A[Audio TTS and ASR]
W --> V[Vision VLMs and OCR]
W --> M[Music and realtime audio]
W --> X[Video world and 3D]
L --> D[Deploy: MLX ONNX vLLM]
I --> D
A --> D
V --> D
M --> D
X --> D
classDef agent fill:#8B0000,color:#fff
classDef hook fill:#189AB4,color:#fff
class W agent
class L,I,A,V,M,X,D hook
Release density by modality — LLMs, image, audio, vision, video, and 3D all shipped open weights in the same window.
The surprise headline: Ideogram 4 shipped its first-ever open weights — a 9.3B flow-matching Diffusion Transformer (DiT) trained from scratch. Reported leaderboard placement: #2 overall behind GPT Image 2 on aggregate arenas, top open-weight on Design Arena and LMArena, with particular strength on text-rich layouts (posters, UI mockups, labelled diagrams).
| Property | Ideogram 4 open |
| --- | --- |
| Architecture | 9.3B DiT, flow matching, native 2K |
| Structured prompts | JSON with bounding boxes and colour palettes |
| Weights | Gated on Hugging Face (ideogram-ai/ideogram-4-nf4, FP8 variants) |
| License split | Apache 2.0 code; non-commercial weight agreement (commercial path via Ideogram) |
The seven highlighted models are grounded in the author's provided reference links (Hugging Face model pages, official blogs, and GitHub repositories). No additional external sources were verified for this supplement. Readers wishing to verify benchmark comparisons, licence terms, or capability claims should consult the original Hugging Face model cards and the official blog posts linked in the article's reference section directly.