Categories
News

Apple WWDC26: Siri AI and Gemini-Backed Foundation Models on Device and Private Cloud Compute

At WWDC26 on 8 June 2026, Apple previewed Siri AI and the next generation of Apple Intelligence on iOS 27, iPadOS 27, and macOS 27—powered by Apple Foundation Models built with Google Gemini and split across on-device Apple silicon and Private Cloud Compute.

FieldDetail
DateAnnounced 8 June 2026 (WWDC26)
VendorApple
ProductsSiri AI; Apple Intelligence across iOS/iPadOS/macOS/watchOS/visionOS 27
Model stackApple Foundation Models (Gemini collaboration); on-device + Private Cloud Compute
Developer frameworksFoundation Models framework (Swift, on-device + PCC + third-party LLMs); Core AI (custom PyTorch on Apple silicon); App Intents for Siri actions
AvailabilityDeveloper Program beta 8 June 2026 (iOS/iPadOS/macOS/visionOS); watchOS Siri beta later; public beta next month; user Siri beta English-first later in 2026; GA fall 2026
Hardware (base)iPhone 16+, iPhone 15 Pro/Max, iPad mini (A17 Pro), M1+ iPad/Mac, MacBook Neo, Vision Pro, Watch S9+/Ultra 2+/SE 3 with paired iPhone
Hardware (advanced on-device)iPhone Air, iPhone 17 Pro/Max, iPad (M4)+ with ≥12GB RAM, Mac (M3)+ with ≥12GB, Vision Pro (M5) — expressive voices, advanced dictation
Pricing / limitsServer-model features (e.g. photorealistic Image Playground) carry daily usage caps; expanded access on most iCloud+ plans (numeric quotas not published); compatible Home cameras included on qualifying iCloud+ tiers
Regional gatesEU: Siri AI on Mac and Vision Pro initially, not iOS/iPadOS/watchOS; China: unavailable pending regulatory work; Apple Intelligence supports 17 languages

What changed

  • Siri AI replaces the legacy assistant with personal-context search across Messages, Mail, and Photos; on-screen and Camera-mode awareness; expanded systemwide app actions; web-grounded answers; and a dedicated Siri app with iCloud-private conversation sync across iPhone, iPad, Mac, Watch, and Vision Pro.
  • Invocation surfaces expand beyond “Hey Siri” to Dynamic Island swipe (iPhone), Spotlight (iPad/Mac), control-click context menus, and Vision Pro look-to-speak with 3D visualisation.
  • On-device plumbing includes a system orchestrator, Spotlight index, and App Toolbox that keep personal-context processing local before escalating frontier workloads.
  • Apple Foundation Models are custom-built in collaboration with Google Gemini for deeply integrated experiences—not exposed as a raw Gemini API to consumers per Apple’s Intelligence announcement.
  • Hybrid execution runs models on device and on Private Cloud Compute; PCC retains Apple’s no-storage privacy promise with ongoing external verification.
  • Image Playground adds photorealistic generation on PCC with hidden SynthID watermarks; Photos gains Spatial Reframing and other on-device intelligence features.
  • Developer betas for Siri AI ship 8 June 2026 on iOS, iPadOS, macOS, and visionOS; watchOS follows in a future beta.

Developer integration surface

Foundation Models framework (Swift) is the primary LLM integration path: on-device sessions, Private Cloud Compute for frontier tasks, tool calling, Dynamic Profiles for multi-model routing, and third-party models via the Language Model protocol (Gemini, Claude, and others). Apple plans to open-source the framework core later in summer 2026. Use it when you want Apple-hosted intelligence inside your app without managing API keys or PCC authentication.

Core AI is a separate stack for deploying custom PyTorch models on Apple silicon—Python conversion tools, ahead-of-time compilation in Xcode, Swift inference APIs, and Core AI debugging instruments. Use Core AI when you bring your own weights; use Foundation Models when you consume Apple’s Foundation Models or attach approved third-party LLM providers.

App Intents and Spotlight integrations extend Siri AI personal context to third-party apps. View Annotations and on-screen-awareness APIs let apps participate in Siri’s screen-context flows without exposing raw screenshots to external model vendors.

Why it matters for engineers

Apple’s WWDC26 stack is a platform inference architecture, not a single model API. Builders should plan for dual execution paths: on-device Foundation Models for latency- and privacy-sensitive personal context, and PCC for frontier workloads (photorealistic image generation, broad world knowledge) with quota limits. This article covers the consumer Siri AI and developer framework launch; it is distinct from Apple’s PCC infrastructure expansion on Google Cloud NVIDIA hardware, which focused on attestation, fleet ledgers, and confidential-GPU hosting rather than Siri UX and App Intents.

Feature-detect against two hardware tiers before shipping voice or dictation features: the base Apple Intelligence list (iPhone 16+, M1+ Mac/iPad) differs from the advanced on-device model tier (M4+/M3+ with ≥12GB unified memory, iPhone 17 Pro family) required for expressive voices and advanced dictation.

Server-model daily caps and iCloud+ entitlements mean client apps must degrade gracefully when users exhaust allotments—Apple has not published numeric quotas, but photorealistic Image Playground and similar PCC-backed features are explicitly rate-limited. Enterprise Mac teams should plan fall GA as a coordinated OS 27 rollout with regional gates: EU iOS/iPadOS Siri AI is deferred whilst Mac and Vision Pro proceed.

For teams comparing hyperscaler assistants: Apple exposes no raw Gemini or Claude endpoint. Capabilities arrive through Foundation Models framework sessions and Siri AI system channels—simplifying privacy review but limiting custom prompt engineering relative to direct API integrations.

On-device Siri personal context versus Private Cloud Compute frontier models

Personal-context Siri workloads stay on Apple silicon; frontier models run in Private Cloud Compute without storing user prompts.

Intelligence routing at WWDC26

flowchart TB
  USER["User or app request"]
  LOCAL["On-device Foundation Models"]
  PCC["Private Cloud Compute"]
  ANS["Response to user"]
  USER --> LOCAL
  LOCAL -->|"personal context"| ANS
  LOCAL -->|"frontier workload"| PCC
  PCC --> ANS

Research supplement

Web search was unavailable during production of this post. The following notes flag external sources worth checking to deepen specific claims in the article — all URLs listed are from the author's own reference set and are not newly discovered sources.

  • PCC architecture and security model: Apple first published technical documentation on Private Cloud Compute at WWDC24 and via its security research blog. Readers seeking the external verification mechanism referenced in this article should consult Apple's current security documentation for any updates since the original 2024 PCC white paper.
  • SynthID watermarking: SynthID is Google DeepMind's AI content watermarking standard. Its appearance in Apple's Image Playground outputs is a direct consequence of the Gemini collaboration. DeepMind's public SynthID documentation would clarify the detection and verification process for watermarked outputs.
  • App Intents and Core AI framework evolution: The Core AI framework reference at developer.apple.com/documentation/coreai (author reference #3) is the authoritative current source for developer integration details; readers building for iOS 27 should treat this as primary documentation over any third-party summary.
---

References

Categories
News

Claude Fable 5 and Mythos 5: Anthropic Ships Mythos-Class Model With Opus Fallback Safeguards

Anthropic shipped Claude Fable 5 on 9 June 2026—a Mythos-class frontier model for general use with classifier fallbacks to Claude Opus 4.8 on sensitive cyber, biology, and distillation queries—alongside restricted Claude Mythos 5 access for Project Glasswing defenders and separate biology trusted-access programmes.

Short video walkthrough

Engineering walkthrough — ElevenLabs narration, HeyGen bookends, API vs claude.ai defaults, and official Anthropic B-roll (~6 min).

FieldDetail
DateGeneral availability 9 June 2026
VendorAnthropic
ProductsClaude Fable 5 (GA); Claude Mythos 5 (Glasswing cyber partners only)
API model IDclaude-fable-5 (Mythos 5 has no general API ID)
AvailabilityAPI and consumption-based Enterprise: full access from launch; claude.ai and third-party surfaces; subscription plans staged through 22 June 2026
Pricing$10/M input tokens, $50/M output tokens (less than half Mythos Preview)
Subscription windowIncluded on Pro, Max, Team, and seat-based Enterprise through 22 June 2026; usage credits from 23 June until capacity allows reinclusion
SafeguardsCyber, bio/chem, and distillation classifiers route to Opus 4.8 with user notification; triggers in <5% of sessions on average (>95% run Fable with Mythos-equivalent performance)
Data retention30-day retention on Mythos-class business traffic (first- and third-party surfaces); not used for training; human access logged

What changed

  • Claude Fable 5 is Anthropic’s first Mythos-class model generally available, with state-of-the-art scores on software engineering, knowledge work, vision, and long-horizon agent benchmarks—lead grows as tasks become longer and more complex per the launch post.
  • New safety classifiers extend constitutional-classifier work: cyber (exploitation plus offensive agentic hacking), biology/chemistry (broad fallback during launch), and distillation (large-scale capability extraction) all route flagged prompts to Claude Opus 4.8 instead of refusals.
  • Claude Mythos 5 shares Fable 5 weights with cyber safeguards lifted for existing Project Glasswing partners upgrading from Mythos Preview; comparable or stronger performance at substantially lower cost.
  • Biology trusted access (separate from Mythos 5) will offer Fable 5 with bio/chem classifiers removed but cyber classifiers still active to a small life-sciences cohort—broader enrolment planned as safeguards narrow.
  • Pricing halved versus Mythos Preview on API and consumption-based Enterprise plans.
  • 30-day retention is required for Mythos-class business traffic to detect novel jailbreaks; data deleted after 30 days with logged human access (Anthropic support article).
  • Red-team validation: external bug bounty reported no universal jailbreak in 1,000+ hours; zero compliance on harmful single-turn cyber requests across 30 public jailbreak techniques in partner testing.
  • Subscription rollout is demand-sensitive: included at no extra cost on paid Claude plans through 22 June 2026, then usage credits until capacity stabilises.

Capability evidence for builders

  • Software engineering: Stripe reported a 50-million-line Ruby migration in one day (versus an estimated two-plus months manually); Cognition’s FrontierCode ranks Fable 5 highest among frontier models at medium effort with improved token efficiency.
  • Knowledge work: highest score on Hebbia’s Finance Benchmark; IMC reported near-perfect trading-analysis results across factual lookup, root-cause analysis, and expected-value reasoning.
  • Vision: state-of-the-art on vision tasks; completed Pokémon FireRed vision-only without navigation harnesses that prior Claude models required.
  • Memory: on Slay the Spire agent runs, file-based memory produced threefold improvement versus Opus 4.8 and threefold higher final-act completion rates.
  • Alignment: automated assessments place Mythos 5 misaligned behaviour similar to Opus 4.8 per the system card.

Why it matters for engineers

Teams wiring production agents must treat Fable 5 as a two-model endpoint: more than 95% of sessions never trigger fallback, but cyber-hardening, bioinformatics, or suspicious bulk-extraction patterns transparently downgrade to Opus 4.8 with user notification. Log response metadata and surface fallback events to operators—latency and capability profiles differ, and conservative classifier tuning means benign security research queries can still trip safeguards during the launch window.

The API and consumption-based Enterprise path is the reliable integration surface from day one. Subscription inclusion is time-boxed and demand-sensitive; capacity planning for long autonomous coding runs should prefer metered API tiers. Mythos 5 remains outside general API access—cyber defenders need Glasswing or a future trusted-access application; biology researchers follow the separate Fable-without-bio-classifiers programme.

Long-context and file-backed memory improvements matter for multi-hour agent loops: Fable 5 sustains focus across millions of tokens and benefits disproportionately from persistent notes versus Opus 4.8. Vision-only harnesses now complete screenshot-to-code and scientific-figure extraction tasks that previously required scaffolding.

Regulated workloads must account for 30-day Mythos-class retention on business traffic, logged human access to stored prompts, and explicit prohibition on training use. Benchmark harnesses that resemble distillation attacks may trigger classifiers—design eval pipelines to tolerate Opus 4.8 fallbacks or isolate test traffic from production API keys.

Frontier model with automatic safe fallback when classifiers route sensitive queries to Opus 4.8

Most Fable 5 sessions run at full frontier capability; cyber, biology, and distillation classifiers route sensitive prompts to Opus 4.8 instead of blocking.

Classifier fallback in production

flowchart LR
  REQ["Agent or app request"]
  CLS["Safety classifiers"]
  FABLE["Fable 5 response"]
  OPUS["Opus 4.8 fallback"]
  OUT["Answer delivered"]
  REQ --> CLS
  CLS -->|"typical workload"| FABLE
  CLS -->|"cyber bio distillation"| OPUS
  FABLE --> OUT
  OPUS --> OUT

Research supplement

Web search was not available in this environment. The following context is drawn from the article and linked reference materials only.

The classifier-fallback approach described in Fable 5 relates to broader AI safety literature on output filtering versus refusal. Anthropic's published safety work (ASL-3 and higher commitments) has flagged cyber and CBRN (chemical, biological, radiological, nuclear) as priority dual-use categories — the three Fable 5 classifier domains (cyber, bio/chem, distillation) map directly onto these commitments. The system card cited in the article (claude-fable-5-mythos-5-system-card) is the primary source for evaluating classifier accuracy claims independently.

Project Glasswing is described at anthropic.com/glasswing as a defenders-focused initiative; the article does not reproduce its full scope. Engineers evaluating Mythos 5 access should consult that page directly for enrollment criteria.

The API model ID (claude-fable-5) and current pricing are listed in Anthropic's models overview at platform.claude.com/docs/en/about-claude/models/overview, which is the authoritative source for integration and should be checked against the article's stated rates before capacity planning.

References

Categories
News

Google Colab CLI: Provision GPUs and Run Scripts from Your Terminal

Google Colab CLI turns Colab from a browser-only notebook into a programmable remote runtime you drive from your terminal — provision a T4 or A100, pipe a local .py file to a Jupyter kernel in the cloud, pull checkpoints back, and tear the VM down, without opening a tab. Google shipped it in June 2026 as an agent-ready bridge between local dev machines and Colab compute.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph LR
  T[Local terminal] -->|colab new / exec| API[Colab assign API]
  API -->|runtime proxy token| VM[Remote Colab VM]
  VM --> K[Jupyter kernel]
  K --> GPU[GPU or TPU]
  VM -->|colab download| A[Local artifacts]
  API -->|keep-alive 60s| VM

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff

  class T,A agent
  class API,VM,K,GPU hook

What problem it solves

Before the CLI, Colab meant: open a notebook in Chrome, click Connect, upload files manually, and babysit the runtime. That breaks down for shell pipelines, CI-style jobs, and coding agents that only speak bash. The CLI exposes the same rented VMs through commands like colab new --gpu T4, colab exec -f train.py, and colab run --gpu T4 train.py — a one-shot provision → execute → teardown path.

Google’s launch post positions it for both humans and agents: any tool with terminal access (Claude Code, Codex, Antigravity, etc.) can provision accelerators, install packages with uv, run local scripts remotely, export replayable .ipynb logs, and download weights — without writing cloud provisioning code yourself.

How the architecture works

LayerWhat it doesWhere it lives
CLI (Typer)Commands, session names, authYour Mac or Linux machine
Assign APIAllocate VM, return endpoint + proxy tokencolab.research.google.com/tun/m/assign
Keep-alive daemonPing every 60s; 24h capDetached local process per session
Jupyter kernelExecute Python via WebSocketRemote VM (/content cwd)
Contents APIUpload/download/list filesSame VM via Jupyter HTTP
Local stateSession metadata, kernel id~/.config/colab-cli/sessions.json

Important detail: colab exec -f script.py reads the file locally and sends source to the kernel — you do not need a separate upload step for execution. Use colab upload / colab download for datasets, checkpoints, and zips.

Install and authenticate

# Recommended
uv tool install google-colab-cli

# Or pip (requires Python 3.13+)
pip install google-colab-cli

# Quick smoke test
colab new
echo "print('Hello from Colab')" | colab exec
colab stop

Two auth layers matter:

  • CLI → Colab control plane--auth oauth2 (browser flow, token in ~/.config/colab-cli/token.json) or --auth adc (Application Default Credentials — preferred for agents).
  • VM → GCP servicescolab auth inside a session for BigQuery/GCS; separate from CLI login.
# Agent-friendly ADC setup (all required scopes)
gcloud auth application-default login \
  --scopes=openid,\
https://www.googleapis.com/auth/cloud-platform,\
https://www.googleapis.com/auth/userinfo.email,\
https://www.googleapis.com/auth/colaboratory

colab --auth=adc whoami
colab --auth=adc new -s my-job

colab new pre-flights scopes: if the colaboratory scope is missing, it unassigns the fresh VM and prints remediation — avoiding silent 403s mid-job.

Command map

GroupCommandJob
Sessioncolab new [-s NAME] [--gpu GPU] [--tpu TPU]Allocate VM + start keep-alive
Sessioncolab sessions / colab status -s NAMEList / inspect hardware + IDLE/BUSY
Sessioncolab stop -s NAMEKill daemon, shutdown kernel, release VM
Sessioncolab url -s NAME [--open]Browser link to attach to CLI session
Executecolab exec [-s NAME] [-f FILE] [--output-image PATH]Run stdin, .py, or .ipynb
Executecolab repl / colab consoleInteractive Python or raw tmux shell
Executecolab run [--gpu GPU] [--keep] script.py [args]One-shot new + exec + stop
Filescolab upload / download / ls / rm / editJupyter Contents API wrappers
VM setupcolab install [-r requirements.txt] PKG...uv pip install --system (falls back to pip)
VM setupcolab drivemount / colab authDrive + GCP creds (interactive)
Logscolab log -o run.ipynbExport history as ipynb/md/jsonl
Agentcolab skillPrint bundled COLAB_SKILL.md

GPU and TPU options

FlagAcceleratorTypical use
(none)CPULight scripts, orchestration tests
--gpu T4NVIDIA T4Fine-tuning, inference smoke tests
--gpu L4NVIDIA L4Efficient inference/training
--gpu G4NVIDIA G4Graphics/ML workloads
--gpu A100NVIDIA A100Large-model training
--gpu H100NVIDIA H100Top-tier training (tier-gated)
--tpu v5e1TPU v5eTPU-native JAX/Flax jobs
--tpu v6e1TPU v6eNewer TPU slice

Accelerator access is subscription- and quota-gated. HTTP 400 on colab new --gpu X usually means no entitlement — fall back to T4 or CPU. Unrecognized --gpu values silently map to A100 in the client; spell GPU names exactly.

Built for coding agents

Five-step agent workflow with Colab CLI from provision to cleanup
The CLI ships COLAB_SKILL.md via colab skill — agents get session rules, safe commands, and ADC auth without scraping the README.

Google’s Gemma fine-tuning demo is the canonical agent pattern:

colab new --gpu T4
colab install transformers datasets peft trl bitsandbytes accelerate
colab exec -f finetune_run.py
colab download checkpoints/adapter ./adapter
colab log --output gemma_finetune_log.ipynb
colab stop

Agent-safe: new, stop, exec (piped/file), run, install, upload, download, log. Agent-unsafe (TTY): unpiped repl/console, auth, drivemount.

For parallel jobs, isolate state: colab --config /tmp/job-a.json new -s trainer-a. Always name sessions and call colab stop — idle VMs burn compute units even with keep-alive.

Shebang one-liners with colab run

#!/usr/bin/env -S colab run --gpu L4 --keep
import torch
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))

chmod +x script.py && ./script.py provisions a fresh VM, runs the script with forwarded sys.argv, propagates exit codes, and tears down unless --keep is set. CLI status messages go to stderr; script stdout stays clean for piping.

Three workflows that cover most jobs

1. Training with checkpoint pull

colab new -s trainer --gpu A100
colab install -s trainer torch transformers
colab exec -s trainer -f train.py
colab download -s trainer checkpoints/model.bin ./model.bin
colab stop -s trainer

2. Local notebook on cloud kernel

colab new -s analysis
colab exec -s analysis -f report.ipynb   # writes report_output.ipynb locally
colab log -s analysis -o execution_log.md
colab stop -s analysis

3. Fire-and-forget GPU job

colab run --gpu T4 train.py --epochs 3 --lr 1e-4

Hybrid tip: colab url -s NAME --open attaches the browser UI to a CLI-provisioned VM — explore in the notebook, automate in the shell.

CLI vs browser-only Colab

Browser notebookColab CLI
InterfaceCells, widgets, plots inlineTerminal, scripts, CI, agents
Session startConnect button in tabcolab new / colab run
Keep-aliveBrowser activity (~90 min idle)Detached daemon (24h cap)
File syncManual upload UIupload/download + exec without upload
AutomationLimited headlessNative pipelines, shebang, agent loops
Agent pathColab MCP Server (in-notebook)COLAB_SKILL.md + bash tools
PlatformAny browserLinux and macOS only (no Windows yet)

Limits and footguns

ConstraintImpactMitigation
Python 3.13+ requiredOlder system Python won’t installUse uv tool install
Compute unitsBillable while VM runscolab stop; use run for ephemeral jobs
Default exec timeout 30sLong training may look “hung”Pass --timeout on exec/run
Kernel persistsState leaks between exec callsrestart-kernel or fresh session
Interactive commandsBlock agentsPipe stdin or use exec -f
GPU quota400 on assignFall back CPU/T4; check colab pay

Performance summary

DimensionBefore CLIWith Colab CLI
Provision GPUBrowser connect + UI clickscolab new --gpu T4 from shell
Run local code remotelyUpload + paste cellscolab exec -f script.py
One-shot jobsManual lifecyclecolab run or shebang
Agent integrationCustom Selenium / MCP onlyBundled COLAB_SKILL.md
Artifact recoveryManual download UIcolab download + colab log
Headless keep-aliveTab must stay open60s daemon, no browser
Packagepip in cellscolab install via uv
Latest releasev0.5.9 (PyPI, Jun 2026)

Research supplement

Web search was unavailable in this environment. The research supplement is left empty pending external verification of specific Colab CLI documentation, authentication details, and quota behaviour.

---

References

Categories
News

Loop Engineering: Design Coding-Agent Systems Instead of Prompting Every Turn

Loop engineering means you stop being the person who types every prompt to a coding agent — and start designing a small system that discovers work, delegates it, checks it, remembers progress, and repeats. The leverage moves from prompt craft to loop design: six primitives that now ship inside tools like Claude Code and the Codex app instead of bespoke bash you maintain forever.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  subgraph Stack["Three layers"]
    H[Harness engineering] --> L[Loop engineering]
    L --> O[Orchestration layer]
  end
  H -->|one agent runtime| T[Tools memory sandbox]
  L -->|schedule + verify| P[Six primitives]
  O -->|fleet + PR lifecycle| R[Reactions state machine]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class H,L,O agent
  class T,P,R hook

Where the conversation landed in 2026

The shift is no longer niche. Boris Cherny, who leads Claude Code at Anthropic, described it on the Acquired podcast as: “I don’t prompt Claude anymore. I have loops running that prompt Claude and figure out what to do. My job is to write loops.” Peter Steinberger put the same idea on X: “You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.” Both are saying the human job moved up one floor — from typing each turn to designing feedback systems.

That floor has three names in practice. Harness engineering is the runtime around one agent (tools, memory, permissions). Loop engineering is the harness that runs on a schedule, spawns helpers, and feeds itself from disk. Orchestration is the layer above when you need fleets of agents across worktrees, PRs, and CI — with automatic routing of failures back to the right session.

The universal five-stage cycle

Five stages of a coding agent loop: discover plan execute verify iterate
Every serious loop — single agent or fleet — runs the same cycle until a verifiable stop condition holds.
StageWhat happensTypical tooling
DiscoverFind work: CI failures, issues, diffs, inboxAutomations, /loop, triage skills
PlanBreak goal into steps with constraintsSkills, VISION.md, spec sub-agent
ExecuteEdit code, run tools, open PRsWorktrees, MCP connectors
VerifyPush against objective signals — not model opinionTests, lint, /goal evaluator, critic sub-agent
IterateFix gaps and loop againStop hooks, reactions, state file

A prompt gives instructions for one turn. A loop gives a job: discover → plan → execute → verify → iterate until done. You set the goal; the loop runs itself.

Open loops vs closed loops

Open loopClosed loop
NatureExploratory; wide search spaceBounded path you designed
RiskToken burn; “slop machine” without gatesCheaper; predictable
NeedsLarge budget + strong evaluatorsClear goal, defined steps, stop condition
Start here?Research spikes, benchmarksProduction coding, triage, migrations

Closed loops need five ingredients on disk: goal (precise done), context (VISION.md, ARCHITECTURE.md, RULES.md), action (scoped tools), feedback (tests, lint, structured errors), and a stop condition (/goal text, Stop hook, or orchestrator brief). Without a quality gate, AI drifts; with one, it improves.

Single-agent loop vs fleet loop

Single-agent loopFleet loop
ShapeOne brain runs discover→verify end-to-endOrchestrator splits work across specialists
Good forFocused refactors, /goal migrationsLarge features, parallel PRs, research→build→QA chains
Token profile~50K–200K tokens per medium coding task~500K–2M+ when orchestrator + 3+ specialists run
Example splitExplore → implement → verify sub-agentsResearch specialist → engineering specialist → QA specialist, each with its own loop

What changed in agentic development

For roughly two years, “good AI coding” meant writing strong prompts and feeding enough context each turn. You typed, read, typed again — the agent was a power tool and you held the handle every step.

Loop engineering is the next layer: a recursive goal where you define purpose and done, and the system iterates until a verifiable condition holds. You design once; the loop pokes agents on a schedule or across turns. This sits one floor above agent harness engineering (the environment one agent runs in) and the factory model (the system that builds software) — same family of ideas, but the harness now runs on a timer, spawns helpers, and feeds itself from disk-based memory.

The six primitives every loop needs

Six building blocks of a coding agent loop
Five action primitives plus persistent state — the shape is the same across major coding-agent products.
#PrimitiveJob in the loopWithout it
1AutomationsScheduled discovery and triageYou manually check CI, issues, and diffs
2WorktreesIsolate parallel agent checkoutsTwo agents overwrite the same files
3SkillsProject knowledge on disk (SKILL.md)Agent re-guesses conventions every run
4Connectors (MCP)Issues, DB, Slack, staging APIsAgent only sees the filesystem
5Sub-agentsSeparate maker and checker rolesOne model grades its own homework
6State / memoryMarkdown, Linear board, AGENTS.mdModel forgets between runs; loop restarts blind

The agent forgets; the repo does not. Long-running loops depend on external state — not context window — to remember what was tried, what passed, and what is next. Common context files beyond SKILL.md: VISION.md (what success looks like), ARCHITECTURE.md (stack and layout), RULES.md (forbidden actions), GUARDRAILS.md (always-on checklists), and AGENTS.md (repo map for agents).

Codex app vs Claude Code — same shape, different names

PrimitiveCodex appClaude Code
AutomationsAutomations tab: project, prompt, cadence, local or worktree env; Triage inbox; thread vs standalone runs/loop, Desktop scheduled tasks, Cloud Routines (/schedule), hooks, GitHub Actions
WorktreesBuilt-in per threadgit worktree, --worktree, isolation: worktree on subagents
SkillsSKILL.md, invoke with $name or /skillsSame SKILL.md folder format; bundled /loop, /code-review
ConnectorsMCP connectors + pluginsMCP servers + plugins; routine connectors on claude.ai
Sub-agentsTOML in .codex/agents/.claude/agents/, agent teams
StateMarkdown / Linear via connector; thread memoryAGENTS.md, progress files, prd.json-style task queues

Once you see the shared shape, the debate shifts from “which tool” to “which loop design still works in either seat.”

1. Automations — the heartbeat

Automations turn a one-off agent run into a loop. In the Codex app you configure project, prompt, schedule, and environment (local checkout or background worktree). Runs with findings land in a Triage inbox; empty runs archive themselves. Internal uses include daily issue triage, CI failure summaries, commit briefings, and regression hunts. Automations can call $skill-name so recurring logic stays maintainable.

Claude Code reaches the same outcome via /loop (interval reruns), cron scheduling, lifecycle hooks, Desktop scheduled tasks (persistent while app is open), Cloud Routines (runs when laptop is closed), or GitHub Actions for headless runs.

Interactive pick: /goal vs /loop vs Stop hooks

MechanismNext turn starts when…Stops when…Best for
/goal (Claude)Previous turn finishesSeparate evaluator model confirms condition (reads transcript only)Migrations, refactors, “all tests green”
/goal (Codex)Thread idle after turnEvidence in thread supports completion; pause/resume/clear/budgetMulti-hour tuning, benchmarks, long refactors
/loopTime interval elapsesYou stop it or agent decides donePolling deploys, periodic summaries, PR babysitting
Stop hookPrevious turn finishesYour script, prompt hook, or agent hook decidesRalph-style loops, org-wide completion rules
# Claude Code — run until tests and lint are clean (v2.1.139+)
/goal all tests in test/auth pass and the lint step is clean

# Check spend and evaluator reasoning
/goal

# Stop early
/goal clear

# Headless single invocation
claude -p "/goal CHANGELOG.md has an entry for every PR merged this week"

# Codex — long-running performance goal (cookbook pattern)
/goal Reduce p95 checkout latency below 120 ms, verified by the checkout benchmark,
while keeping the correctness suite green. If blocked, stop with evidence.

/goal on Claude Code starts a turn immediately; after each turn Haiku (by default) judges yes/no from the transcript — it does not run tools. Codex /goal is thread-scoped with explicit budget accounting and pause/resume. Pair either with auto mode so each turn skips per-tool confirmations.

2. Worktrees — parallel without collisions

Two agents editing the same file is the same failure mode as two engineers on one branch without coordination. A git worktree is a separate working directory on its own branch, sharing history but not files. Codex threads use worktrees natively; Claude Code offers --worktree sessions and isolation: worktree on subagents that clean up after themselves.

Worktrees remove mechanical collision; your review bandwidth still caps how many parallel agents you can actually supervise.

3. Skills — stop paying intent debt every session

Agents start cold. Every missing convention becomes a confident guess — intent debt. A skill is intent written outside the chat: a folder with SKILL.md, optional scripts, references, and assets. Both Codex and Claude Code load skills when you invoke $name or when the task matches a tight, boring description (clever descriptions match too often).

# Example skill layout
my-project-skill/
  SKILL.md          # conventions, build steps, forbidden patterns
  scripts/
  references/

Skill vs plugin: the skill is the authoring format; a plugin bundles skills and connectors for teammates to install once.

4. Connectors — act in your real environment

MCP connectors let the loop read Linear/Jira, query databases, hit staging APIs, and post to Slack. That is the difference between “here is the fix” and “open the PR, link the ticket, ping the channel when CI is green.” Plugins package connectors with skills so onboarding is one install, not tribal memory.

Feedback signals that keep loops honest

Hierarchy of agent loop feedback signals from tests to self-critique
A loop with nothing to push against is just the agent agreeing with itself — layer deterministic, perceptual, and critic signals.
Signal typeExamplesStrength
Deterministic oraclesCI, unit tests, type checks, linters, git diff, scalar metrics (e.g. benchmark p95)Strongest — pass/fail without model judgment
Perceptual / visualPlaywright, browser MCP tools, layout screenshotsMedium — catches UI regressions code tests miss
Critic sub-agentsSeparate reviewer agent; forces retry or stopMedium — judgment, but not the worker context
Persistent contextGUARDRAILS.md, skills, checklists loaded every runAlways-on oracle
LLM self-critique only“Does this look good?” from same modelWeakest — rationalises its own mistakes

Strongest systems stack multiple signal types: deterministic for reliability, visual/critic for judgment, human gates on high-stakes merges. Signals must route back automatically — full logs, diffs, scores — without you copy-pasting CI output each turn.

5. Sub-agents — maker vs checker

Maker agent and checker agent split in a coding loop
The highest-leverage split: implement in one agent, verify in another — including /goal’s separate done-evaluator.

The model that wrote the code is too lenient grading itself. A second agent — different instructions, sometimes a different model — catches rationalised mistakes. Typical trio: explore, implement, verify against spec. In fleet setups, a validator agent reports truth without fixing — failures loop back to the builder.

# Codex — custom subagent (simplified .codex/agents/security-reviewer.toml)
name = "security-reviewer"
description = "Read-only security pass on diffs"
instructions = "Find auth, injection, and secret-leak risks. No edits."
model = "strong"
reasoning_effort = "high"

Sub-agents cost extra tokens (each runs its own model + tools). Spend them where a second opinion unlocks unattended runs — the only reason you can walk away from a loop.

Orchestration — when one loop is not enough

Single-session /goal loops solve “finish this migration without me re-prompting.” Fleet-scale work needs an orchestration layer: deterministic plumbing plus an orchestrator agent for judgment.

LayerJobExamples
Deterministic plumbingRoute environmental feedback automaticallyCI fail → inject logs into worker session; PR conflict → notify right agent; lifecycle state machine (working → ci_failed → review_pending → merged)
Orchestrator agentDecompose goals, write briefs, batch parallel workResearch agent → spec → tracking issue → N workers in isolated worktrees
Human gatesVision, acceptance, high-risk mergesTriage inbox, PR approval — optimise human time, not remove humans

Open-source reference implementations like Agent Orchestrator (npm install -g @aoagents/ao) ship reactions engines, worktree isolation, and orchestrator prompts out of the box. The pattern: inner agents execute in bounded loops; outer orchestrator coordinates; environmental signals keep loops honest; you stay on vision and judgment.

Walkthrough: one morning triage loop

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
sequenceDiagram
  participant Auto as Morning automation
  participant Skill as Triage skill
  participant State as STATE.md
  participant WT as Worktree
  participant Maker as Fix sub-agent
  participant Check as Review sub-agent
  participant MCP as Connectors

  Auto->>Skill: Run on schedule
  Skill->>State: Write CI failures + issues
  loop Each actionable item
    Auto->>WT: Open isolated checkout
    WT->>Maker: Draft fix
    Maker->>Check: Submit diff
    Check-->>Maker: Approve or reject
    Maker->>MCP: Open PR + update ticket
  end
  Auto->>State: Log done / blocked for human inbox
  • 06:00 — Automation fires; triage skill reads yesterday’s CI, open issues, recent commits.
  • Findings — Written to STATE.md or a Linear board (memory outside the chat).
  • Per item — New worktree → maker sub-agent drafts fix → checker sub-agent runs against project skills + tests.
  • Ship — Connectors open PR and update tickets; blocked items land in your inbox.
  • Tomorrow — State file tells the loop what was tried, passed, or still open.

You designed this once. You did not prompt each step — that is the whole point.

Prompt engineer vs loop engineer

Prompt engineerLoop engineer
Crafts better instructions per turnDesigns feedback cycles and stop conditions
Linguistic skillSystems / software engineering skill
Better single outputReliable verified outcomes across runs
You review manually each timeSystem self-corrects against oracles
You are the feedback loopThe loop is the feedback loop
“Write me a function”“Write → test → fix until green”

Self-check: is your loop healthy?

QuestionHealthy loopLeaky loop
What proves “done”?Tests, lint, measurable condition in /goalAgent says “looks good”
Where does memory live?Repo file or issue trackerOnly in chat context
Who verifies?Separate sub-agent or evaluator modelSame agent that wrote code
What pushes back?Layered oracles (CI + critic + human gate)Self-critique only
Parallelism?One worktree per agentShared checkout
Token budget?Turn cap in condition or manual clearOpen-ended overnight /goal
Your role?Review merged outcomes you understandPress go and hope

What loops do not remove — three sharper risks

Verification stays human

An unattended loop is also an unattended mistake machine. Even with a verifier sub-agent, “done” is a claim, not proof. Ship code you confirmed works — especially when diff sizes balloon because agents touch more files than necessary.

Comprehension debt accelerates

The faster the loop ships code you did not write, the wider the gap between what exists and what you understand. Read the reasoning, skim the diff, trace the decision log — or the loop makes the debt grow faster, not slower.

Cognitive surrender

When automation feels smooth, it is tempting to stop having opinions. Loop design with judgement keeps you the engineer; loop design to avoid thinking is the same UI with opposite outcomes. Two teams can run identical loops — one moves faster on work they deeply understand; the other outsources understanding entirely. The loop cannot tell the difference. You can.

Parallel pattern: scheduled content factories

The same week loop engineering went mainstream for coding, creators published parallel “factory” playbooks for media. @0x_fokki’s X Article I Built an AI Animation Factory That Runs 24/7 is not a coding-agent harness — Claude is used as a scriptwriter, not a repo editor — but it shows the same design move: stop hand-driving each step, design a pipeline that runs on a schedule with human approval gates.

Coding loop and content factory share the same scheduled pipeline shape
Same loop instinct in two domains — you design the system and the gates, not every intermediate prompt.

Fokki’s pipeline chains six tools end-to-end:

Claude → Midjourney → Runway → ElevenLabs → Suno → Make
script → frames → motion → voice → music → publish

One Make scenario runs Monday and Thursday at 08:00: pull scripts from Google Drive, batch Midjourney scene prompts, download frames, send dialogue to ElevenLabs, pair images with Runway motion clips, assemble in a CapCut template, upload to YouTube with generated metadata, clip a 30-second X preview, post Patreon early access, and ping Telegram on completion. A separate on-demand webhook turns client briefs into finished explainers in shared Drive — quoted turnaround ~6 hours after a one-time ~5-hour setup.

Four SKUs share the pipeline: animated story series (6–10 min), brand explainers (60–90 sec), motion comics, and children’s bedtime channels. The human job is narrow: pick the story, pick the style, approve the output — roughly four hours of direction for a “24/7” factory, per the author.

Loop-engineering primitiveFokki factory analogueKey difference
AutomationsMake.com schedule + webhookNo /goal or hooks — cron-style triggers only
Skills / context on diskReusable Midjourney character sheets, CapCut templates, voice cast notesCreative consistency prompts, not SKILL.md
Sub-agent splitTool specialization per stage (script vs frames vs motion)No verifier sub-agent — human approves final cut
ConnectorsDrive, YouTube, Patreon, Telegram APIsDistribution stack, not MCP issue trackers
Feedback signalViews, RPM, client acceptanceBusiness metrics — not CI, lint, or test gates
State / memoryOrganised Drive folders per episodeAsset library, not AGENTS.md

What transfers to coding loops

  • Scheduled heartbeat — the factory does not wait for you to open a chat; neither should triage or CI-repair loops.
  • Stage-specialised tools — one brain trying to script, illustrate, animate, and score is the creative version of one agent grading its own code.
  • Performance direction in prompts — Fokki writes ElevenLabs stage direction (pauses, volume drops), not raw dialogue paste; coding loops need equally explicit done conditions in /goal text.
  • Human gate on output — “approve the episode” maps to Triage inbox review and PR merge — optimise human time, do not remove judgment.
  • Setup once, run indefinitely — the Make scenario is the media equivalent of wiring automations + skills once, then letting the loop compound.

Treat revenue figures in social factory posts as illustrative, not audited benchmarks. The architectural lesson is stable: factories — code or content — are designed loops with explicit stages, schedules, and gates. Coding loop engineering just demands harder oracles (tests, type checks, diffs) because “shipped” is easier to fake than “sounds convincing.”

Token economics and balance

PatternApproximate token loadMitigation
Single-agent medium coding loop50K–200K per runTurn caps in /goal; cheaper model for explore/review
Fleet (orchestrator + 3 specialists)500K–2M+ per cycleBatch only parallelisable work; stuck detection
Scheduled daily automationMillions per week if always-onArchive empty runs; scope skills tightly
Sub-agents + /goal evaluatorMultiplicative per child sessionSpend sub-agents on high-risk paths only

Loops are not free — patterns diverge wildly if you are “token rich” vs “token poor.” Direct prompting still matters for ambiguity and architecture. Loops handle repetition; you handle judgement. The leverage point moved — it did not disappear.

Performance summary

DimensionPrompt eraLoop era
Your jobWrite each turnDesign discover → plan → execute → verify → remember
Core cycleAsk → answerFive stages until verifiable done
PrimitivesContext + prompt6 shared building blocks (both major tools)
Done signalYou decide to stop/goal evaluator, Stop hook, or environmental oracles
ScaleOne threadWorktrees + sub-agents + orchestration layer
FeedbackYour eyesLayered oracles — not self-critique alone
KnowledgeRe-explained each sessionSkills + VISION.md / AGENTS.md compound
Risk profileSlower, more oversightFaster, higher verification + comprehension debt
Bottom lineBuild the loop — stay the engineer who reviews what ships

Research supplement

The following documentation pages from the official Claude Code docs provide additional technical depth beyond the article's reference links:

  • Scheduled Tasks (/loop): The Scheduled Tasks reference details how /loop works alongside cloud Routines and Desktop scheduled tasks, including the full comparison table of scheduling options, jitter behaviour, seven-day expiry, and the loop.md customisation mechanism. Notably, dynamic /loop schedules can use the Monitor tool internally to stream background process output, avoiding polling entirely.
  • Agent Loop Architecture: The Agent SDK: How the agent loop works page documents the full turn-and-message lifecycle, context window management, automatic compaction, and how max_turns / maxBudgetUsd bounds apply. It also explains how subagents start with a fresh conversation context, which has direct implications for keeping loop context efficient over long runs.

Key technical detail not in the primary reference links: The /goal command is implemented as a session-scoped prompt-based Stop hook. This means developers who need evaluation logic beyond a short text condition (for example, running an actual script to verify state) can write a custom Stop hook instead — which gives them the same turn-by-turn evaluation model with full scripting power.

---

References

Categories
News

Anthropic Doubles Claude Cowork 5-Hour Limits Through July 2026

Anthropic doubled Claude Cowork’s five-hour session rate limits for Pro, Max, and Team subscribers from 5 June through 5 July 2026, leaving weekly caps and the shared quota across Claude products unchanged.

FieldDetail
DateAnnounced 5 June 2026; promotion through 5 July 2026
VendorAnthropic
ProductClaude Cowork (desktop knowledge-work agent)
AvailabilityClaude Pro, Max, and Team paid plans; Cowork only—not Claude Code or chat-specific boosts
Pricing / limits2× five-hour rolling session allowance; weekly usage cap static; quota shared with Claude.ai and Claude Code

What changed

  • Boris Cherny, who leads Claude Code at Anthropic, announced the promotion on 5 June 2026 via social post—no dedicated article appeared on the Anthropic newsroom index by 9 June 2026.
  • Claude Cowork five-hour rolling session limits are doubled for approximately one month, ending 5 July 2026.
  • Eligible plans: Claude Pro, Claude Max, and Claude Team.
  • The change applies to five-hour rate-limit windows only—Anthropic’s weekly usage cap is unchanged.
  • Claude Code and Claude.ai retain standard session limits; the promotion is Cowork-specific.
  • Subscription quota remains a shared pool across Claude surfaces—heavier Cowork bursts can still exhaust the weekly budget faster.

Why it matters for engineers

Anthropic meters paid plans with two leaky buckets: a five-hour rolling session window for burst fairness and a weekly cap for cost control. Doubling only the first bucket optimises long desktop agent runs—folder reorganisation, batch report generation, scheduled digests—without raising Anthropic’s weekly compute exposure. Teams scheduling Cowork jobs should treat the promotion as session headroom, not unlimited capacity.

Cowork is not the Claude API. It runs in the desktop app with filesystem and Office integration, autonomous loops, and user approval gates—ideal for knowledge-worker delegation, unsuitable for production services. Engineers should keep CI and production agents on API metering while pilots use Cowork inside the promo window for deferred “messy folder” projects Cherny highlighted.

Unified quota across Cowork, Claude Code, and web chat means platform leads need allocation policy. A seat running heavy Code sessions the same week as a doubled Cowork migration may hit the unchanged weekly ceiling before the session window resets. Monitor Settings → Usage for both progress bars before kicking off multi-hour agent tasks.

Enterprise admins already manage Cowork feature access and org spend caps separately from consumer tiers. Communicate the 5 July revert date so programme managers do not assume permanent 2× session limits in capacity plans.

Doubled five-hour Cowork usage window for Pro Max and Team plans

Anthropic doubled the five-hour Cowork usage bucket for eligible paid plans from 5 June through 5 July 2026 whilst leaving weekly caps unchanged.

Limit windows over the promotion

flowchart TB
  START["5 Jun 2026 promo starts"]
  SESSION["Five-hour rolling window resets continuously"]
  DOUBLE["Cowork session allowance 2x"]
  WEEKLY["Weekly cap unchanged"]
  SHARED["Shared pool: Cowork chat and Code"]
  END["5 Jul 2026 promo ends"]
  START --> DOUBLE
  DOUBLE --> SESSION
  SESSION --> SHARED
  SHARED --> WEEKLY
  WEEKLY --> END
  classDef agent fill:#8B0000,color:#fff
  classDef tool fill:#189AB4,color:#fff
  class DOUBLE agent
  class WEEKLY tool

Timeline view: session windows roll continuously and temporarily widen for Cowork; the weekly ceiling and cross-product pool stay fixed.

Research supplement

Web search and page fetch tools were not available during this session. No additional reputable sources beyond those provided by the author could be verified. The sections above draw exclusively on the article text and the three reference URLs supplied (claude.com/product/cowork, support.anthropic.com/en/articles/9797557-usage-limit-best-practices, claude.com/pricing).

References

Categories
News

Microsoft 2026 Work Trend Index: How Frontier Firms Orchestrate Human-Agent Teams

Microsoft’s 2026 Work Trend Index gives engineering leaders a vocabulary for human–agent collaboration and ships Copilot Cowork mobile, plugins, and Agent 365 so Frontier Firms can orchestrate work across Microsoft and third-party systems.

FieldDetail
Date5 May 2026 (report and product wave); third-party Cowork plugins from 12 May 2026
VendorMicrosoft
Product2026 Work Trend Index; Microsoft 365 Copilot; Copilot Cowork; Microsoft Agent 365
AvailabilityWTI report on WorkLab; Cowork on iOS and Android; native Fabric and Dynamics 365 plugins GA; federated connectors GA (HubSpot, LSEG, Moody’s, Notion)
Pricing / limitsReport is free; Copilot stack via existing M365 Copilot and E7 SKUs—no new price point in this release

What changed

  • Microsoft named four collaboration patterns—Author, Editor, Director, and Orchestrator—and argued leaders must match workstreams to the right pattern rather than defaulting every process to multi-agent orchestration.
  • The 2026 Work Trend Index analysed trillions of anonymised Microsoft 365 signals and surveyed 20,000 AI-using knowledge workers across ten countries (February–April 2026).
  • 49% of sampled Copilot chats support cognitive work; 58% of AI users produce work they could not a year ago, rising to 80% among Frontier Professionals.
  • Microsoft described a Transformation Paradox: 65% fear falling behind without AI, yet 45% prefer current goals over redesigning work, and only 13% feel rewarded for AI-driven reinvention.
  • Organisational factors—culture, manager support, talent practices—account for more than twice the reported AI impact of individual mindset (67% vs 32%).
  • Respondents map to five readiness zones: Frontier (19%), Blocked Agency (10%), Unclaimed Capacity (5%), Stalled (16%), and Emergent (50%).
  • Copilot Cowork Mobile launched on iOS and Android; native plugins for Dynamics 365 and Fabric are GA, with partner plugins (LSEG, Miro, monday.com, S&P Global Energy) rolling out.
  • Custom plugins let organisations codify internal workflows; federated Copilot connectors are GA in Researcher and Microsoft 365 Copilot Chat.
  • Microsoft Agent 365 is the control plane for governing, observing, and securing agents at scale, including visibility into local agents.

Why it matters for engineers

Platform teams often ship agents without changing incentives. The WTI data suggests most adoption friction is organisational, not model quality—skilled builders frequently land in Blocked Agency zones where legacy metrics punish workflow redesign. Pair agent rollouts with evaluation criteria that reward reinvention, not only throughput.

The four-pattern ladder is a practical safety taxonomy. Author and Editor modes suit low blast-radius tasks with human review on every artefact. Director mode needs job isolation, rollback, and audit trails. Orchestrator mode demands a control plane—Agent 365 in Microsoft’s stack—for connector scopes, identity, and exception routing. The same framing applies whether you build on Copilot or run Claude Code beside it.

Cowork’s plugin and connector model is the integration surface to design for: native first-party data (Fabric, Dynamics), packaged partner actions, and custom plugins for proprietary expertise. Federated connectors let agents read external knowledge without migrating data. That graph-of-connectors pattern is portable beyond M365.

Frontier Professionals—multi-step agent users who redesign workflows and publish team standards—are a benchmark for internal playbooks. They pause to allocate human versus AI work, deliberately practise skills without AI, and treat model output as draft material. Telemetry showing 49% of Copilot use in cognitive tasks suggests backlog priority belongs in analysis and synthesis features, not generic chat wrappers.

Human-agent operating model shift in Frontier Firms

Frontier Firms redesign work around human–agent teams: people set goals and own accountability whilst agents execute repeatable analysis and orchestration.

Readiness zones at a glance

flowchart LR
  subgraph lowOrg["Low organisational readiness"]
    ST["Stalled 16%"]
    EM["Emergent 50%"]
  end
  subgraph highOrg["High organisational readiness"]
    UC["Unclaimed capacity 5%"]
    FR["Frontier 19%"]
  end
  subgraph indiv["Individual capability"]
    LO["Low"]
    HI["High"]
  end
  BA["Blocked agency 10%"]
  HI --> BA
  BA --> lowOrg
  FR --> highOrg
  HI --> FR
  LO --> ST
  classDef agent fill:#8B0000,color:#fff
  classDef tool fill:#189AB4,color:#fff
  class FR agent
  class BA tool

Matrix view: Frontier sits where individual skill and organisational support reinforce each other; Blocked Agency is the engineering-heavy zone where talent outruns incentives.

Research supplement

Web search and external page fetches were not available during this session (permissions not granted), so no additional sources could be verified. The following are factual claims from the article that would benefit from independent corroboration if this supplement is expanded in a future pass:

  • The 67% vs 32% organisational/individual split — the WTI methodology appendix (available at aka.ms/2026WorkTrendIndexAnnualReport) should be consulted to confirm how these figures were derived from the survey data.
  • Agent 365 GA and Microsoft 365 E7 SKU details — pricing and availability can be verified against the Tech Community announcement at the reference URL provided by the author.
  • Federated connector GA status — HubSpot, LSEG, Moody's, and Notion connector availability can be confirmed via the Microsoft 365 Copilot release notes.

References

Categories
News

Apple Private Cloud Compute on Google Cloud: NVIDIA GPUs with Verifiable Privacy

Apple is extending Private Cloud Compute to Google Cloud NVIDIA GPU clusters so the heaviest Apple Intelligence workloads can run on third-party infrastructure without abandoning stateless, attestable privacy guarantees.

FieldDetail
Date9 June 2026 (Apple Security Research blog)
VendorApple — hosted on Google Cloud with NVIDIA and Intel silicon
ProductPrivate Cloud Compute (PCC) on Google Cloud for Apple Intelligence cloud inference
AvailabilitySummer 2026 preview with gradual ramp to full protection set; further detail at Confidential Computing Summit and in an updated PCC Security Guide
Pricing / limitsConsumer Apple Intelligence feature (no public API); security researchers gain binary inspection and bounty-programme access to research-mode nodes

What changed

  • PCC leaves Apple-only data centres. For the first time, Apple Intelligence cloud inference runs on Google Cloud systems, whilst Apple retains cryptographic control over which PCC software builds devices will trust.
  • New hardware trust stack. The implementation combines NVIDIA Confidential Computing GPUs, Intel CPUs with Trust Domain Extensions (TDX), and Google’s Titan security chip — replacing the Apple-silicon-only hosts used since PCC launched in 2024.
  • Foundation model collaboration. Apple worked with Google to apply Gemini-family techniques when building next-generation Apple Foundation Models; on-device tiers still handle lighter tasks, but agentic tool-use and complex reasoning target the cloud tier on NVIDIA hardware.
  • Supply-chain and attestation hardening. Apple maintains a cryptographically verifiable, append-only ledger of every Google Cloud machine in the PCC fleet. Components that could exfiltrate data if compromised are attested with at least two independent vendor roots of trust.
  • Architectural patterns carry over. Initial request parsing runs in a dedicated namespaced process; shared inference processes recycle on a short time-to-live; attested keys live in a separate confidential VM isolated from external inputs.
  • Transparency programme unchanged. PCC binaries remain published for public inspection, with research tooling and live research-mode nodes offered through the Apple Security Bounty Programme.

Why it matters for engineers

Confidential VMs and GPU encryption are now commodity cloud options. Apple’s claim is different: those primitives have not, until now, been composed into an end-to-end confidential inference pipeline that also ships public binaries and bounty-grade verification at global scale. PCC on Google Cloud is a reference for treating the entire stack — firmware through application code — as the trusted computing base, rather than trusting the guest VM boundary alone.

Platform teams building multi-tenant AI should study the operational patterns, not only the silicon. Stateless computation is enforced through short-lived inference workers and isolated parsers, reducing the blast radius if a host is misconfigured. Hardware inventory ledgers matter when you neither manufacture servers nor operate the facility: they convert supply-chain risk into auditable state. Dual roots of trust make it harder for a single vendor compromise to forge the entire attestation story.

For Apple Intelligence client engineers, the device-side contract is stable: only Apple-cryptographically-approved PCC releases execute, regardless of whether inference lands on Apple metal or a Google Cloud A3-class confidential GPU node. Preview ramp during summer 2026 means protection depth may converge over weeks — plan feature flags and telemetry accordingly until Apple declares parity with Apple-data-centre PCC.

Security researchers should watch the Confidential Computing Summit session and the forthcoming PCC Security Guide update for attestation quote formats, research-node access mechanics, and fleet geography. Until then, treat this announcement as architectural intent with preview availability, not a finished open inference API.

Apple PCC privacy envelope extended to Google Cloud NVIDIA confidential compute

Apple Private Cloud Compute extends its privacy envelope to Google Cloud nodes using NVIDIA confidential GPUs, Intel TDX, and Titan-backed attestation.

flowchart LR
    DEV["Apple device"]
    TRUST["Apple-approved PCC client"]
    NODE["Confidential cloud node"]
    GPU["Stateless GPU inference"]
    RESP["Encrypted response"]

    DEV --> TRUST
    TRUST --> NODE
    NODE --> GPU
    GPU --> RESP
    RESP --> DEV

    classDef agent fill:#8B0000,color:#fff
    classDef tool fill:#189AB4,color:#fff
    class NODE,GPU tool
    class DEV,RESP agent

References

Categories
News

Amazon Bedrock EU Cross-Region Inference: GDPR-Aligned Model Routing for Engineers

Amazon Bedrock now documents EU geographic cross-region inference profiles so teams in Europe can pool model capacity across Union Regions whilst keeping prompts and outputs inside a fixed EU routing boundary.

FieldDetail
Date26 May 2026 (AWS Machine Learning blog)
VendorAmazon Web Services
ProductAmazon Bedrock — Cross-Region Inference (CRIS), EU system-defined inference profiles
AvailabilityCommercial Bedrock Regions; EU profiles route only to EU destination Regions (with London and Zurich source exceptions per AWS rules)
Pricing / limitsNo separate routing fee; billed from source Region; global profiles offer ~10% savings on some models; inference profiles do not support Provisioned Throughput

What changed

  • Inference profile IDs replace plain model IDs. Applications opt into CRIS by passing system-defined profile strings such as eu.amazon.nova-2-lite-v1:0 (EU geographic) or global.amazon.nova-2-lite-v1:0 (global commercial) to Converse, InvokeModel, streaming APIs, batch jobs, Agents, and knowledge-base generation.
  • EU geographic profiles constrain destination Regions. All destinations in EU CRIS lie within the European Union. Requests from EU sources cannot be routed to non-EU commercial Regions whilst using an eu.* profile.
  • London and Zurich are special-cased. Sources in eu-west-2 may route among EU Regions plus London; eu-central-2 sources among EU Regions plus Zurich. Non-EU sources using EU profiles are optimised across the source Region and EU destinations only.
  • Geographic profile Region lists are static. AWS will publish a new inference profile ID rather than silently expanding an existing EU geography definition.
  • Audit fields ship in CloudTrail. Invocation metadata is logged in the customer source Region; additionalEventData.inferenceRegion records where Bedrock actually processed the request. Optional Model Invocation Logging keeps full payloads in the source Region only.
  • Compliance framing is explicit. The post ties CRIS to GDPR records-of-processing expectations, IAM least privilege, and Amazon Bedrock’s inclusion in the CISPE Data Protection Code of Conduct.

Why it matters for engineers

EU SaaS teams no longer choose between single-Region throttling and unaudited multi-Region sprawl. EU CRIS is a deliberate contract: your SDK client stays in a familiar source Region, but Bedrock may execute inference in another EU Region selected for capacity. Inter-Region traffic remains on the AWS private backbone with encryption in transit — a detail that matters when security reviewers ask whether prompts leave controlled networks.

The integration surface is small; the governance surface is not. IAM policies for geographic CRIS must grant bedrock:InvokeModel on the inference profile and on foundation-model ARNs in every destination Region listed for that profile, often conditioned on bedrock:InferenceProfileArn. Service Control Policies that block any destination Region in the profile will fail requests even when the source Region is allowed. Cross-Region inference can also target Regions you have not manually enabled — SCP design must allow the full destination set.

Operational teams should dashboard inferenceRegion alongside application metrics. That field supports data-protection impact assessments without enabling payload logging. When maximum throughput or ~10% cost savings outweigh residency constraints, global.* profiles remain available — but that is an explicit product decision, not a framework default.

Discover profiles via the Bedrock console cross-Region inference page, per-model Regional availability tables in the user guide, or list_inference_profiles(typeEquals='SYSTEM_DEFINED') from your source Region. Treat profile choice as architecture documentation: EU geographic for GDPR-aligned processing, global for performance-first workloads with accepted cross-border inference risk.

EU geographic inference profiles routing Bedrock requests within Union Regions

EU geographic Bedrock inference profiles keep prompts and outputs inside Union Regions whilst pooling capacity across EU destination Regions.

flowchart LR
    APP["App in source Region"]
    API["Bedrock runtime API"]
    ROUTER{"CRIS profile router"}
    DEST["Destination Region inference"]
    RET["Response to source Region"]

    APP --> API
    API --> ROUTER
    ROUTER --> DEST
    DEST --> RET
    RET --> APP

    classDef agent fill:#8B0000,color:#fff
    classDef tool fill:#189AB4,color:#fff
    class ROUTER tool
    class APP,RET agent

Research supplement

Web search was unavailable during production of this supplement; no additional external sources could be independently verified for this article. The CISPE Data Protection Code of Conduct certification status for Amazon Bedrock, referenced in the article, should be confirmed directly via the CISPE public register at cispe.cloud. The adequacy decision status for the UK and Switzerland under GDPR Article 45 — relevant to the London and Zurich source-Region edge cases — should be confirmed against current European Commission adequacy decisions, as adequacy status can be revoked or amended.

References

Categories
News

MiMo-V2.5-Pro-UltraSpeed: 1T Model at 1000 Tokens Per Second on Commodity GPUs

Xiaomi MiMo and TileRT shipped MiMo-V2.5-Pro-UltraSpeed, a trillion-parameter API tier that sustains roughly 1000 tokens per second decode on a single eight-GPU commodity node—aimed at agent builders who need frontier-scale models inside realtime loops.

FieldDetail
Date8 June 2026
VendorXiaomi MiMo + TileRT
ProductMiMo-V2.5-Pro-UltraSpeed API and trial chat
AvailabilityApplication window 9–23 June 2026 (Beijing time); API at platform.xiaomimimo.com/ultraspeed
Pricing / limits~3× MiMo-V2.5-Pro API price; ~10× decode speed; Token Plan not supported; chat trial capped (10 queues/day, 30 min/session)

What changed

  • 1000+ tps on 1T MoE. Xiaomi claims the first public trillion-parameter decode above 1000 tokens per second using one standard eight-GPU server, via model–system co-design rather than custom wafer or SRAM-only silicon.
  • Selective FP4 on experts. MoE expert matrices quantise to FP4 (MXFP4) with quantisation-aware training; routers and attention stay higher precision to protect reasoning and code quality versus naive full-model FP4.
  • DFlash speculative decoding. Block-level masked parallel drafting replaces serial draft-token generation; reported acceptance lengths reach ~6.3 (coding), ~5.6 (maths/reasoning), and ~4.3 (agent) tokens per verification round with block size eight.
  • TileRT ultra-low-latency stack. Persistent engine kernels and warp-specialised pipelines cut microsecond execution gaps that dominate at kilohertz decode rates.
  • Open weights. Hugging Face release XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash ships FP4 weights plus DFlash draft parameters for offline study.
  • Gated trial. Approved users get free chat at ultraspeed.xiaomimimo.com during the promotion; enterprise partnerships via business-mimo@xiaomi.com.

Why it matters for engineers

Latency redefines what a trillion-parameter model can do. Below roughly ten tokens per second, 1T MoE models sit behind batch jobs and human-tolerated waits. Near 1000 tps, the same weights can participate in parallel Best-of-N search, sub-minute codegen sessions, or millisecond think–act loops in trading, fraud, and clinical triage—without downsizing to a 70B shortcut model.

The architectural lesson is co-design: bandwidth-bound expert matmuls shrink with FP4, serial decode expands via DFlash acceptance, and TileRT removes per-operator launch tax. Teams self-hosting open weights can benchmark the HuggingFace checkpoint on vLLM or SGLang; teams buying API capacity should measure cost per successful agent task during the June trial, not headline tokens per dollar alone.

Treat UltraSpeed as a latency SKU on MiMo-V2.5-Pro, not a new foundation family. Trial pricing and slots end 23 June 2026 unless extended; plan production fallbacks if FP4 quality drifts on your longest agent traces.

FP4 and DFlash accelerating trillion-parameter MoE decode on commodity GPUs

MiMo UltraSpeed stacks FP4 expert quantisation, DFlash speculative decoding, and TileRT persistent GPU pipelines to deliver roughly 1000 tokens per second from a one-trillion-parameter MoE on commodity hardware.

flowchart LR A[Agent request] –> B[MiMo-V2.5-Pro 1T MoE] B –> C[FP4 expert matmuls] B –> D[DFlash draft block] C –> E[TileRT persistent kernels] D –> E E –> F[~1000 tps token stream]

Research supplement

Web search was unavailable during this drafting session. No external sources could be verified. Recommend checking the following primary sources directly for corroboration: the TileRT technical post at tilert.ai detailing the kernel architecture and benchmark methodology; the Hugging Face model card for XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash for QAT methodology and reported eval scores; and the OCP Microscaling Formats specification for MXFP4 format details. Any third-party reproduction benchmarks on vLLM or SGLang that emerge after 9 June 2026 would materially strengthen or challenge the throughput claims.

References

Categories
News

Open-Weight AI Release Week: 25+ Models Across LLMs, Image, Audio, Video, and 3D (June 2026)

Early June 2026 delivered one of the densest open-weight release windows on record — spanning chat models, image generation, speech, music, vision, video, and 3D. The roundup below maps 25+ notable drops across modalities, with specs drawn from official model cards and repos rather than hype alone.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  W[Open-weight release week] --> L[LLMs and MoE chat]
  W --> I[Image DiT checkpoints]
  W --> A[Audio TTS and ASR]
  W --> V[Vision VLMs and OCR]
  W --> M[Music and realtime audio]
  W --> X[Video world and 3D]

  L --> D[Deploy: MLX ONNX vLLM]
  I --> D
  A --> D
  V --> D
  M --> D
  X --> D

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff

  class W agent
  class L,I,A,V,M,X,D hook
Open-weight AI releases grouped by modality for one busy week
Release density by modality — LLMs, image, audio, vision, video, and 3D all shipped open weights in the same window.

Large language models and edge chat

ModelOrgKey specsWhy it matters
Nemotron 3 UltraNVIDIA550B hybrid Mamba–MoE; 55B active; 1M context; 89.1 MMLU; NVFP4 variant ~5× throughput on BlackwellFirst openly weighted 550B hybrid Mamba–Transformer; datacenter agentic scale with ~10% active params
Gemma 4 12BGoogleEncoder-free any-to-any (text/image/audio/video); 256k context; 140+ languages; AIME 2026 77.5; 23-checkpoint QAT wave (mobile ONNX + MLX)Most deployable multimodal open model of the week — laptop-class with Apache 2.0 weights
LFM2.5-8B-A1BLiquid AIEdge MoE; ~1.5B active; 128k ctx; MATH500 88.8; MLX-readyStrong on-device math/reasoning per active parameter
Mellum2-12B-A2.5B-ThinkingJetBrainsFirst open JetBrains MoE; 2.5B active (8 of 64 experts); 131k ctx; LiveCodeBench v6 69.9; Apache 2.0Near–Qwen3-14B coding quality at much lower active width for IDE/agent tooling

Links: Nemotron 3 Ultra · Gemma 4 12B · LFM2.5-8B-A1B · Mellum2 Thinking

Image generation — Ideogram 4 open weights

The surprise headline: Ideogram 4 shipped its first-ever open weights — a 9.3B flow-matching Diffusion Transformer (DiT) trained from scratch. Reported leaderboard placement: #2 overall behind GPT Image 2 on aggregate arenas, top open-weight on Design Arena and LMArena, with particular strength on text-rich layouts (posters, UI mockups, labelled diagrams).

PropertyIdeogram 4 open
Architecture9.3B DiT, flow matching, native 2K
Structured promptsJSON with bounding boxes and colour palettes
WeightsGated on Hugging Face (ideogram-ai/ideogram-4-nf4, FP8 variants)
License splitApache 2.0 code; non-commercial weight agreement (commercial path via Ideogram)

Link: Ideogram 4.0 technical blog · Hugging Face collection

Audio, speech, and music — four TTS labs in one week

ModelOrgHighlights
Higgs Audio v3 TTS 4BBoson AI100+ languages; inline emotion/style/prosody tags; singing/whisper/shout; sub-second time-to-first-audio; 8-codebook AR decoder + 24 kHz output
dots.ttsrednote hilab2B fully continuous AR TTS — no discrete codec tokens; 48 kHz AudioVAE; Qwen2.5-1.5B backbone; Apache 2.0
Magenta RealTime 2GoogleReal-time music generation; <200 ms latency; text + audio + MIDI conditioning; community PyTorch port with live ZeroGPU demos within hours
Nemotron-3.5 ASRNVIDIA600M streaming ASR; 17× more concurrent streams vs Parakeet RNNT 1.1B in NVIDIA benchmarks

Links: Higgs Audio v3 · dots.tts · Magenta RealTime 2 · Nemotron ASR via NVIDIA HF

Vision, VLMs, and document AI

ModelOrgHighlights
Step-3.7-FlashStepFun198B sparse MoE VLM; ~11B active; SWE-Bench PRO 56.3; Apache 2.0
PaddleOCR-VL-1.6PaddlePaddleSOTA document parsing at 1B params; Apache 2.0
NAVABaidu6.3B joint audio–video generation; strong A/V sync in reported evals; Apache 2.0

Video, world models, and 3D

ModelOrgHighlights
Cosmos3-SuperNVIDIA64B physical-AI omnimodel (32B reasoner + 32B generator); couples action trajectories with video+audio gen; OpenMDW 1.1 on Hugging Face
JoyAI-EchoJDUp to 5-minute multi-shot text-to-video on LTX-2.3 stack
Bernini-RByteDanceOpen video/reconstruction line (companion to VAST releases)
VAST TripoSplatByteDance VASTSingle-image → 3D Gaussian splats; MIT license

Links: Cosmos3-Super · NVIDIA Cosmos 3 blog · nvidia/Cosmos

Glossary — abbreviations from the roundup

TermMeaning
MoEMixture-of-Experts — sparse activation; only a subset of parameters run per token (powers many frontier chat models)
QATQuantization-Aware Training — train so weights compress cleanly to INT4/FP8 for phones and laptops
MMLUMassive Multitask Language Understanding — broad knowledge benchmark for LLMs
ONNXCross-platform model format common in production inference
MLXApple’s framework for running models on M-series chips
DiTDiffusion Transformer — transformer backbone inside modern image/video generators

How to navigate the flood

  • Laptop / phone: Gemma 4 12B QAT, LFM2.5-8B MLX, Mellum2 for coding agents.
  • Design & posters: Ideogram 4 open DiT for text-heavy layouts.
  • Voice products: Higgs v3 for expressive tags; dots.tts for fully continuous Apache 2.0 pipelines.
  • Robotics / sim: Cosmos3-Super for multimodal world + action reasoning.
  • Datacenter LLM: Nemotron 3 Ultra + NVFP4 on Blackwell for throughput.

Watch the week in 60 seconds

Video: demonstration from Niels Rogge on LinkedIn.

Release-week summary

MetricValue
Notable open-weight drops25+ across modalities
Largest LLMNemotron 3 Ultra — 550B total / 55B active
Most deployable multimodalGemma 4 12B — encoder-free, QAT/MLX wave
Biggest image surpriseIdeogram 4 — first open 9.3B DiT weights
TTS breakout4 labs (Higgs, dots.tts, Magenta RT2, Nemotron ASR)
Physical AI flagshipCosmos3-Super — 64B omnimodel
Fastest streaming ASR claimNemotron-3.5 ASR — 17× streams vs Parakeet 1.1B

Research supplement

Web search was unavailable during drafting of this post. The seven highlighted models are grounded in the author's provided reference links (Hugging Face model pages, official blogs, and GitHub repositories). No additional verified external sources could be confirmed for this supplement. Readers wishing to verify benchmark comparisons, licence terms, or capability claims should consult the original Hugging Face model cards and the official blog posts linked in the article's reference section directly.

---

References