Categories
News

Gemini Interactions API Explained: Google’s Primary Interface for Models and Agents

Google’s Gemini Interactions API reached general availability on 22 June 2026 and is now the primary interface for Gemini models and agents — replacing generateContent as the default in Google AI Studio, official docs, and new frontier agent features.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    A[Developer] --> B[Interactions API]
    B --> C{model or agent?}
    C -->|model ID| D[Gemini inference]
    C -->|agent ID| E[Managed Agent sandbox]
    D --> F[Typed steps timeline]
    E --> F
    B --> G[background=true]
    G --> H[Async Deep Research / agents]
    F --> I[previous_interaction_id]
    I --> J[Server-side state 55-day retention]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class B hook
    class E agent
    class D agent
    class H agent
    class C decision

What GA means for Gemini developers

The Interactions API launched in public beta in December 2025. GA brings a stable schema, Managed Agents with remote Linux sandboxes, background execution, typed execution steps, Flex/Priority service tiers, and documentation that defaults to Interactions everywhere. generateContent remains supported for mainline models, but long-running agent capabilities will increasingly ship exclusively on Interactions.

| Dimension | generateContent (legacy) | Interactions API (GA) |
| --- | --- | --- |
| State | Stateless; client manages history | Server-side by default via previous_interaction_id |
| Response shape | candidates[].content.parts | Stored interaction with steps[] timeline |
| Agents | Manual orchestration | Pass agent ID: Antigravity, Deep Research, custom |
| Long tasks | Client polling workarounds | background=true native async |
| Streaming | Separate :streamGenerateContent endpoint | Same endpoint with stream: true |

One endpoint: model inference or managed agent

The core design: pass a model ID for inference or an agent ID for autonomous multi-step work. Same SDK call shape, same endpoint (POST /v1beta/interactions).

from google import genai

client = genai.Client()

# Talk to a model
interaction = client.interactions.create(
    model="gemini-3.5-flash",
    input="Explain quantum entanglement simply.",
)

# Run a managed agent
interaction = client.interactions.create(
    agent="antigravity-preview-05-2026",
    input="Plot solar energy growth globally and make HTML slides.",
    environment="remote",
)

[Figure: Diagram showing one Interactions API endpoint accepting either a model ID or agent ID]
Model for inference, agent for autonomous tasks — one unified API surface.

Key GA capabilities since December beta

| Feature | What it does |
| --- | --- |
| Managed Agents | One API call provisions a remote Linux sandbox: reason, execute code, browse the web, manage files |
| Background execution | background=true: the server runs the interaction asynchronously; cancel via /interactions/{id}/cancel |
| Tool mixing | Combine Google Search, Google Maps, and custom functions in one request; tool results can return images |
| Deep Research | Speed vs depth agent versions, collaborative planning, native charts/infographics, multimodal grounding (images, PDFs, audio) |
| Media generation | Nano Banana 2 images with Image Search grounding, Lyria 3 music, multi-speaker TTS |
| Steps schema | Typed steps replace role-based messages; observable execution for UI and debugging |
| Flex / Priority tiers | Flex = ~50% cost reduction; Priority = lower latency |
| Retention | Past interactions retrievable; 55-day retention on paid tier |
| Gemini Omni | Announced as coming soon on Interactions |

Managed Agents and the Antigravity sandbox

antigravity-preview-05-2026 is the default managed agent. A single interactions.create() with environment="remote" provisions a sandbox where the agent reasons, runs code, browses the web, and manages files. You can also define custom agents with instructions, skills, and data sources. Other built-in agents include Deep Research variants (deep-research-preview-04-2026, deep-research-max-preview-04-2026) and Computer Use (gemini-2.5-computer-use-preview).

[Figure: Managed agent architecture with remote Linux sandbox for code execution and web browsing]
Managed Agents provision a remote sandbox — Antigravity ships as the default agent.

Background execution for long-running work

Set background=true on any interaction. The server processes asynchronously — ideal for Deep Research, Deep Think, and multi-step agent tasks. Poll or stream status; retrieve the completed interaction when done. Cancel in-flight background jobs with the cancel endpoint.

interaction = client.interactions.create(
    agent="deep-research-max-preview-04-2026",
    input="Research the state of solid-state batteries in 2026.",
    background=True,
)

# Poll until status is complete, then read interaction.steps

[Figure: Background execution flow from async request to polling and result retrieval]
Long-running agent work offloads to the server — no client-side babysitting required.
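
The poll-then-retrieve flow described above can be sketched as a small helper. This is a minimal sketch, not the SDK's own API: `wait_for_interaction`, the injected `fetch` callable, and the status strings `"completed"`, `"failed"`, and `"cancelled"` are assumptions for illustration. Check the Interactions API reference for the real retrieval call and status values.

```python
import time
from typing import Any, Callable

# Assumed terminal status names; the real API may use different strings.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_interaction(
    fetch: Callable[[], Any],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch() until the returned object reports a terminal status.

    fetch is any zero-argument callable returning an object with a
    `status` attribute, for example a wrapper around whatever call
    retrieves one interaction by ID.
    """
    waited = 0.0
    while True:
        interaction = fetch()
        if interaction.status in TERMINAL_STATUSES:
            return interaction
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {interaction.status!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing `sleep` as a parameter keeps the helper testable without real delays; in production the default `time.sleep` applies.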

From roles to typed steps

The biggest schema change: instead of role: user / role: model message blobs, every action is a typed step, such as user_input, thought, function_call, or model_output. Streaming emits step.start, step.delta, and step.stop events. This makes agent UIs, debugging, and intermediate rendering (search widgets, thoughts) straightforward. As an SDK convenience, interaction.output_text extracts the final text; complex multimodal responses require iterating over steps.

[Figure: Comparison of legacy role-based messages versus typed execution steps schema]
Every action is an observable typed step — not buried inside role-based message arrays.
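
To make the schema change concrete, here is a minimal sketch of walking a steps timeline. The step dictionaries are hand-written stand-ins: the step type names come from the article, but the `text`, `name`, and `arguments` field names are assumptions for illustration, not the documented response shape.

```python
from collections import Counter

# Hand-written stand-in for an interaction's steps[] timeline.
# Field layout is assumed; only the step type names come from the article.
steps = [
    {"type": "user_input", "text": "What is 2 + 2?"},
    {"type": "thought", "text": "Simple arithmetic, no tool needed."},
    {"type": "function_call", "name": "calculator", "arguments": {"expr": "2 + 2"}},
    {"type": "model_output", "text": "2 + 2 = 4."},
]


def final_text(timeline):
    """Join the text of every model_output step, in order."""
    return "".join(s["text"] for s in timeline if s["type"] == "model_output")


def step_counts(timeline):
    """Count steps by type, e.g. to drive a debugging view."""
    return Counter(s["type"] for s in timeline)


print(final_text(steps))
print(dict(step_counts(steps)))
```

Because each step carries its own type, a UI can render thoughts, tool calls, and output differently with a simple dispatch on `type` instead of parsing role-based message arrays.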

Server-side state and multi-turn chat

Interactions store server-side by default (store=true). Continue conversations with previous_interaction_id instead of resending full history. Opt into stateless mode with store=false when you manage history client-side. Paid tier retains interactions for 55 days.

# Continue a conversation
follow_up = client.interactions.create(
    model="gemini-3.5-flash",
    input="Now explain it for a 10-year-old.",
    previous_interaction_id=interaction.id,
)

Flex vs Priority service tiers

Per-interaction tier selection: Flex cuts cost by roughly 50% for batch and non-latency-sensitive workloads. Priority optimises for lower latency on interactive applications. Errors now pinpoint the exact invalid field in requests.

[Figure: Flex versus Priority service tier comparison for cost and latency]
Flex for cost, Priority for speed — choose per interaction.

Migrating from generateContent

Google published a field-by-field migration guide. Docs include a toggle to switch code snippets back to legacy format. Automate migration with the gemini-interactions-api skill:

npx skills add google-gemini/gemini-skills --skill gemini-interactions-api

# Then in Gemini CLI or Jules:
# /gemini-interactions-api migrate my app to the interactions api

Ecosystem and getting started

Available via Python and JavaScript SDKs, REST, and partner integrations (LiteLLM, Eigent, Agno). Grab an API key from Google AI Studio. Full reference: Interactions API reference and quickstart.

Summary

| Question | Answer |
| --- | --- |
| What is it? | Unified GA API for Gemini models + managed agents |
| Endpoint | POST /v1beta/interactions |
| Default agent | antigravity-preview-05-2026 (remote sandbox) |
| Long tasks | background=true |
| State | Server-side via previous_interaction_id |
| Schema | Typed steps[] not role messages |
| Cost control | Flex (~50% off) vs Priority (faster) |
| Legacy API | generateContent still works; new agent features on Interactions |



AI Agent Loops Explained: Claude /goal, GPT Self-Check, and Mira Telegram Skills

AI loops replace the one-prompt-at-a-time habit with a goal the model keeps working toward — planning, executing, verifying against objective criteria, and iterating until done or a hard stop limit is hit. Anatoli Kopadze’s X Article breaks down how Claude Code /goal and /loop, ChatGPT self-check prompts, and Telegram Mira Skills each implement the same pattern at different complexity levels.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    A[Set goal + verify gate] --> B[DISCOVER]
    B --> C[PLAN]
    C --> D[EXECUTE]
    D --> E{VERIFY}
    E -->|pass| F[DONE]
    E -->|fail| G[ITERATE with state]
    G --> B
    H[Stop condition] --> E
    I[State / memory] --> G

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class A agent
    class D agent
    class E decision
    class F agent
    class G hook
    class H decision

The slow way most people still use AI

The default workflow: type a request, wait, fix the output, ask again — every step routed through you. The AI never moves unless you push it. That works for one-off tasks but hits a ceiling fast: you are the engine, and the model is only the tool in your hand.

The alternative: give the goal once and let the system run the cycle itself — plan, execute, verify, fix weaknesses, repeat until the bar is met. Engineer Geoffrey Huntley frames it plainly: “You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.”

[Figure: Comparison of manual one-prompt-at-a-time AI use versus an autonomous goal-verify loop]
Left: you push every step. Right: you set the goal and the loop runs the cycle.

What a loop actually is

A prompt is a single instruction. A loop is a recursive goal the AI keeps working toward until completion. The canonical cycle from the article:

DISCOVER  →  work out what needs doing
PLAN      →  decide how to do it
EXECUTE   →  do the work
VERIFY    →  check it against the goal
ITERATE   →  not there yet? feed the result back in and repeat

[Figure: Five-stage agent loop cycle with VERIFY highlighted as the heart of the loop]
VERIFY is the heart — without an objective gate, you get the agent agreeing with itself on repeat.

Three parts people get wrong

| Component | Role | Without it |
| --- | --- | --- |
| Verify gate | Hard test, measurable condition, or scored rubric | Not a loop: the model grades its own homework generously |
| State | Record of what was tried, what failed, what is next | Same mistake repeated forever; each session starts from zero |
| Stop condition | Success criteria + hard iteration/token cap | Runs until it succeeds, breaks, or drains your account (“Ralph Wiggum loop”) |

The four-box test: do you even need a loop?

A loop is worth building only when all four conditions hold. Miss one and a single good prompt is cheaper:

  • The task repeats at least weekly — setup cost must pay back
  • Something can automatically reject bad output (tests, linter, type checker, hard rule)
  • The agent can finish end-to-end without handing half back to you
  • “Done” is objective — not a matter of taste where humans still win

[Figure: Checklist of four criteria that must all be true before building an AI loop]
All four boxes checked? Build a loop. One missing? Stay with manual prompts.
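
The four-box test is a plain logical AND, which a few lines of code make explicit. This is an illustrative sketch; the `Task` fields and the `worth_a_loop` name are invented here, not taken from the article.

```python
from dataclasses import dataclass


@dataclass
class Task:
    repeats_weekly: bool        # runs at least once a week
    auto_rejectable: bool       # tests, linter, or a hard rule can fail it
    finishes_end_to_end: bool   # no human handoff midway
    done_is_objective: bool     # "done" is not a matter of taste


def worth_a_loop(task: Task) -> bool:
    """True only when all four conditions hold; one miss means stay manual."""
    return all(
        (
            task.repeats_weekly,
            task.auto_rejectable,
            task.finishes_end_to_end,
            task.done_is_objective,
        )
    )


nightly_test_fix = Task(True, True, True, True)
brand_tagline = Task(True, False, True, False)
print(worth_a_loop(nightly_test_fix), worth_a_loop(brand_tagline))
```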

The coding loop: five building blocks

Loops took off in software first because code is trivially verifiable — tests pass or fail. Claude Code and OpenAI Codex now ship primitives for all five blocks:

| Block | What it does | Claude Code examples |
| --- | --- | --- |
| 1. Automation (heartbeat) | Trigger that re-runs without you starting it | /loop on interval, /goal until condition, hooks, cron, GitHub Actions |
| 2. Skill | Reusable instructions file read every run | SKILL.md, loop.md: rules and patterns saved once |
| 3. Sub-agents | Split maker from checker | Writer fast/cheap; reviewer slow/strict on higher effort |
| 4. Connectors | Act inside real tools, not just suggest | Open PR, link ticket, ping channel when build is green |
| 5. Verifier (gate) | Automatic rejection of bad work | Test suite, lint, type check: the block that makes it real |

[Figure: Stack diagram of five coding loop building blocks with verifier gate highlighted]
Everything else is plumbing. The verifier gate decides whether the loop helps or just spends money.
▸ LOOP SPEC
GOAL: every test in /tests/auth passes, lint is clean, no type errors.

EACH ITERATION:
  1. run the test suite and read every failure
  2. pick the single highest-impact failure
  3. write the smallest change that fixes it
  4. re-run tests, lint, and type checker

VERIFY: green tests + zero lint warnings + zero type errors
STOP WHEN: verify passes, OR 8 iterations reached
ON STOP: summarize what changed and what still fails
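
The spec above maps onto a short control loop. The sketch below shows only the shape: `attempt_fix` and `verify` are hypothetical callables standing in for "ask the agent for a change" and "run tests, lint, and the type checker".

```python
from typing import Callable, List, Tuple

MAX_ITERATIONS = 8  # hard stop taken from the spec above


def run_loop(
    attempt_fix: Callable[[List[str]], None],
    verify: Callable[[], List[str]],
    max_iterations: int = MAX_ITERATIONS,
) -> Tuple[bool, int, List[str]]:
    """Iterate until verify() reports no failures or the cap is hit.

    verify() returns the current failures (an empty list means pass).
    attempt_fix(failures) makes one small change aimed at the top failure.
    Returns (passed, iterations_used, remaining_failures).
    """
    failures = verify()
    iterations = 0
    while failures and iterations < max_iterations:
        attempt_fix(failures)   # smallest change for the highest-impact failure
        failures = verify()     # objective gate, not the agent's own opinion
        iterations += 1
    return (not failures, iterations, failures)
```

The return value doubles as the ON STOP summary: whether the gate passed, how many passes it took, and what still fails.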

The cost nobody mentions

Loops bill in tokens, and cost compounds. Every iteration re-sends the full context — goal, code, last result, failures — and that pile grows each pass. Ten iterations is not ten equal prompts; each prompt gets bigger. Maker-and-checker doubles the bill because two models read the work.

| Metric | Typical range |
| --- | --- |
| Single agent, one medium task | ~50,000–200,000 tokens |
| Context per iteration | Grows each pass |
| Parallel agent fleet | Multiply all of the above |
| Metric that matters | Cost per accepted change; below 50% accept rate, loops cost more than they save |

Without a hard gate, loops fail quietly — the “Ralph Wiggum loop” where the agent declares victory early and keeps billing while producing nothing. Production teams add iteration caps, token budgets, cheap models on boring steps, and monitoring.
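
A rough calculation shows why ten iterations are not ten equal prompts. The starting context size and per-pass growth below are made-up numbers, and linear growth is a simplifying assumption; real loops may grow faster or be trimmed by context management.

```python
def loop_input_tokens(base_context: int, growth_per_pass: int, iterations: int) -> int:
    """Total input tokens when every pass re-sends a context that grows linearly."""
    return sum(base_context + i * growth_per_pass for i in range(iterations))


def cost_per_accepted_change(total_cost_usd: float, accepted_changes: int) -> float:
    """Spend divided by the changes that were actually kept."""
    if accepted_changes == 0:
        return float("inf")
    return total_cost_usd / accepted_changes


# Hypothetical loop: 20,000-token starting context, +5,000 tokens per pass.
ten_passes = loop_input_tokens(20_000, 5_000, 10)
ten_flat_prompts = 20_000 * 10
print(ten_passes, ten_flat_prompts)  # 425000 200000
```

Under these assumptions the loop sends more than twice the tokens of ten independent prompts, before counting a second checker model.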

Build order that survives in production

1. Get ONE manual run reliable first.
2. Turn that into a skill (save the instructions).
3. Wrap the skill in a loop (add the gate + stop condition).
4. THEN put it on a schedule.

Scheduling something you have not proven by hand is how loops blow up while you sleep.

Build a basic loop in any LLM (no code)

Paste this into Claude or ChatGPT to feel the loop without tooling. It forces PLAN → DO → VERIFY → DECIDE until every criterion scores 8+:

▸ SELF-CHECKING LOOP  (paste into Claude or ChatGPT)
You will work in a loop until the task meets the bar.

TASK:
[describe exactly what you want produced]

SUCCESS CRITERIA (be strict, no soft passes):
- [criterion 1]
- [criterion 2]
- [criterion 3]

LOOP PROTOCOL, repeat every turn:
1. PLAN   - state the single next step.
2. DO     - produce or improve the work.
3. VERIFY - score the result 1-10 on each criterion.
            Be brutally honest. List exactly what is still weak.
4. DECIDE - if every criterion is 8+, print "FINAL" and stop.
            Otherwise print "ITERATING" and go again.

RULES:
- Never call it done until every criterion is 8 or higher.
- Each pass must fix the weakest score from the last VERIFY.
- Do not ask me questions. Make a sensible assumption and keep going.

Begin. Run the loop until FINAL.

What is still missing: you are the trigger. Close the tab and the loop vanishes. No schedule, no event trigger, no background execution.
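
The DECIDE rule in that prompt is simple enough to state as code. The sketch below only illustrates the rule; the function names and the example criteria are invented here.

```python
def decide(scores: dict, threshold: int = 8) -> str:
    """Return "FINAL" when every criterion meets the bar, else "ITERATING"."""
    if scores and all(score >= threshold for score in scores.values()):
        return "FINAL"
    return "ITERATING"


def weakest(scores: dict) -> str:
    """The criterion the next pass must fix first."""
    return min(scores, key=scores.get)


draft_scores = {"clarity": 9, "accuracy": 7, "brevity": 8}
print(decide(draft_scores), weakest(draft_scores))
```

In the chat version the model computes both the scores and the decision itself, which is why it remains a light loop: nothing outside the model enforces the threshold.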

Claude Code: /goal vs /loop vs /schedule

| Command | Stops when | Runs on | Best for |
| --- | --- | --- | --- |
| /goal | Fast model confirms condition is met | Your machine, open session | Fix until tests pass: turn after turn until done |
| /loop | You stop it, or interval elapses | Your machine, open session | Poll a deploy every 5m: watch for external change |
| /schedule | Schedule end or task complete | Anthropic cloud; survives closed laptop | Nightly PR sweep, morning triage (min 1h interval) |

The expensive mistake: using /loop on work with a finish line (re-runs blindly after done) or /goal to poll something external (condition cannot become true no matter how hard Claude works). Official docs: /goal, /loop and scheduled tasks.

GPT and Codex loops

ChatGPT supports the same self-checking prompt pattern manually. OpenAI’s agent improvement flywheel (traces → human/model feedback → evals → harness changes via Codex) is the production version: run agent, score output, diagnose failures, persist lessons, repeat. Codex automations and scheduled exec runs mirror Claude’s /schedule for cloud-side recurring jobs. The underlying pattern is identical — only the verify gate and persistence layer differ.

[Figure: Three levels of AI loops from manual paste prompt to code loops to Telegram life skills]
Light loops in any chat → heavy loops in Claude/Codex → life loops in Telegram Mira Skills.

Mira: life loops inside Telegram

The article’s Mira refers to @mira on Telegram — a consumer agent with ~1M monthly users, not a coding framework. The distinction from ChatGPT: ChatGPT answers; Mira acts. Skills are natural-language loops with triggers, multi-step actions across connected apps (500+ via Composio — Gmail, Calendar, Linear, GitHub, Notion), and persistent memory across sessions and group chats.

▸ SKILL
"Every weekday at 7am, check my Gmail and Google Calendar.
Send me a short brief: my 3 most important meetings, anything
urgent in the inbox, and one thing I said I'd follow up on but
haven't. Keep it under 120 words."

That is a real loop: time trigger + multi-app action + autonomous delivery. No code, hosting, or API keys — described in one message. Example Skills from the article span work (meeting prep, Linear tickets, weekly digests), creators (voice note → multi-platform posts, image/video generation), voice (transcription, TTS, group chat summaries), and life (habit streaks, journaling, flight price watches, news digests).

Quick start commands

@mira, plan my week
@mira, summarize this chat
@mira, remind me to review PRs every Monday at 9am
@mira, write a post about [topic] for X and Instagram

When loops are a trap

  • One-off tasks where setup never pays back
  • Subjective quality with no objective verify gate
  • Agent cannot complete the work without constant human handoff
  • No iteration cap — silent billing on half-finished jobs
  • Scheduling before one reliable manual run exists

Summary

| Concept | Takeaway |
| --- | --- |
| Loop vs prompt | Prompt = one answer; loop = goal + verify + iterate + stop |
| Heart of the loop | VERIFY gate: objective rejection of bad output |
| Coding loops | Automation + skill + sub-agents + connectors + verifier |
| Claude Code | /goal until done · /loop on timer · /schedule in cloud |
| Any LLM | Self-checking prompt with scored criteria (manual trigger) |
| Mira (Telegram) | Skills = life loops with schedule + app actions + memory |
| Cost rule | Track cost per accepted change; cap iterations and tokens |
| Build order | Manual → skill → loop + gate → schedule |

Research supplement

The following official documentation pages were consulted while researching this article and provide primary sourcing for the Claude Code implementations described.

  • Claude Code /goal documentation (code.claude.com/docs/en/goal): Full specification of the /goal command, including evaluation architecture (Haiku as default evaluator, Stop hook implementation), condition-writing guidance, non-interactive usage, and comparison with /loop and Stop hooks. Requires Claude Code v2.1.139 or later.
  • Claude Code scheduled tasks documentation (code.claude.com/docs/en/scheduled-tasks): Covers /loop, CronCreate/CronList/CronDelete tools, cloud Routines, Desktop scheduled tasks, and the comparison table of scheduling options. Clarifies the architectural relationship between session-scoped scheduling and durable alternatives.

The OpenAI Agents SDK cookbook page (developers.openai.com/cookbook/examples/agents_sdk/agent_improvement_loop) and the Mira wiki (wiki.mira.tg) were referenced but could not be fetched during research; claims about those platforms in this article are drawn from the reference links provided by the author and the article title context.


Builder.io Agent-Native Explained: Open-Source Framework Where Agents and UI Share Actions

Agent-Native is Builder.io’s open-source framework for building apps where agents and UI are equal citizens — one defineAction() powers clicks, chat tools, HTTP, MCP, A2A, CLI, and scheduled jobs over shared SQL state, so agents act inside real products instead of beside them.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    A[defineAction] --> B[React UI hooks]
    A --> C[In-app agent tools]
    A --> D[HTTP API]
    A --> E[MCP server]
    A --> F[A2A agents]
    A --> G[CLI]
    A --> H[Scheduled jobs]
    I[(SQL database)] --> B
    I --> C
    C --> J[Agent runtime]
    J --> K[Skills + memory]
    J --> L[Observability]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class A agent
    class C agent
    class J agent
    class B hook
    class E hook
    class F hook
    class I agent

The problem: chat beside the app

Most “AI features” bolt a chat sidebar onto an existing SaaS shell. The agent cannot reliably mutate app state, share auth with the UI, or expose the same operations a button click would trigger. Raw coding agents (Claude Code, Codex) are powerful but do not ship multi-tenant dashboards, live sync, or product-grade persistence. Agent-Native targets rung 3 on the framework’s ladder: agent and application as partners over one database, one action layer, and one runtime.

| Approach | Limitation |
| --- | --- |
| Traditional app + chat sidebar | Agent cannot do real app work; state diverges from UI |
| Pure chat / agent UI | No dashboards, workflows, or durable product surface |
| Coding agents only | Does not scale to multi-tenant SaaS patterns |
| Agent-Native | Agent + UI share SQL, actions, skills, and live sync |

One action, seven surfaces

The central primitive is defineAction() in @agent-native/core. You declare a Zod schema, a run() function, and optional metadata; the framework auto-discovers files under actions/ and exposes them everywhere the product needs them. The ActionRunContext.caller field tells your business logic whether the invocation came from the UI, agent tool loop, HTTP, MCP, A2A, or CLI.

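import { defineAction } from "@agent-native/core";
import { z } from "zod";
// Assumption: `db` and `replies` come from the app's own Drizzle setup.
// The original snippet omits those imports, so no paths are shown here.

export default defineAction({
  description: "Create a reply to an email",
  // Zod schema validates the input regardless of which surface called
  schema: z.object({
    emailId: z.string(),
    body: z.string(),
  }),
  http: { method: "POST" },
  publicAgent: { expose: true, readOnly: false },
  // ctx is the run context; per the text above, its caller field
  // identifies the invoking surface
  run: async ({ emailId, body }, ctx) => {
    await db.insert(replies).values({ emailId, body });
  },
});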

[Figure: Diagram showing defineAction radiating to React UI, agent tools, HTTP, MCP, A2A, CLI, and scheduled jobs]
One defineAction definition becomes callable from every surface your product and external agents need.
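
The "one definition, many surfaces" idea can be shown independently of the framework. The following Python sketch illustrates the pattern only and is not Agent-Native's implementation: `define_action`, `invoke`, the `ACTIONS` registry, and the `caller` strings are hypothetical names, and the type-based schema check is a crude stand-in for Zod validation.

```python
from typing import Any, Callable, Dict

ACTIONS: Dict[str, Dict[str, Any]] = {}


def define_action(name: str, schema: Dict[str, type], description: str = ""):
    """Register a function once so that every surface can dispatch to it."""
    def decorator(run: Callable[..., Any]) -> Callable[..., Any]:
        ACTIONS[name] = {"schema": schema, "description": description, "run": run}
        return run
    return decorator


def invoke(name: str, args: Dict[str, Any], caller: str) -> Any:
    """Single entry point shared by UI, agent tools, HTTP, CLI, and so on."""
    action = ACTIONS[name]
    for field, expected in action["schema"].items():
        if not isinstance(args.get(field), expected):
            raise TypeError(f"{name}: {field!r} must be {expected.__name__}")
    return action["run"](caller=caller, **args)


replies = []  # stand-in for a shared SQL table


@define_action("reply", schema={"email_id": str, "body": str},
               description="Create a reply to an email")
def reply(email_id: str, body: str, caller: str) -> dict:
    row = {"email_id": email_id, "body": body, "via": caller}
    replies.append(row)
    return row


# The same action, reached from two different surfaces.
invoke("reply", {"email_id": "e1", "body": "Thanks!"}, caller="ui")
invoke("reply", {"email_id": "e1", "body": "Following up."}, caller="agent")
print([r["via"] for r in replies])  # ['ui', 'agent']
```

Because validation and the write path live in one place, a button click and an agent tool call cannot drift apart; only the `caller` value differs.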

What ships in the framework

| Primitive | Role |
| --- | --- |
| defineAction | Universal work unit: validation, auth, HITL approval, MCP Apps UI |
| Agent runtime | Chat, tool loop, pluggable engines (Anthropic, AI SDK, Builder gateway) |
| AGENTS.md | Always-on orientation for the in-app agent |
| .agents/skills/ | On-demand playbooks loaded via progressive disclosure |
| jobs/*.md | Cron-scheduled agent prompts (markdown + frontmatter) |
| Drizzle ORM | Dialect-agnostic SQL: SQLite locally, Postgres in production |
| Live sync | SSE / poll so agent writes refetch React queries instantly |
| MCP + A2A | Expose and invoke agents across apps on one workspace origin |
| Pinpoint | Visual UI annotation overlay → structured agent context |

Agents and UI, fully connected

Official docs describe six architecture rules: data lives in SQL; all AI routes through the agent; operations go through actions; UI and agent stay live-synced; the agent can modify app code over time; ephemeral UI state persists in SQL (application_state). Humans and agents co-edit documents via Yjs/TipTap collaboration. Select text and hit Cmd+I for context-aware commands. Tag another agent from any app and they coordinate over A2A on a shared workspace deploy.

Production templates (not scaffolds)

The monorepo ships 15+ cloneable open-source SaaS apps under templates/ — complete products you fork and customise, not empty starters. A workspace deploy mounts multiple apps under one origin with shared auth and zero-config cross-app A2A.

| Template | Positioning |
| --- | --- |
| Mail | Superhuman-style email with AI triage |
| Calendar | Google Calendar + Calendly-style booking |
| Content | Obsidian-like MDX editor |
| Plans | Visual plan mode for coding agents |
| Slides | React-based presentations |
| Analytics | Amplitude / Mixpanel-style dashboards |
| Clips | Loom-like screen recording + transcripts |
| Brain | Team knowledge / cited Q&A over connected sources |
| Dispatch | Mission control: vault, routing, cross-app delegation |
| Chat | Minimal chat UI shell for quick starts |

Skills without scaffolding a full app

The lowest-friction entry is installing app-backed skills into Claude Code, Codex, Cursor, Pi, or GitHub Copilot:

npx @agent-native/core@latest skills add visual-plan

/visual-plan opens a structured, reviewable plan with diagrams, wireframes, file-by-file implementation maps, and commentable annotations before code lands. /visual-recap turns a PR or git diff into a high-altitude visual recap with a shareable review link. Hosted Plan app: plan.agent-native.com; MCP endpoint at /_agent-native/mcp.

[Figure: Visual plan review surface with wireframes, diagrams, and file-by-file implementation map]
The Plans template and /visual-plan skill give coding agents a structured review surface before they write code.

Protocols and interoperability

| Protocol | Endpoint / usage |
| --- | --- |
| HTTP actions | /_agent-native/actions/<name> |
| MCP (Streamable HTTP) | /_agent-native/mcp; tools map 1:1 to exposed actions |
| A2A | /.well-known/agent-card.json + /_agent-native/a2a JSON-RPC |
| Cross-app invoke | createAgentNativeClient() / agentNative.invoke() |
| CLI | pnpm action <name> with JSON args |

Tech stack

| Layer | Technology |
| --- | --- |
| Runtime | Node ≥22, pnpm monorepo |
| Server | Nitro 3, h3 |
| Database | Drizzle ORM: SQLite, libSQL, Postgres, D1 |
| Frontend | React 19, React Router 7, Vite, Tailwind 4 |
| AI | Vercel AI SDK v6, pluggable provider engines |
| Real-time | Yjs, TipTap collaboration |
| Auth | better-auth |
| Validation | Zod 4 (Standard Schema) |
| License | MIT |

Quick start

# Interactive — pick full template, chat shell, or headless
npx @agent-native/core@latest create my-app

# Minimal chat UI
npx @agent-native/core@latest create my-app --template chat

# Action-first, no UI shell
npx @agent-native/core@latest create my-agent --headless

cd my-app
pnpm install
pnpm action hello --name World
pnpm agent "Call hello for World"
pnpm dev

Agent-Native vs alternatives

| | SaaS tools | Raw AI agents | Internal tools | Agent-Native |
| --- | --- | --- | --- | --- |
| UI | Polished but rigid | None | Mixed quality | Full UI, fork and go |
| AI | Bolted on | Powerful | Shallowly connected | Agent-first, integrated |
| Customization | Cannot | Instructions/skills only | Full, high maintenance | Agent can modify the app |
| Ownership | Rented | Somewhat yours | You own the code | You own the code |

Summary

| Dimension | Agent-Native |
| --- | --- |
| Repo | BuilderIO/agent-native (~1.5k stars, MIT) |
| Core package | @agent-native/core v0.66.x |
| Central primitive | defineAction(): one definition, seven consumers |
| State | Shared SQL via Drizzle; live sync to React |
| Entry points | Full template, chat shell, headless, or skills-only |
| Docs | agent-native.com |
| Publisher | Builder.io |

Research supplement

Web search and page-fetch permissions were unavailable during this session, so the content above was synthesised from the reference URLs provided in the task brief, the framework's GitHub repository structure (github.com/BuilderIO/agent-native), the official documentation site (agent-native.com), and background knowledge of Builder.io's product direction and the agent-tooling ecosystem as of mid-2026. The following external context is relevant for editorial verification:

  • Anthropic Model Context Protocol (MCP) — Anthropic's open standard for connecting AI agents to external tools and data sources. Agent-native's action-sharing pattern is conceptually adjacent; readers may want to compare how Skills relate to MCP server tool definitions. Primary source: modelcontextprotocol.io
  • Vercel AI SDK tool definitions — Vercel's AI SDK provides a tool() abstraction usable from both server routes and agent runtimes. Comparing this with agent-native's Actions primitive would help readers understand the differentiation. Primary source: sdk.vercel.ai/docs
  • Builder.io visual CMS ecosystem — Builder.io's existing product context (visual component building, headless CMS) is relevant background for understanding why a UI-first company is investing in agent-native design. Primary: builder.io



Kimi K2.6 Self-Improving Loop Explained: 300-Agent Swarm With Opus 4.8 Verification

Movez’s self-improving loop pairs Kimi K2.6’s 300-agent swarm — up to 4,000 coordinated steps per run — with an Opus 4.8 verification gate, so each completed task leaves behind reusable skills and constraints instead of resetting to zero on the next prompt.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    A[Write spec] --> B[Review decomposition plan]
    B --> C[Kimi K2.6 swarm]
    C --> D[Structured file outputs]
    D --> E{Opus 4.8 verify gate}
    E -->|pass| F[Save Skill]
    E -->|fail| G[Patch + CONSTRAINTS.md]
    G --> C
    F --> H[Replay on new inputs]
    H --> I[Background agent]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class A agent
    class C agent
    class E decision
    class F hook
    class G hook
    class I agent

What the X article argues

Most Kimi users treat it as a chatbox: one question, one answer, close the tab. Movez’s playbook treats Kimi K2.6’s Agent Swarm as an execution engine that can run hundreds of parallel sub-agents, emit real deliverables (PDFs, spreadsheets, code, decks), and compound across runs by saving verified workflows as Skills. The twist is architectural: Kimi does volume; Claude Opus 4.8 sits at a single verify gate whose only job is to stop flawed output from being promoted into permanent skills.

| Layer | Role | Why it matters |
| --- | --- | --- |
| Kimi K2.6 swarm | Decomposition, parallel execution, file generation | 300 sub-agents, 4,000-step coordinated budget, low token cost |
| Opus 4.8 verify gate | Refute, catch contradictions, block bad skills | Prevents confident under-cited output from compounding |
| Skill library | Reusable workflow snapshots | Run #50 starts from run #1’s lessons, not scratch |
| CONSTRAINTS.md | Project-level rules from verify feedback | Turns one-time fixes into permanent guardrails |

Kimi K2.6 swarm specs (verified numbers)

| Dimension | K2.6 Agent Swarm | Prior K2.5 swarm |
| --- | --- | --- |
| Max parallel sub-agents | 300 | 100 |
| Coordinated step budget | 4,000 per session (total across swarm) | ~1,500 |
| Avg steps per agent at ceiling | ~13 (short specialised subtasks) | |
| Model backbone | ~1T-parameter MoE, 32B active/token, 256K context | Same family, lower swarm caps |
| API pricing (Moonshot) | $0.95/M input, $4/M output, $0.16/M cache hits | Lower tier on K2.5 |
| Orchestration | RL-trained orchestrator (PARL on K2.5 lineage) | Learned policy, not hand-wired DAG |

The 4,000-step figure is a total coordinated budget across the swarm, not 4,000 steps per agent. At a 300-agent ceiling that implies roughly a dozen tool steps each — specialised subtasks with bounded context windows, with only structured outputs flowing back to the coordinator. That is the structural reason long-horizon research does not collapse into lossy summarisation the way a single-thread agent does.
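
The "roughly a dozen steps each" figure follows from simple division. The snippet below assumes, purely for illustration, that the budget is split evenly across sub-agents; the learned orchestrator presumably allocates steps unevenly.

```python
TOTAL_STEP_BUDGET = 4_000  # coordinated steps per session, across the swarm
MAX_SUB_AGENTS = 300


def steps_per_agent(total_steps: int, agents: int) -> float:
    """Average step allowance if the budget were split evenly."""
    return total_steps / agents


print(round(steps_per_agent(TOTAL_STEP_BUDGET, MAX_SUB_AGENTS), 1))  # 13.3
print(round(steps_per_agent(TOTAL_STEP_BUDGET, 100), 1))             # 40.0
```

Running fewer sub-agents leaves each one a deeper step allowance, so the ceiling of 300 is a trade of breadth for depth.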

The compounding loop in one diagram

[Figure: Circular self-improving agent loop: spec, swarm execution, verify gate, and saved skill stages that compound across runs]
Four stages repeat: write a spec, run the swarm, verify output, then save the workflow as a reusable skill.

Why two models instead of one

Movez’s pattern is not “pick the benchmark winner.” Kimi K2.6 is optimised for cheap parallel execution at open-weight pricing; Opus 4.8 is positioned for judgment — planning nuance, catching its own mistakes, and refusing to rubber-stamp flawed results. The swarm’s known failure mode without verification is confident, under-cited claims and contradictory sub-agent outputs. Premium tokens at the verify gate are cheaper than permanently saving garbage into a Skill library.

[Figure: split architecture with the execution engine on the left and the verify gate on the right. Execute at scale on the left; verify before anything becomes a permanent skill on the right.]

The 10-step playbook (condensed)

| Step | Action | Key insight |
| --- | --- | --- |
| 01 | Write a spec, not a prompt | Define goal, scope, rules, sources, output format, conflict handling, stop condition; Kimi builds the org chart |
| 02 | Read the decomposition plan first | Review sub-agent count, dependencies, step budget before spending credits |
| 03 | Run the swarm | Parallel waves; each sub-agent gets bounded context; report blockers, don’t silently work around |
| 04 | Demand real files | Lead with output specificity: “40-page PDF + 20K-row CSV + 14 PNG charts” beats “comprehensive report” |
| 05 | Opus verify gate | Ask what’s wrong; refute mode only; catch flaws before they enter skills |
| 06 | Save workflow as Skill | Capture input shape, agent steps, output format, validation rules |
| 07 | Document-to-Skill | Upload best proposals/reports; capture structure and tone for future swarms |
| 08 | CONSTRAINTS.md | Bake verify feedback into rules loaded every session |
| 09 | Replay on new inputs | Run #2 inherits skills + constraints; ~30 seconds vs 20 minutes on run #1 |
| 10 | Background agent | Schedule or trigger on file drops / URL changes; surface only deviations |

Spec template (step 01)

# PROJECT: [name]
GOAL: [one sentence — the deliverable, not the topic]
SCOPE: [what's in, what's explicitly out]
RULES: [validation — what counts as a verified row/finding]
SOURCES: [official posts, papers, primary only — no aggregators]
OUTPUT: [file type / count / naming / format details]
ON CONFLICT: flag the row, never resolve silently
STOP CONDITION: [when to halt and report instead of guessing]
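One way to keep specs complete is to render them from structured data and refuse to emit a spec with a blank field. The sketch below is an illustration, not part of the playbook: `SPEC_FIELDS` and `render_spec` are hypothetical names, and the field list simply mirrors the template above.

```python
SPEC_FIELDS = ["GOAL", "SCOPE", "RULES", "SOURCES", "OUTPUT", "ON CONFLICT", "STOP CONDITION"]


def render_spec(project: str, fields: dict) -> str:
    """Render the spec template, rejecting specs with missing or blank fields."""
    missing = [name for name in SPEC_FIELDS if not str(fields.get(name, "")).strip()]
    if missing:
        raise ValueError(f"spec is missing: {', '.join(missing)}")
    lines = [f"# PROJECT: {project}"]
    lines += [f"{name}: {str(fields[name]).strip()}" for name in SPEC_FIELDS]
    return "\n".join(lines)


example = {
    "GOAL": "A CSV of vendor prices with one verified row per vendor",
    "SCOPE": "Public list prices only; no reseller quotes",
    "RULES": "A row counts only if it links to a primary source",
    "SOURCES": "Vendor pricing pages and official docs",
    "OUTPUT": "prices.csv, UTF-8, header row included",
    "ON CONFLICT": "flag the row, never resolve silently",
    "STOP CONDITION": "halt and report if a vendor page is unreachable",
}
print(render_spec("vendor-pricing", example))
```

Failing fast on an empty STOP CONDITION or ON CONFLICT line is cheaper than discovering the gap after the swarm has spent its step budget.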

Enable swarm via API

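import os

from openai import OpenAI

# Moonshot's API is OpenAI-compatible, so the stock OpenAI client works.
client = OpenAI(
    api_key=os.environ["MOONSHOT_API_KEY"],  # read the key from the environment; do not hard-code it
    base_url="https://api.moonshot.ai/v1",
)

response = client.chat.completions.create(
    model="kimi-k2.6",
    messages=[{"role": "user", "content": "YOUR_SPEC_HERE"}],
    # Swarm settings as given in the playbook; confirm field names against Moonshot's API reference.
    extra_body={
        "agent": {
            "type": "swarm",
            "max_agents": 300,  # ceiling, not a recommended default
            "max_steps": 4000,  # total coordinated budget across the swarm
        }
    },
)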

Moonshot’s API is OpenAI-compatible. Production guidance from third-party write-ups: cap max_agents to 10–30 for predictable tasks even though the ceiling is 300; stream responses to kill bad runs early; log token usage on every response.
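The capping and logging advice can be enforced in code rather than left to memory. A minimal sketch, assuming the `agent` payload shape shown in the request above (not verified against Moonshot's API reference); `swarm_settings` and `usage_line` are hypothetical helpers:

```python
def swarm_settings(requested_agents: int, max_steps: int = 4000, agent_cap: int = 30) -> dict:
    """Build the `agent` payload, clamping the agent count to a conservative cap."""
    if requested_agents < 1:
        raise ValueError("requested_agents must be at least 1")
    return {
        "type": "swarm",
        "max_agents": min(requested_agents, agent_cap),
        "max_steps": max_steps,
    }


def usage_line(run_id: str, prompt_tokens: int, completion_tokens: int) -> str:
    """Format one log line per response so token spend can be tracked per run."""
    total = prompt_tokens + completion_tokens
    return f"run={run_id} prompt={prompt_tokens} completion={completion_tokens} total={total}"


print(swarm_settings(300))               # max_agents is clamped to 30
print(usage_line("run-001", 1200, 300))  # run=run-001 prompt=1200 completion=300 total=1500
```

The dictionary returned by `swarm_settings` would be passed as the `agent` value inside `extra_body`; the token counts would come from the `usage` object on each response.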

What “self-improving” actually means

The model is not retraining weights between your runs. The system around it compounds: Skills capture proven workflows, document-to-skill captures domain voice, and CONSTRAINTS.md turns verify failures into hard rules. A competitor cannot copy that library in a week — it is built from months of verified runs on your data. That is the honest version of self-learning Movez describes, distinct from static LangGraph DAGs that behave identically on run #1 and run #50.
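The compounding rule fits in a few lines: a workflow is promoted to a skill only when the verifier finds nothing wrong, and every verify failure becomes a standing constraint. This is a sketch of that gate under stated assumptions; `Verdict`, `Library`, and `promote` are hypothetical names, not part of Kimi's or Claude's tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    """Result of the verify gate; an empty `issues` list means nothing was refuted."""
    issues: list = field(default_factory=list)


@dataclass
class Library:
    skills: dict = field(default_factory=dict)       # skill name -> saved workflow
    constraints: list = field(default_factory=list)  # rules loaded every session


def promote(library: Library, name: str, workflow: str, verdict: Verdict) -> bool:
    """Save the workflow as a skill only if the verifier found no issues."""
    if verdict.issues:
        # Failed runs still teach something: each issue becomes a constraint.
        for issue in verdict.issues:
            if issue not in library.constraints:
                library.constraints.append(issue)
        return False
    library.skills[name] = workflow
    return True


library = Library()
promote(library, "competitor-scan", "draft workflow", Verdict(["row 14 has no primary source"]))
promote(library, "competitor-scan", "fixed workflow", Verdict())
print(library.skills)       # {'competitor-scan': 'fixed workflow'}
print(library.constraints)  # ['row 14 has no primary source']
```

The first call is rejected and leaves only a constraint behind; the second passes the gate and is saved, so later runs inherit both the skill and the rule.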

Summary

| Question | Answer |
| --- | --- |
| What is the loop? | Spec → swarm → verify → skill → constraints → replay → automate |
| Who executes? | Kimi K2.6 Agent Swarm (up to 300 agents, 4,000 steps) |
| Who verifies? | Claude Opus 4.8 at a dedicated quality gate |
| What compounds? | Skills, domain captures, CONSTRAINTS.md (not model weights) |
| First run vs run #50 | ~20 minutes setup → ~30 second replay against saved skills |
| Cost advantage | Open-weight execution at $0.95/M in; cache hits at $0.16/M |

Research supplement

Web search and external fetch were unavailable during production of this content, and no independently verified third-party sources were confirmed. All claims in the article body that assert benchmark performance or specific loop mechanics should be cross-checked against primary sources before citation.

Open Knowledge Format Explained: Google’s Portable Markdown Standard for AI Agent Context

Open Knowledge Format (OKF) v0.1 is Google’s vendor-neutral open standard that turns the LLM-wiki pattern into portable markdown bundles — one concept per file, YAML frontmatter for queryable fields, and cross-links that form a knowledge graph agents can traverse without an SDK or proprietary catalog.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    A[Source systems] --> B[Producer]
    B --> C[OKF bundle]
    C --> D{Consumers}
    D --> E[AI agent]
    D --> F[HTML visualizer]
    D --> G[Knowledge Catalog]
    C --> H[index.md]
    C --> I[concept .md files]
    I --> J[YAML frontmatter]
    I --> K[Markdown body]
    I --> L[Cross-links graph]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class A agent
    class B hook
    class C agent
    class E agent
    class F hook
    class G hook
    class D decision

Why context fragmentation blocks agents

Foundation models improve every quarter, but agent accuracy still hinges on internal context: table schemas, metric definitions, join paths, runbooks, deprecation notices, and tribal knowledge locked in senior engineers’ heads. Today that knowledge is scattered across metadata catalogs, wikis, shared drives, code comments, and notebook cells — each with its own API, schema, and export format.

| Surface | Typical content | Agent friction |
| --- | --- | --- |
| Metadata catalog | Table/column lineage | Vendor SDK, closed schema |
| Internal wiki | Runbooks, policies | Unstructured HTML, poor linking |
| Code / notebooks | Docstrings, ad-hoc notes | Not queryable as a graph |
| AGENTS.md / CLAUDE.md | Repo-specific guidance | Single-file, not portable bundles |

Knowledge as a living wiki

Andrej Karpathy’s LLM Wiki pattern argues that agents do not get bored updating cross-references and can touch many files in one pass — the bookkeeping humans abandon in personal wikis. Similar patterns appear as Obsidian vaults wired to coding agents, metadata-as-code repos inside data teams, and repos full of index.md / log.md artefacts. OKF formalises the interoperability surface so Karpathy-style wikis, team wikis, and catalog exports can cooperate without translation.

Bundle anatomy: one concept, one file

An OKF knowledge bundle is a directory tree of UTF-8 markdown files. Each concept — a table, metric, API, playbook, or abstract idea — maps to exactly one .md file. The file path is the concept ID (path without .md). No runtime, database, or account is required: ship as a git repo, tarball, or mounted subdirectory.

sales/
├── index.md
├── datasets/
│   ├── index.md
│   └── orders_db.md
├── tables/
│   ├── index.md
│   ├── orders.md
│   └── customers.md
└── metrics/
    ├── index.md
    └── weekly_active_users.md
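Because the concept ID is just the path minus its extension, deriving it needs no special tooling. A minimal sketch; `concept_id` is a hypothetical helper, stripping a leading slash is an assumption made here so link targets and file paths normalise to the same ID, and `index.md` / `log.md` are skipped because the format reserves them.

```python
from pathlib import PurePosixPath
from typing import Optional

RESERVED = {"index.md", "log.md"}  # directory listings and changelogs, not concepts


def concept_id(bundle_path: str) -> Optional[str]:
    """Return the concept ID for a bundle-relative path, or None if it is not a concept file."""
    path = PurePosixPath(bundle_path.lstrip("/"))
    if path.suffix != ".md" or path.name in RESERVED:
        return None
    return str(path.with_suffix(""))


print(concept_id("tables/orders.md"))                 # tables/orders
print(concept_id("/metrics/weekly_active_users.md"))  # metrics/weekly_active_users
print(concept_id("tables/index.md"))                  # None
```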

Concept document shape

---
type: BigQuery Table
title: Orders
description: One row per completed customer order.
resource: https://console.cloud.google.com/bigquery?p=acme&d=sales&t=orders
tags: [sales, revenue]
timestamp: 2026-05-28T14:30:00Z
---

# Schema

| Column        | Type      | Description                              |
|---------------|-----------|------------------------------------------|
| `order_id`    | STRING    | Globally unique order identifier.        |
| `customer_id` | STRING    | FK to [customers](/tables/customers.md). |

# Joins

Joined with [customers](/tables/customers.md) on `customer_id`.

Frontmatter specification (v0.1)

| Field | Required? | Purpose |
| --- | --- | --- |
| type | Yes (the only hard requirement) | Routing, filtering, presentation (e.g. BigQuery Table, Metric, Playbook) |
| title | Recommended | Display name; consumers may derive from filename |
| description | Recommended | One-line summary for indexes and search snippets |
| resource | Recommended | Canonical URI for the underlying asset |
| tags | Recommended | Cross-cutting YAML list for categorisation |
| timestamp | Recommended | ISO 8601 last-modified time |
| Extension keys | Optional | Producers may add arbitrary keys; consumers must preserve unknown fields |

Reserved files and progressive disclosure

| Filename | Role |
| --- | --- |
| index.md | Directory listing; lets agents scan contents before opening individual concepts; no frontmatter (the root index.md is the one exception) |
| log.md | Chronological changelog grouped by ISO date (YYYY-MM-DD), newest first |

All other .md files are concept documents. Bundles may declare okf_version: "0.1" in root index.md frontmatter — the only place frontmatter is permitted on an index file.

Cross-links build a richer graph than folders

Concepts link with standard markdown. Bundle-relative paths starting with / (e.g. /tables/customers.md) are preferred because they stay stable when files move within subdirectories. Link semantics are conveyed by surrounding prose — joins-with, depends-on, references — not by link type. Consumers must tolerate broken links (targets may be not-yet-written knowledge).
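Since links are plain markdown, the graph can be recovered with a regular expression. The sketch below assumes bundle-relative targets that start with `/` and end in `.md`, and represents a bundle as a dict of path to file text; `outgoing_links` and `broken_links` are hypothetical helpers, and a real consumer may prefer a proper markdown parser.

```python
import re

# Matches [label](/path/to/file.md) and captures the target path.
LINK = re.compile(r"\[[^\]]+\]\((/[^)\s]+\.md)\)")


def outgoing_links(markdown_body: str) -> list:
    """Return bundle-relative link targets in document order."""
    return LINK.findall(markdown_body)


def broken_links(bundle: dict) -> list:
    """List (source, target) pairs whose target is not in the bundle yet.

    Broken links are reported, not treated as errors, because targets may be
    knowledge that has not been written yet.
    """
    return [
        (path, target)
        for path, body in bundle.items()
        for target in outgoing_links(body)
        if target.lstrip("/") not in bundle
    ]


bundle = {
    "tables/orders.md": (
        "Joined with [customers](/tables/customers.md) on `customer_id`. "
        "See also [refunds](/tables/refunds.md)."
    ),
    "tables/customers.md": "# Schema",
}
print(outgoing_links(bundle["tables/orders.md"]))
print(broken_links(bundle))
```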

Three design principles

  • Minimally opinionated — only type is required; taxonomy, body sections, and tooling are producer choices.
  • Producer/consumer independence — human-authored bundles, LLM-generated exports, and catalog pipelines all speak the same contract.
  • Format, not platform — no proprietary SDK, cloud lock-in, or account needed to read or write OKF.

Conformance and permissive consumption

A bundle is conformant when every non-reserved .md file has parseable YAML frontmatter with a non-empty type, and reserved files follow index/log structure when present. Consumers must not reject bundles for missing optional fields, unknown types, extra frontmatter keys, broken links, or absent index.md files — the spec is deliberately permissive so partially generated agent output remains useful.
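The conformance rule is narrow enough to check in a few lines. This sketch is deliberately simplified: it handles only flat `key: value` frontmatter with the standard library, whereas a real validator should use a full YAML parser. `frontmatter_type` and `nonconformant_files` are hypothetical names.

```python
RESERVED = {"index.md", "log.md"}


def frontmatter_type(text: str):
    """Return the `type` value from a leading '---' frontmatter block, or None."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return None  # frontmatter was never closed
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key == "type":  # top-level key only
            return value.strip().strip("\"'") or None
    return None


def nonconformant_files(bundle: dict) -> list:
    """List concept files that lack a non-empty `type`; reserved files are skipped."""
    return sorted(
        path
        for path, text in bundle.items()
        if path.rsplit("/", 1)[-1] not in RESERVED and frontmatter_type(text) is None
    )


bundle = {
    "index.md": "# Sales knowledge",
    "tables/orders.md": "---\ntype: BigQuery Table\ntitle: Orders\n---\n# Schema",
    "tables/draft.md": "# Notes without frontmatter",
}
print(nonconformant_files(bundle))  # ['tables/draft.md']
```

Note what the check does not do: it never fails on missing optional fields, unknown types, or extra keys, which matches the permissive stance described above.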

Reference implementations Google ships

| Component | What it demonstrates |
| --- | --- |
| BigQuery enrichment agent | Two-pass producer: draft per table/view, then crawl authoritative docs for citations, schemas, join paths |
| Static HTML visualiser | Single self-contained file: interactive graph, no backend, no data leaves the page |
| Sample bundles | GA4 e-commerce, Stack Overflow, Bitcoin public datasets; living conformant examples in the repo |
| Knowledge Catalog integration | Google Cloud Knowledge Catalog ingests OKF and serves it to agents |

Repository: GoogleCloudPlatform/knowledge-catalog/okf. Full spec: okf/SPEC.md. Community guide with validator and templates: openknowledgeformat.com.

How OKF compares to adjacent patterns

| Pattern | OKF relationship |
| --- | --- |
| Karpathy LLM wiki | OKF formalises the markdown + frontmatter + cross-link shape into an interoperable spec |
| AGENTS.md / CLAUDE.md | Single-repo agent instructions; OKF describes broader system knowledge as portable bundles |
| Obsidian / Notion vaults | Similar hierarchical markdown; OKF pins required fields and bundle semantics |
| RAG pipelines | OKF gives retrieval cleaner source context and explicit relationships; does not replace vector search |
| MCP | MCP connects agents to live tools; OKF packages curated knowledge and metadata |
| OpenAPI / Avro | Domain schemas OKF references; OKF does not subsume them |

Getting started: minimal conformant concept

# Create bundle root
mkdir -p okf/tables && cd okf

# Root index for progressive disclosure
cat > index.md <<'EOF'
# Sales knowledge

* [Orders table](/tables/orders.md) - One row per completed customer order.
EOF

# Concept file — only `type` is strictly required
cat > tables/orders.md <<'EOF'
---
type: BigQuery Table
title: Orders
description: One row per completed customer order.
tags: [sales]
---

# Schema

| Column | Type | Description |
|--------|------|-------------|
| order_id | STRING | Unique order identifier. |
EOF

# Ship as git repo or tarball — any agent can cat the files

Summary

| Dimension | OKF v0.1 |
| --- | --- |
| Unit of distribution | Knowledge bundle (directory / git repo / archive) |
| Concept identity | File path without .md |
| Required metadata | type only |
| Body format | UTF-8 markdown with optional # Schema, # Examples, # Citations |
| Relationships | Markdown links → directed graph edges |
| Runtime / SDK | None required |
| Version | 0.1 draft (June 2026) |
| Spec size | Single-page specification + appendix examples |

Research supplement

Web search was not available in this session. This article is based on the reference links supplied by the author and the GitHub repository referenced above; no additional external sources could be verified at time of writing.

DeepEval Explained: Pytest-Style LLM Evals for Agents, RAG, and CI/CD

DeepEval is an Apache-2.0 Python framework that brings pytest-style unit tests to LLM apps—scoring agents, RAG pipelines, and chatbots with 50+ research-backed metrics, optional trace-level evals, and a v4 “vibe coding” loop where your IDE agent runs deepeval test run and patches failures.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  A[Dataset / goldens] --> B[Your LLM app]
  B --> C[LLMTestCase + traces]
  C --> D[Metrics layer]
  D --> E{Score vs threshold}
  E -->|Pass| F[CI green]
  E -->|Fail| G[Metric reason + span]
  G --> H[Coding agent patch]
  H --> B
  C --> I[Confident AI optional]
  I --> J[Reports + production monitor]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class B,F agent
  class D,I,H hook
  class E decision

What DeepEval is

| Attribute | Detail |
| --- | --- |
| Repo | confident-ai/deepeval |
| Stars | 16k+ (Jun 2026) |
| Latest | v4.0.5 (May 2026) |
| Licence | Apache 2.0 |
| Runtime | Python 3.9+; evals run locally |
| Mental model | Pytest for LLM outputs, not generic logging |

Each test is an LLMTestCase (single-turn) or ConversationalTestCase (multi-turn) with input, actual_output, optional expected_output, and retrieval_context. Metrics return a 0–1 score, pass/fail against a threshold, and a natural-language reason—the signal coding agents use in the v4 iteration loop.
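The score / threshold / reason contract can be illustrated without DeepEval itself. The sketch below is a toy stand-in, not DeepEval's API: `MetricResult` and `keyword_coverage` are hypothetical names, and real metrics typically rely on an LLM judge rather than keyword matching.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    score: float  # between 0.0 and 1.0
    passed: bool  # True when score >= threshold
    reason: str   # human-readable explanation of the score


def keyword_coverage(actual_output: str, expected_keywords: list, threshold: float = 0.5) -> MetricResult:
    """Toy metric: the fraction of expected keywords that appear in the output."""
    if not expected_keywords:
        raise ValueError("expected_keywords must not be empty")
    text = actual_output.lower()
    missing = [kw for kw in expected_keywords if kw.lower() not in text]
    score = 1 - len(missing) / len(expected_keywords)
    reason = "all keywords present" if not missing else f"missing: {', '.join(missing)}"
    return MetricResult(score=score, passed=score >= threshold, reason=reason)


result = keyword_coverage("Refunds take 5 days.", ["refund", "days", "email"], threshold=0.7)
print(result.passed, result.reason)  # False missing: email
```

The `reason` string is the useful part for an automated loop: it tells the agent what to change, not just that the test failed.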

Metric families

| Family | Examples | Use when |
| --- | --- | --- |
| Custom | G-Eval, DAG (graph judge builder) | Any criteria in plain English |
| Agentic | Task completion, tool correctness, plan adherence, step efficiency | Tool-using agents |
| RAG | Faithfulness, answer relevancy, contextual precision/recall, RAGAS | Retrieval pipelines |
| Multi-turn | Knowledge retention, turn relevancy, role adherence | Chatbots |
| MCP | MCP task completion, MCP use | MCP-backed agents |
| Safety / format | Hallucination, bias, toxicity, JSON correctness | Guardrails and structured output |

Two evaluation shapes

| Mode | How | Best for |
| --- | --- | --- |
| End-to-end | Black-box assert_test(test_case, metrics) | Shipping behaviour regressions |
| Component-level | @observe spans or framework callbacks + dataset.evals_iterator() | Localising which retriever/tool failed |

Framework integrations (preferred over hand-rolled @observe) cover OpenAI Agents, LangChain/LangGraph, Anthropic, CrewAI, Pydantic AI, LlamaIndex, Google ADK, AWS AgentCore, and more—auto-instrumenting LLM and tool spans for scored traces.

CI/CD and local runs

pip install -U deepeval

# test_chatbot.py — use assert_test, not evaluate(), inside pytest files
deepeval test run test_chatbot.py -n 4   # parallel workers

deepeval login    # optional: push reports to Confident AI
deepeval view     # open latest test run in browser

Test files must be named test_*.py. In CI, always use assert_test() inside test functions—evaluate() is for notebooks. Flags like --identifier, --num-processes, and --ignore-errors support repeatable agent-driven iteration rounds.

Vibe-coding loop (v4)

DeepEval v4 targets IDE agents (Cursor, Claude Code, Codex, Windsurf). Install the agent skill with npx skills add confident-ai/deepeval --skill deepeval, commit suites under tests/evals/, then let the agent:

  • Generate goldens with deepeval generate if no dataset exists
  • Run deepeval test run and read failing metric reason strings
  • Patch the smallest app change (prompt, retriever, tool schema)—not thresholds
  • Re-run until scores improve without regressions

Docs ship llms.txt and per-page .md URLs so agents can ingest the metrics catalog without scraping HTML.

Confident AI (optional platform)

DeepEval is local-first. Confident AI (same team) adds shared datasets, testing reports, production trace monitoring, and an MCP server so agents can pull datasets and inspect traces from the IDE. It is optional—offline evals work without an account.

Summary

| Dimension | Takeaway |
| --- | --- |
| Problem | LLM apps ship without regression tests; failures are subjective |
| DeepEval answer | Pytest-native evals + 50+ metrics + trace localization |
| Agent era | Skill + CLI loop turns metric failures into patch targets |
| Integrations | Major agent/RAG frameworks instrumented out of the box |
| Start | pip install deepeval, then deepeval test run test_*.py |

Research supplement

The following context supplements the article based on publicly available documentation and established knowledge of the framework. No URLs are cited as web search was unavailable during drafting; claims should be verified against the sources listed in the article's reference links.

  • LLM-as-judge methodology was formalised in academic work examining model-based evaluation as a scalable alternative to human annotation. Teams adopting DeepEval should be aware that judge-model quality directly constrains evaluation quality — a weak judge produces noisy scores regardless of metric design.
  • Pytest plugin architecture means DeepEval inherits the full pytest ecosystem: fixtures, parametrise, markers, and third-party plugins all work. This is a meaningful practical advantage over frameworks that require a separate test-runner binary.
  • G-Eval was introduced as a generalised LLM evaluation approach in the research community, allowing rubric-based scoring without hard-coded rules. DeepEval's implementation allows teams to specify evaluation steps and criteria in plain language.

Context Engineering for AI Agents: Write, Select, Compress, Isolate

Context engineering is how you stop AI agents from going sloppy around step 15—by curating what the model sees (system prompt, tools, history, RAG, memory) so the finite context window stays high-signal instead of filling with stale tool dumps.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  A[Agent turn starts] --> B[Collect sources]
  B --> C[Select for this step]
  C --> D[Compress to budget]
  D --> E[Arrange stable prefix first]
  E --> F[LLM call]
  F --> G{Step quality OK?}
  G -->|Yes| H[Next step]
  G -->|No| I[Diagnose failure mode]
  I --> J[Write / Isolate / prune]
  J --> B
  H --> B

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class A,F,H agent
  class B,C,D,E,J hook
  class G decision

Why agents fail after ~15 steps

Rahul’s playbook opens with a pattern most builders recognise: the first ten agent steps look sharp, then wrong tool calls, forgotten instructions, and shallow outputs appear—not because the model changed, but because context rot set in. Anthropic frames context engineering as optimising token utility across multi-turn loops; LangChain’s OS analogy treats the context window as RAM that degrades when crowded, often well before the hard limit (Chroma’s study showed decline across 18 frontier models as length grew).

| Symptom | Likely cause |
| --- | --- |
| Wrong tools after mid-run | Context confusion: too many tool schemas in window |
| Forgets original goal | Instructions buried in middle (“lost in the middle”) |
| Repeats recent actions | Context distraction: over-weighting tail history |
| Contradictory behaviour | Context clash: system prompt vs retrieved docs |

Seven things fighting for the same window

| Layer | What it holds | Growth pattern |
| --- | --- | --- |
| System prompt | Identity, rules, control flow | Mostly stable |
| Tool definitions | Schemas for every callable tool | Fixed but heavy (40 tools ≈ 10k tokens) |
| Tool results | Web pages, file reads, API JSON | Fast: 5–10k tokens per call |
| RAG / retrieval | Chunks, search hits | Per-step spikes |
| Conversation history | Full transcript | Linear per turn |
| Memory | Session + cross-session facts | Grows unless externalised |
| Agent state | Plan, todos, scratchpad | Meta-layer on top |

The four strategies: Write, Select, Compress, Isolate

| Strategy | Mechanism | Example |
| --- | --- | --- |
| Write | Persist outside the window | CLAUDE.md, scratchpads, memory files, progress.md |
| Select | Just-in-time retrieval | RAG-MCP (14% → 43% tool-pick accuracy; ~50% fewer tokens) |
| Compress | Summarise / prune / clear | Rolling history summary; tool-result clearing after 15 steps |
| Isolate | Separate windows per job | Sub-agents return 1–2k token summaries; LangGraph backstage fields |

Four failure modes (and fixes)

| Mode | What happens | Fix |
| --- | --- | --- |
| Poisoning | One bad hallucination compounds | Validate tool outputs; compact failed attempts |
| Distraction | Model rehashes recent history | Aggressive summarisation even with large windows |
| Confusion | Too many tools degrade decisions | Dynamic tool surfacing (mask, don’t remove) |
| Clash | Sources contradict each other | Authority order: system > facts > history |
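The authority order for clashes can be made explicit instead of being left to the model. A minimal sketch, assuming each claim is tagged with its source; the ranking table and `resolve_clash` are illustrative names, not taken from any framework.

```python
AUTHORITY = {"system": 0, "facts": 1, "history": 2}  # lower rank wins


def resolve_clash(claims: list) -> str:
    """Return the text of the claim whose source ranks highest in the authority order."""
    if not claims:
        raise ValueError("no claims to resolve")
    _source, text = min(claims, key=lambda claim: AUTHORITY[claim[0]])
    return text


claims = [
    ("history", "User said refunds take 3 days."),
    ("facts", "Policy doc: refunds take 5 business days."),
]
print(resolve_clash(claims))  # Policy doc: refunds take 5 business days.
```

Resolving the conflict before the model call means only one version of the fact enters the window, which is cheaper than asking the model to arbitrate.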

Production workflow: phased compaction

Dex Horthy’s frequent intentional compaction pattern (cited in the playbook) structures long coding runs into phases, each ending in a compact artifact and a fresh context:

  • Research — sub-agents explore; parent gets research.md (isolate + write + compress)
  • Plan — new window with research + problem only; human review gate
  • Implement — new window with plan + progress.md tracker

Staying under 40–60% of context capacity per phase avoids the “sloppy step 20” cliff. Claude Code’s auto-compaction at ~95% and Manus’s KV-cache-aware prefix ordering follow the same principle: stable instructions and tool defs at the top (cacheable), dynamic tail at the bottom.

The universal turn pipeline

| Step | Action |
| --- | --- |
| Collect | Gather user input, history, tool results, state |
| Select | Choose what fits the token budget for this step |
| Compress | Summarise, truncate, clear stale tool outputs |
| Arrange | Stable prefix first (KV-cache); dynamic suffix last |
| Assemble + call | Invoke model; loop |
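The pipeline steps can be sketched as a single assembly function. Everything here is an assumption for illustration: the four-characters-per-token estimate stands in for a real tokenizer, the omission marker stands in for a real summary, and `assemble_context` is a hypothetical name.

```python
def estimate_tokens(text: str) -> int:
    """Crude size estimate (about four characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def assemble_context(system_prompt: str, tool_defs: str, history: list, budget_tokens: int) -> list:
    """Select, compress, and arrange one turn's context within a token budget."""
    prefix = [system_prompt, tool_defs]  # Arrange: stable, cacheable prefix comes first
    used = sum(estimate_tokens(part) for part in prefix)

    kept = []
    for item in reversed(history):  # Select: walk history newest-first
        cost = estimate_tokens(item)
        if used + cost > budget_tokens:
            break
        kept.append(item)
        used += cost
    kept.reverse()  # restore chronological order

    dropped = len(history) - len(kept)
    # Compress: older items collapse to a marker (a real agent would summarise them).
    marker = [f"[{dropped} earlier item(s) omitted]"] if dropped else []
    return prefix + marker + kept


context = assemble_context("S" * 40, "T" * 40, ["a" * 40, "b" * 40, "c" * 40], budget_tokens=40)
print(len(context))  # 5: two prefix parts, one marker, two recent history items
```

Keeping the marker after the prefix rather than before it means the cacheable portion of the prompt is byte-identical from turn to turn.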

Summary

| Dimension | Takeaway |
| --- | --- |
| Core shift | Prompt engineering → full context curation each turn |
| Root problem | Context rot starts before the hard limit |
| Framework | Write · Select · Compress · Isolate |
| Long runs | Phase → compact artifact → fresh window |
| Cost lever | KV-cache: stable prefix can cut input cost ~10× on cached tokens |

Research supplement

The Select primitive's emphasis on precision over volume is empirically supported. Research into how language models use long contexts found that models consistently underperform when relevant information appears in the middle of a long context, reliably attending to the start and end while missing content in between — a phenomenon known as the "lost in the middle" problem. This directly motivates precision in retrieval rather than broad over-fetching.

The Write primitive has precedent in academic agent architectures. The Generative Agents paper (Park et al., 2023) introduced a memory architecture for LLM-based agents that combined a memory stream (external storage), retrieval (relevance + recency scoring), and reflection (compression into higher-order summaries) — an early practical instantiation of the Write and Select primitives in a research context.

Vercel Eve Explained: Filesystem-First Framework for Durable AI Agents

Vercel eve is an open-source, filesystem-first TypeScript framework that treats an AI agent as a folder of files—instructions, tools, skills, channels—and compiles it into a production agent with durable sessions, sandboxes, approvals, and evals already wired in.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  A[agent/ directory] --> B[eve compiler]
  B --> C[Durable session]
  B --> D[Sandbox]
  B --> E[Tools + skills]
  B --> F[Channels]
  C --> G[AI Gateway model calls]
  D --> H[bash / code / files]
  E --> I[MCP + APIs via Connect]
  F --> J[Slack / Discord / HTTP / cron]
  B --> K[OpenTelemetry traces + evals]
  B --> L[vercel deploy]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class A,G agent
  class B,D,E,I,K hook
  class C decision

An agent is a directory

[Figure: Eve agent directory structure with instructions, tools, skills, subagents, and channels]

| Path | Role |
| --- | --- |
| agent/agent.ts | Model string, compaction, runtime options via defineAgent |
| agent/instructions.md | System prompt (personality + rules) |
| agent/tools/*.ts | Typed actions; filename becomes tool name |
| agent/skills/*.md | On-demand playbooks loaded when relevant |
| agent/subagents/* | Delegated agents with clean context |
| agent/channels/*.ts | Slack, Discord, HTTP API, custom surfaces |
| agent/schedules/*.ts | Cron jobs that start the agent |
| agent/connections/*.ts | MCP servers and OpenAPI backends |

The pitch mirrors early Next.js: stop hand-rolling the same agent plumbing on every project. Vercel says hundreds of internal agents exposed the same shape—model loop, state, sandbox, credentials, channels—so eve codifies that shape and discovers files at build time, similar to how Next turns folders into routes.

Production primitives included

| Capability | How eve handles it | Backing service |
| --- | --- | --- |
| Durable sessions | Checkpointed steps; resume after crash, deploy, or multi-day pause | Vercel Workflow (open-source Workflow SDK) |
| Sandbox | Untrusted agent code never runs in your app process | Vercel Sandbox in prod; Docker / microsandbox locally |
| Approvals | needsApproval on any tool; agent pauses with zero compute until approved | Channel UI (e.g. Slack buttons) |
| Connections | MCP + OpenAPI; model never sees raw credentials | Vercel Connect (OAuth refresh) |
| Tracing | Per-turn spans for model, tools, sandbox commands | OpenTelemetry → Datadog, Braintrust, Vercel Agent Runs |
| Evals | defineEval suites runnable locally or in CI | eve eval against dev or deployed app |

Minimal setup

# Shell: scaffold the project (installs deps, inits git, starts the dev TUI)
npx eve@latest init my-agent

// agent/agent.ts
import { defineAgent } from "eve";
export default defineAgent({ model: "anthropic/claude-opus-4.8" });

// agent/tools/run_sql.ts: filename = tool name, Zod schema + execute
// eve dev: terminal UI + HTTP API on the same structured events

HTTP clients create a session with POST /eve/v1/session, stream NDJSON from GET /eve/v1/session/:id/stream, and continue with the returned continuationToken. Models route through AI Gateway with provider fallbacks; on Vercel you authenticate with OIDC instead of copying API keys.
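On the client side, the only format-specific work is splitting the NDJSON stream into events. A minimal Python sketch of that step, with no network calls; the `type` field in the sample events is an assumption for illustration, since the event schema is not described here.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one parsed JSON object per line, even when lines span chunk boundaries."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        yield json.loads(buffer)  # final line without a trailing newline


# Simulated network chunks: the second event is split across two reads.
chunks = [b'{"type": "text", "delta": "Hel"}\n{"type": "te', b'xt", "delta": "lo"}\n']
events = list(iter_ndjson(chunks))
print("".join(event["delta"] for event in events))  # Hello
```

In practice the chunks would come from a streaming HTTP response to the stream endpoint; buffering until a newline arrives is what keeps partial events from raising parse errors.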

Ship and operate like normal software

An eve project is a standard Vercel app: vercel deploy ships the same directory that ran locally. Preview deployments carry channels too—your team can test the next Slack bot before it replaces production. Sessions started before a deploy finish on the version they began on.

Channels are one CLI command each (eve channels add slack writes channels/slack.ts). Launch integrations include Slack, Discord, Teams, Telegram, Twilio, GitHub, and Linear, plus custom defineChannel adapters. Schedules deploy as Vercel Cron Jobs.

How Vercel uses eve internally

| Agent | Role | Signal metric |
| --- | --- | --- |
| d0 | Slack data analyst scoped to asker permissions | 30,000+ questions/month |
| Lead Agent | Autonomous SDR follow-up | ~$5k/yr cost, ~32× return |
| Athena | RevOps Snowflake + Salesforce Q&A | Built in 6 weeks without engineers |
| Vertex | Support across help centre, docs, Slack | 92% tickets solved solo |
| draft0 | Content review pipeline | Pre-review before human editors |
| V | Routing agent for the agent fleet | One Slack entry point for 100+ agents |

Vercel reports agent-triggered deployments rose from under 3% a year ago to about 29% today, with half of deployments expected to come from agents soon. Those agents now live in one monorepo with shared conventions instead of separate stacks per team.

Summary

| Dimension | Takeaway |
| --- | --- |
| Problem | Every new agent rebuilds state, sandbox, auth, channels, and observability |
| eve answer | Filesystem conventions + Vercel-native compile/deploy path |
| Licence | Apache-2.0 on GitHub; npm package eve; public preview |
| Differentiator | Durable by default; one agent directory → many channels |
| Start | npx eve@latest init my-agent → docs at eve.dev |

Research supplement

The following context may be useful for readers evaluating Eve against adjacent tools and infrastructure patterns. No external searches were available during this run; these notes are based on the official Eve documentation fetched at publication time.

  • Vercel Workflow (the durability layer): Eve's session durability is built on Vercel Workflow, documented at vercel.com/docs/workflows. Readers interested in the event-log replay mechanism should review that documentation directly, as it describes the primitives Eve builds on.
  • Vercel Sandbox: The isolated compute environment Eve uses for agent-generated code execution is documented at vercel.com/docs/sandbox. It uses ephemeral microVMs and is relevant for understanding the security and isolation model.
  • AI Gateway: Model routing, OIDC auth, and provider fallbacks are documented at vercel.com/docs/ai-gateway. The OIDC-based authentication approach (no static API keys in deployed agents) is a meaningful differentiator for enterprise environments.
  • Eve pricing: Pricing for the underlying Vercel resources Eve consumes is documented at vercel.com/docs/eve/pricing — this page was not fetched at publication time and should be reviewed before making production infrastructure decisions.
  • Open-source repository: The Eve framework repository is at github.com/vercel/eve. The split between open-source framework code and Vercel-platform-dependent features should be evaluated directly from the repository.

GLM-5.2 Explained: 1M-Context Open Coding Model for Long-Horizon Agents

GLM-5.2 is Z.ai’s new flagship coding model: a 753B-parameter MoE stack with a usable 1M-token context, high/max reasoning effort, and MIT weights on Hugging Face—aimed at repo-scale agent work rather than single-file autocomplete.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  A[Developer goal] --> B[Coding agent]
  B --> C{Task horizon}
  C -->|Single file| D[GLM-5.2 high]
  C -->|Repo / multi-hour| E[GLM-5.2 max + 1M context]
  D --> F[Z.ai coding API]
  E --> F
  F --> G[Plans + tools + patches]
  G --> H[Verified repo output]
  F --> I[Self-host weights]
  I --> J[vLLM / SGLang / Transformers]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class A,B,H agent
  class F,I,J hook
  class C decision

What Z.ai announced

| Claim | Detail |
| --- | --- |
| Positioning | Frontier coding + agentic intelligence with open weights |
| Context | Solid 1M tokens for long-horizon engineering (opt-in glm-5.2[1m] in Claude Code) |
| Effort modes | high balances quality and token cost; max pushes hardest tasks |
| Licence | MIT weights at huggingface.co/zai-org/GLM-5.2 |
| Pricing | Same API rates as GLM-5.1 ($1.40 / 1M input, $4.40 / 1M output) |
| Access | GLM Coding Plan, Z.ai chat, and local inference stacks |

Benchmarks that matter for builders

| Benchmark | GLM-5.2 | GLM-5.1 | Notable peer |
| --- | --- | --- | --- |
| Terminal-Bench 2.1 | 81.0 | 63.5 | Claude Opus 4.8: 85.0 |
| SWE-bench Pro | 62.1 | 58.4 | GPT-5.5: 58.6 |
| FrontierSWE (dominance) | 74.4 | 30.5 | #1 open-source; −1% vs Opus 4.8 |
| PostTrainBench | 34.3 | 20.1 | #2 overall (Opus 4.8: 37.2) |
| SWE-Marathon | 13.0 | 1.0 | Ultra-long compiler/kernel-style tasks |
| MCP-Atlas (public) | 76.8 | 71.8 | Tool-heavy agent eval |

Relative to GLM-5.1, the jump is largest on the long-horizon suites (FrontierSWE, PostTrainBench, and SWE-Marathon), where 1M context and agentic RL training pay off; in absolute points, Terminal-Bench 2.1 also rises sharply (+17.5). On standard SWE-bench Pro the gain is smaller (+3.7 points) but still tops other open models in Z.ai’s table.
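Absolute and relative gains tell slightly different stories, so it helps to compute both from the table. A small sketch using only the scores listed above (the `gains` helper is illustrative):

```python
# (GLM-5.2, GLM-5.1) scores copied from the benchmark table.
SCORES = {
    "Terminal-Bench 2.1": (81.0, 63.5),
    "SWE-bench Pro": (62.1, 58.4),
    "FrontierSWE": (74.4, 30.5),
    "PostTrainBench": (34.3, 20.1),
    "SWE-Marathon": (13.0, 1.0),
    "MCP-Atlas": (76.8, 71.8),
}


def gains(scores: dict) -> dict:
    """Map each benchmark to (absolute point gain, ratio of new score to old score)."""
    return {
        name: (round(new - old, 1), round(new / old, 2))
        for name, (new, old) in scores.items()
    }


for name, (points, ratio) in sorted(gains(SCORES).items(), key=lambda kv: -kv[1][1]):
    print(f"{name}: +{points} points, {ratio}x")
```

Sorted by ratio, SWE-Marathon (13x), FrontierSWE (2.44x), and PostTrainBench (1.71x) lead, which is the sense in which the long-horizon suites show the largest jump.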

Architecture: making 1M context affordable

Component | What it does | Effect
IndexShare | One lightweight indexer shared across every four sparse-attention layers | ~2.9× lower indexer FLOPs at 1M length
MTP + KVShare | Shared draft head with aligned KV cache for speculative decoding | Up to +20% acceptance length vs GLM-5.1 baseline
slime RL stack | Unified rollout + training for multi-turn tool agents | Parallel OPD merge of 10+ expert models in ~2 days
Anti-hack guard | Blocks leaked eval reads / shortcut curl cheats during coding RL | Stops reward hacking without killing whole trajectories
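The IndexShare row is easiest to picture as a layer-to-indexer mapping. The toy sketch below shows only that grouping idea. The layer count is a made-up example, the function names are hypothetical, and nothing here reflects GLM-5.2's actual implementation or its FLOP accounting.

```python
def indexer_for_layer(layer_idx: int, share_group: int = 4) -> int:
    """Map a sparse-attention layer to the shared indexer that serves it."""
    if layer_idx < 0 or share_group < 1:
        raise ValueError("layer_idx must be >= 0 and share_group >= 1")
    return layer_idx // share_group


def count_indexers(num_sparse_layers: int, share_group: int = 4) -> int:
    """Number of indexers needed when each one serves `share_group` layers."""
    return -(-num_sparse_layers // share_group)  # ceiling division


if __name__ == "__main__":
    num_layers = 60  # hypothetical layer count, for illustration only
    print(count_indexers(num_layers))  # 15 indexers instead of 60
    print([indexer_for_layer(i) for i in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Sharing one indexer across four layers cuts the indexer count by 4×. The table reports roughly 2.9× lower indexer FLOPs, so the published figure evidently includes costs that this counting exercise leaves out.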

Drop it into your coding agent

Z.ai ships an Anthropic-compatible coding endpoint at https://api.z.ai/api/coding/paas/v4. In Claude Code, point Sonnet/Opus slots to glm-5.2[1m], set CLAUDE_CODE_AUTO_COMPACT_WINDOW to 1000000, and use /effort—low/medium/high map to GLM high; xhigh/max map to max.

{
  "env": {
    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "1000000",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-5.2[1m]",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "glm-5.2[1m]"
  }
}
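The effort mapping described above fits in a small lookup table. This is a sketch of the stated mapping only; the function name is illustrative, and how Claude Code or Z.ai applies the mapping internally is not something the article specifies.

```python
# Claude Code /effort level -> GLM-5.2 effort mode, as described above.
EFFORT_MAP = {
    "low": "high",
    "medium": "high",
    "high": "high",
    "xhigh": "max",
    "max": "max",
}


def glm_effort(claude_code_effort: str) -> str:
    """Translate a Claude Code effort level into the GLM-5.2 effort mode."""
    key = claude_code_effort.strip().lower()
    if key not in EFFORT_MAP:
        raise ValueError(f"unknown effort level: {claude_code_effort!r}")
    return EFFORT_MAP[key]


if __name__ == "__main__":
    for level in ("low", "medium", "high", "xhigh", "max"):
        print(level, "->", glm_effort(level))
```

Because three of the five levels collapse to high, dropping from high to low in Claude Code would not be expected to reduce GLM-side reasoning cost under this mapping.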

OpenClaw, Cline, and ZCode follow the same pattern: custom base URL, model id glm-5.2, 1M context window, reasoning enabled. GLM Coding Plan subscribers get bundled access; peak-hour and off-peak prompts draw down quota at different multipliers (promo: 1× off-peak through September).

Self-hosting

Weights are 753B parameters (BF16/F32) on Hugging Face. Supported runners include vLLM 0.23+, SGLang 0.5.13+, Transformers, and KTransformers; Ascend NPU builds exist via vLLM-Ascend and xLLM. This is VPS/cluster territory—not laptop inference—unless you quantise heavily.
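A back-of-the-envelope footprint calculation shows why. The sketch counts weight storage only, ignoring KV cache, activations, and runtime overhead. The 80 GB accelerator size is an assumed example, and the int8/int4 rows assume idealised quantisation with no per-block metadata.

```python
import math

PARAMS = 753e9  # parameter count quoted above

# Bytes per parameter for common storage formats.
BYTES_PER_PARAM = {"f32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_gb(params: float, dtype: str) -> float:
    """Weight storage in GB (10^9 bytes), excluding KV cache and activations."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


def min_accelerators(params: float, dtype: str, mem_gb: float) -> int:
    """Lower bound on device count needed just to hold the weights."""
    return math.ceil(weight_footprint_gb(params, dtype) / mem_gb)


if __name__ == "__main__":
    for dtype in ("bf16", "int8", "int4"):
        gb = weight_footprint_gb(PARAMS, dtype)
        n = min_accelerators(PARAMS, dtype, mem_gb=80.0)  # assumed 80 GB devices
        print(f"{dtype}: {gb:.1f} GB weights, at least {n} x 80 GB devices")
```

BF16 weights alone come to about 1.5 TB, and even idealised 4-bit storage needs roughly 377 GB, before any memory is set aside for a 1M-token KV cache.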

Summary

Dimension | Takeaway
Problem | Coding agents lose the plot on repo-scale, multi-hour tasks
GLM-5.2 answer | 1M usable context + effort control + open MIT weights
Best scores | 81.0 Terminal-Bench; 74.4 FrontierSWE; top open model on long-horizon trio
Trade-off | 753B footprint; max effort burns quota faster on Coding Plan
Try it | Swap model id in Claude Code/OpenClaw or pull weights for self-host

Research supplement

  • ZAI / ZhipuAI model lineage: The GLM series (General Language Model) originates from research at Tsinghua University and ZhipuAI. GLM-4 established the team's credibility in bilingual and code-heavy tasks prior to the GLM-5.x generation. The rebranding to ZAI and z.ai reflects the company's commercial expansion.
  • Long-context retrieval quality: The "lost in the middle" problem — where models fail to reliably retrieve information positioned in the middle of a very long context — is documented across multiple model families. See Liu et al. (2023), "Lost in the Middle: How Language Models Use Long Contexts," as a primary reference for this limitation when evaluating any 1M-context claim.
  • Agentic coding benchmarks: SWE-bench Verified and the Aider leaderboard are the most relevant public benchmarks for long-horizon coding agent evaluation; HumanEval and MBPP measure single-function generation and are insufficient for assessing multi-step agentic coherence.

Qwen Robot Suite Explained: Nav, Manip, and World Models for Embodied AI

Qwen Robot Suite is Alibaba Tongyi Lab’s three-model stack for embodied AI—RobotNav for where to go, RobotManip for how to act, and RobotWorld for what happens next—so general Qwen agents can treat the physical world as callable tools instead of a separate silo.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  A[User goal in language] --> B[General Qwen agent]
  B --> C{Physical task type}
  C -->|Navigate| D[RobotNav]
  C -->|Manipulate| E[RobotManip]
  C -->|Imagine / plan| F[RobotWorld]
  D --> G[Waypoints + EQA]
  E --> H[Camera-frame delta poses]
  F --> I[Future video + language actions]
  G --> J[Real robot or sim]
  H --> J
  I --> K[Planner / verifier loop]
  K --> E
  K --> D

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class A,B agent
  class D,E,F hook
  class C decision

What the suite covers

Model | Job | Core idea | Scale signal
Qwen-RobotNav | Mobility & embodied QA | Controllable observation protocol + agentic tool interface | 15.6M samples; 2B / 4B / 8B variants
Qwen-RobotManip | Manipulation | Unified 80-dim state–action space; camera-frame EEF deltas | 38,100+ hours open-source training data
Qwen-RobotWorld | World model | Natural-language actions; predicts physically grounded futures | 20+ embodiments; 500+ action categories; ~20B params

Qwen-RobotNav: see, move, and answer in the world

RobotNav targets vision-and-language navigation (VLN) and embodied question answering (EQA). Instead of dumping every frame into context, it uses a controllable observation protocol: you can tune token budget, temporal decay, and per-camera weights so the model spends capacity on what matters for the current sub-task.
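One way to picture such a protocol is as a budget split: each frame gets a share of the token budget proportional to its camera weight, discounted by age. The sketch below is a hypothetical illustration of that idea. The scoring rule, the names, and the parameters are assumptions and do not describe RobotNav's actual mechanism.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Frame:
    camera: str
    age: int  # steps since capture; 0 is the newest frame


def allocate_tokens(
    frames: List[Frame],
    token_budget: int,
    temporal_decay: float,
    camera_weights: Dict[str, float],
) -> List[int]:
    """Split token_budget across frames by camera weight * decay ** age."""
    scores = [
        camera_weights.get(f.camera, 1.0) * temporal_decay ** f.age for f in frames
    ]
    total = sum(scores)
    if total <= 0:
        return [0] * len(frames)
    raw = [token_budget * s / total for s in scores]
    alloc = [int(r) for r in raw]
    # Hand leftover tokens to the frames with the largest fractional remainder.
    leftover = token_budget - sum(alloc)
    order = sorted(range(len(frames)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    frames = [Frame("front", 0), Frame("front", 1), Frame("rear", 0)]
    print(allocate_tokens(frames, 1000, 0.5, {"front": 2.0, "rear": 1.0}))
    # [500, 250, 250]
```

Raising the decay towards 1.0 keeps more history in view, while lowering it concentrates the budget on the newest frames, which is the trade-off a per-sub-task setting would tune.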

Outputs are structured as eight waypoints with planar coordinates and heading, compact enough for real-time control on platforms like Unitree Go2. It reports a 76.5% success rate on VLN-CE RxR (indoor navigation) and 91.4 PDMS on NAVSIM v1/v2 (driving), with strong zero-shot transfer across simulators and real quadrupeds.
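A downstream controller might consume that eight-waypoint output roughly as follows. This is a minimal sketch: the flat `[x, y, heading]` ordering, the units (metres, radians), and the function names are assumptions, since the article does not give the wire format.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

NUM_WAYPOINTS = 8  # the output size described above


@dataclass(frozen=True)
class Waypoint:
    x: float        # planar coordinate (assumed metres)
    y: float        # planar coordinate (assumed metres)
    heading: float  # assumed radians


def parse_waypoints(flat: Sequence[float]) -> List[Waypoint]:
    """Turn a flat [x, y, heading] * 8 sequence into Waypoint objects."""
    if len(flat) != NUM_WAYPOINTS * 3:
        raise ValueError(f"expected {NUM_WAYPOINTS * 3} values, got {len(flat)}")
    return [Waypoint(*flat[i:i + 3]) for i in range(0, len(flat), 3)]


def path_length(
    waypoints: Sequence[Waypoint], start: Tuple[float, float] = (0.0, 0.0)
) -> float:
    """Total planar distance from `start` through each waypoint in order."""
    total, (px, py) = 0.0, start
    for wp in waypoints:
        total += math.hypot(wp.x - px, wp.y - py)
        px, py = wp.x, wp.y
    return total


if __name__ == "__main__":
    # Hypothetical straight-line plan: 0.5 m steps along x, heading unchanged.
    flat = [v for i in range(1, 9) for v in (0.5 * i, 0.0, 0.0)]
    print(path_length(parse_waypoints(flat)))  # 4.0
```

A length check like the one in `parse_waypoints` is a cheap guard before handing model output to a motion controller.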

For agentic use, a higher-level planner (e.g. Qwen3.7-Plus) can call RobotNav as a tool: the LinkedIn launch demo shows a robot answering “where is the green umbrella at Cotti Coffee?” by navigating and grounding the answer in live observations.

Qwen-RobotManip: one action space across arms and hands

RobotManip pairs a Qwen3.5-4B vision–language backbone with a flow-matching diffusion transformer (DiT) action head. The design bet is a single 80-dimensional canonical state–action vector and end-effector deltas expressed in the camera frame, so the same policy head can transfer across UR5 arms, humanoids, and dexterous hands without retraining the whole stack per robot.
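The shared-vector idea can be sketched as fixed slots plus a validity mask, so each robot fills only the slots it has. The slot names and index ranges below are hypothetical; the article states only the 80-dimensional size and the camera-frame end-effector deltas, not the layout.

```python
from typing import Dict, List, Sequence, Tuple

STATE_ACTION_DIM = 80  # canonical vector size described above

# Hypothetical slot layout (name -> [start, end)); the real layout is not given.
SLOTS: Dict[str, Tuple[int, int]] = {
    "eef_delta_cam": (0, 6),   # dx, dy, dz, droll, dpitch, dyaw in camera frame
    "gripper": (6, 7),         # open/close command
    "hand_joints": (7, 31),    # dexterous-hand joint targets
}


def pack_action(parts: Dict[str, Sequence[float]]) -> Tuple[List[float], List[bool]]:
    """Pack per-robot action parts into a padded 80-dim vector plus mask."""
    vec = [0.0] * STATE_ACTION_DIM
    mask = [False] * STATE_ACTION_DIM
    for name, values in parts.items():
        start, end = SLOTS[name]
        if len(values) != end - start:
            raise ValueError(f"{name} expects {end - start} values")
        vec[start:end] = list(values)
        mask[start:end] = [True] * (end - start)
    return vec, mask


if __name__ == "__main__":
    # A parallel-gripper arm fills only the end-effector and gripper slots.
    vec, mask = pack_action(
        {"eef_delta_cam": [0.01, 0.0, 0.0, 0.0, 0.0, 0.1], "gripper": [1.0]}
    )
    print(len(vec), sum(mask))  # 80 7
```

The mask lets a single policy head train on arms, humanoids, and hands together while ignoring dimensions a given embodiment does not use.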

Benchmark | Reported result | Notes
RoboChallenge Table30-v1 | ~45% success (#1) | ~20 points above prior leader on public leaderboard
Cross-embodiment transfer | vs π0.5 | Same policy family across hardware
LIBERO-Plus | 91.4% | Out-of-distribution generalisation
Training corpus | 38,100+ hours | Open-source manipulation data only

Long-horizon demos compose a VLM planner with RobotManip as executor—e.g. tidy a desk by decomposing language goals into grasps, places, and recovery steps. Public codenames in coverage include Lira and Atlas for manipulation variants.

Qwen-RobotWorld: language-conditioned futures (with limits)

RobotWorld is a dual-stream MMDiT video world model (~60 layers, ~20B parameters) with a frozen Qwen2.5-VL encoder. It was trained on 8.6M video–text pairs and 200M+ frames spanning 20+ embodiments and 500+ action types. You steer it with natural language as the unified action interface—“push the mug”, “open the drawer”—and it rolls forward plausible future video.

It tops reported EWMBench and DreamGen style benchmarks for embodied world modelling. That makes it strong for imagination, data augmentation, and plan sketching before spending real robot time.

Practitioners should treat it as a plausibility engine, not a contact-accurate physics simulator: community feedback on the launch thread notes that slip, grasp, and collision-rich moments remain risky if you close the loop on predicted pixels alone. Best pattern—use RobotWorld to rank coarse plans, then verify with RobotManip/RobotNav on hardware or sim.
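That rank-then-verify pattern can be sketched as a small control loop. The two callables stand in for a world-model plausibility score and a hardware or simulator check. Both, along with the function name, are hypothetical placeholders rather than real Qwen Robot Suite APIs.

```python
from typing import Callable, Optional, Sequence

Plan = Sequence[str]  # a coarse plan as an ordered list of language actions


def rank_then_verify(
    plans: Sequence[Plan],
    plausibility: Callable[[Plan], float],  # stand-in for a world-model score
    verify: Callable[[Plan], bool],         # stand-in for a sim/hardware check
    top_k: int = 3,
) -> Optional[Plan]:
    """Rank plans by predicted plausibility, then return the first that verifies."""
    ranked = sorted(plans, key=plausibility, reverse=True)
    for plan in ranked[:top_k]:
        if verify(plan):
            return plan
    return None  # nothing in the top-k survived verification


if __name__ == "__main__":
    plans = [
        ["push the mug", "open the drawer"],
        ["open the drawer", "push the mug"],
        ["open the drawer"],
    ]
    # Toy stand-ins: shorter plans score higher; only plans that open the
    # drawer first pass the check.
    chosen = rank_then_verify(
        plans,
        plausibility=lambda p: 1.0 / len(p),
        verify=lambda p: p[0] == "open the drawer",
    )
    print(chosen)  # ['open the drawer']
```

Keeping verification separate means a world-model error on contact-rich steps costs one rejected candidate rather than a failed execution.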

Benchmark snapshot

[Figure: Qwen Robot Suite benchmark radar charts for navigation, manipulation, and world modelling]

Agents + robots: composition patterns

Pattern | Components | Example
Embodied QA | Planner + RobotNav | Find an object in a cafe and answer in language
Long-horizon tidy | VLM planner + RobotManip | Multi-step desk organisation
Imagine then act | RobotWorld → Manip/Nav | Preview futures, execute shortest safe plan
Chat2Robot / harness | General Qwen + suite tools | Qwen-RobotClaw-style gateways for builders
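In agent terms, the patterns above reduce to a planner routing each sub-task to one of three tools. The sketch below shows a minimal registry-and-dispatch shape. The tool functions are stubs that return strings, and none of the names correspond to a published Qwen API.

```python
from typing import Callable, Dict

TOOLS: Dict[str, Callable[[str], str]] = {}


def tool(task_type: str) -> Callable[[Callable[[str], str]], Callable[[str], str]]:
    """Register a function as the handler for one physical task type."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[task_type] = fn
        return fn
    return register


@tool("navigate")
def robot_nav(goal: str) -> str:
    return f"RobotNav: waypoints for '{goal}'"  # stub, not a real model call


@tool("manipulate")
def robot_manip(goal: str) -> str:
    return f"RobotManip: delta poses for '{goal}'"  # stub


@tool("imagine")
def robot_world(goal: str) -> str:
    return f"RobotWorld: predicted future for '{goal}'"  # stub


def dispatch(task_type: str, goal: str) -> str:
    """Route a planner-chosen task type to its registered tool."""
    if task_type not in TOOLS:
        raise KeyError(f"no tool registered for task type {task_type!r}")
    return TOOLS[task_type](goal)


if __name__ == "__main__":
    print(dispatch("navigate", "find the green umbrella"))
    print(dispatch("manipulate", "tidy the desk"))
```

A registry like this keeps the planner's prompt and the tool list in one place, so adding a fourth embodied tool does not require touching the routing code.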

Alibaba Cloud is running pilot programmes with enterprise customers alongside open blogs and model cards. RobotManip and RobotNav have public GitHub releases in third-party coverage; RobotWorld is primarily paper- and blog-backed at launch.

Summary

Dimension | Takeaway
Problem | General LLMs lack durable physical interfaces; bespoke robot stacks do not compose with agents
Suite answer | Three specialised models with shared Qwen ecosystem and language-first APIs
Standout metrics | 76.5% VLN-CE RxR; ~45% Table30-v1; EWMBench / DreamGen #1 for world model
Data | 15.6M nav samples; 38.1k h manipulation; 200M+ world-model frames
Builder move | Expose Nav/Manip/World as tools behind your existing Qwen agent orchestration

References