Categories
News

Claude Code vs Codex vs Antigravity: 3 Operating Models of Agentic Software Engineering

The latest discussion around Claude Code, Codex, and Antigravity is not about which model is “best” in isolation. It is about which operating model fits your engineering workflow: interactive control, delegated execution, or agent-first orchestration.

This post includes a recreated infographic built from the shared LinkedIn analysis and official product documentation.

Recreated infographic comparing Claude Code, Codex, and Antigravity operating models

Why this comparison matters

We are moving from code completion to task execution. That shift changes the main decision from “which assistant writes code fastest” to “which control plane helps my team ship safely and consistently”.

Operating model breakdown

| Tool | Primary mode | Best fit | Main trade-off |
| --- | --- | --- | --- |
| Claude Code | Interactive control | Engineers who want close supervision inside existing repo workflows. | Higher operator involvement per task. |
| Codex | Delegated execution | Teams that batch and hand off larger tasks for asynchronous completion. | Needs strong review gates after execution. |
| Antigravity | Agent-first orchestration | Teams managing multiple agents across editor, terminal, and browser surfaces. | Requires mature orchestration and governance habits. |

Evidence from official docs

  • Claude Code docs emphasise context-window management and starting fresh sessions when conversation quality degrades.
  • Codex AGENTS.md docs describe instruction-chain loading at session start, reinforcing a startup-configuration mindset.
  • Google Antigravity launch materials position the product as agent-first, with manager and editor surfaces for asynchronous orchestration.

Decision framework for teams

| Question | If answer is yes | Likely model preference |
| --- | --- | --- |
| Do we need tight, continuous engineer oversight on live code edits? | Keep the engineer deeply in the loop. | Interactive control. |
| Do we prefer to queue work and review completed outputs in batches? | Optimise for delegation throughput. | Delegated execution. |
| Are we building a multi-agent workflow across multiple surfaces? | Optimise for orchestration and coordination. | Agent-first orchestration. |
| Is failure recovery/rollback policy more important than raw generation speed? | Prioritise governance over novelty. | Any model, but with strong control-plane tooling. |

Practical adoption pattern

  • Start with one operating model per team, not all three at once.
  • Define review and rollback protocol before increasing delegation depth.
  • Track context drift and failed handoffs as first-class engineering metrics.
  • Scale agent autonomy only where verification can remain cheap and reliable.

Bottom line: this is less a model race and more an execution-design choice. Teams that choose the right operating model for their delivery system will outperform teams that choose by hype alone.

Sources: LinkedIn post by Brij kishore Pandey, Anthropic Claude Code best practices, OpenAI Codex AGENTS.md guide, and Google Antigravity launch blog.


Managing Handoffs in Multi-Agent Coding Sessions: Fresh Context Without Losing Continuity

Multi-agent coding works best when each session starts with clean context, but teams still need reliable continuity across parallel runs. This post translates that tension into an operational handoff protocol grounded in official Claude Code and Codex guidance plus field patterns from the shared LinkedIn thread.

Managing handoffs in multi-agent coding sessions

What the official docs actually say

| Source | Key statement | Operational meaning |
| --- | --- | --- |
| Anthropic Claude Code best practices | Context window fills fast; if you have corrected the same issue repeatedly, use /clear and start fresh. | Long noisy sessions degrade quality; reset aggressively between unrelated tasks. |
| Anthropic memory docs | Each session starts with a fresh context window; continuity comes from CLAUDE.md and auto memory. | Durable intent should live in files, not chat residue. |
| OpenAI Codex AGENTS.md guide | Instruction chain is assembled at session start from AGENTS.md scopes. | Prompt-state should be treated as startup configuration, not ad-hoc memory. |
| OpenAI Codex features | /clear starts a fresh chat while keeping workflow moving. | Fresh-session discipline is now cross-vendor practice. |

The real failure mode in multi-session work

The hard part is not writing a handoff file. The hard part is quickly proving that a handoff is yours, current, and actionable before you spend tokens and time in the wrong branch. Once teams run parallel sessions, handoffs become distributed-systems state management, not personal notes.

Reference flow for fresh-session handoffs

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
flowchart TD
    A[Session N closes] --> B[Write HANDOFF file]
    B --> C[Set status and next-action]
    C --> D[Record do-not-repeat list]
    D --> E[Commit handoff on same branch]
    E --> F[Session N+1 starts fresh]
    F --> G[Validate branch plus status]
    G --> H[Execute next-action only]
    H --> I[Update handoff status]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class A,F,H,I agent
    class B,C,D,E,G hook
```

Minimal handoff contract that scales

| Field | Required value | Reason |
| --- | --- | --- |
| Filename | HANDOFF_YYYY-MM-DD_branch_topic.md | Instant disambiguation across branches and sessions. |
| status | active, blocked, done, stale | Prevents stale files from pretending to be live. |
| branch | Exact git branch name | Stops cross-branch drift and wrong-start edits. |
| goal | One sentence outcome | Keeps scope narrow in fresh sessions. |
| next-action | Single first executable step | Removes startup hesitation. |
| do-not | Bulleted list of already-tried dead ends | Avoids repeating failed loops and token waste. |

Example handoff template

```markdown
---
status: active
branch: cursor/handoff-protocol
goal: ship deterministic handoff validation for parallel coding sessions
next-action: run handoff validator and fix filename collisions
---

## Context
- Last completed step: parser now reads frontmatter and branch fields.
- Current blocker: stale files from old branches still match topic names.

## Do-not
- Do not reuse topic-only filenames.
- Do not trust undated HANDOFF.md at repo root.
- Do not start implementation before branch and status check pass.

## Evidence
- Validation command: npm run handoff:check
- Failing file: handoffs/HANDOFF_auth.md
```
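A check behind a command like the template's handoff:check can be a few dozen lines. The sketch below validates the filename pattern, required frontmatter fields, and branch match from the contract table before any tool run; the helper names and the seven-day staleness window are illustrative assumptions, not part of any official tooling.

```python
import re
from datetime import date, datetime

REQUIRED = {"status", "branch", "goal", "next-action"}
VALID_STATUS = {"active", "blocked", "done", "stale"}
NAME_RE = re.compile(r"^HANDOFF_(\d{4}-\d{2}-\d{2})_[\w-]+_[\w-]+\.md$")

def parse_frontmatter(text):
    """Read the key: value block between the leading '---' fences."""
    match = re.match(r"^---\n(.*?)\n---\n", text, re.S)
    if not match:
        return None
    fields = {}
    for line in match.group(1).splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields

def validate_handoff(filename, text, current_branch, today=None, max_age_days=7):
    """Return a list of problems; an empty list means the handoff is usable."""
    problems = []
    name = NAME_RE.match(filename)
    if not name:
        problems.append("filename must match HANDOFF_YYYY-MM-DD_branch_topic.md")
    fields = parse_frontmatter(text)
    if fields is None:
        return problems + ["missing frontmatter block"]
    missing = REQUIRED - fields.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if fields.get("status") not in VALID_STATUS:
        problems.append(f"bad status: {fields.get('status')!r}")
    if fields.get("branch") and fields["branch"] != current_branch:
        problems.append("branch mismatch: refuse to start")
    # stale auto-flag: an old file still marked active is suspect
    if name and today is not None:
        age = (today - datetime.strptime(name.group(1), "%Y-%m-%d").date()).days
        if age > max_age_days and fields.get("status") == "active":
            problems.append(f"stale: {age} days old but still marked active")
    return problems
```

Running this as the first step of every fresh session turns the handoff from prose into typed state: the session either gets an empty problem list and executes next-action, or it stops before spending tokens on the wrong branch.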

Risk matrix for teams running parallel agent sessions

| Failure mode | Typical symptom | Control |
| --- | --- | --- |
| Stale state reuse | Session starts from outdated assumptions | Mandatory status field plus stale auto-flag after inactivity window. |
| Ownership ambiguity | Agent edits wrong branch or abandoned workstream | Filename includes branch plus date; verify before first command. |
| Instruction bloat | Long guidance files reduce adherence | Keep persistent rules concise in CLAUDE.md or AGENTS.md; move task detail to handoff file. |
| Correction loops | Repeated fixes in degraded context | Reset with fresh session and carry only validated handoff contract. |
| Duplicate effort | Same fix is rebuilt in multiple sessions | Maintain strict do-not section and last-completed evidence links. |

Practical operating routine

  • Start each major work item in a fresh session.
  • Load durable project rules from CLAUDE.md or AGENTS.md only.
  • Use one active handoff per branch and topic.
  • Require status and next-action before any tool run.
  • Archive or mark stale handoffs quickly to reduce ambiguity.

Bottom line: fresh sessions do not conflict with continuity; they require continuity to move out of chat history and into operational artefacts. Treat handoffs as typed state, not prose, and multi-agent coding becomes predictable instead of fragile.

Sources: Anthropic best practices, Anthropic memory docs, OpenAI Codex AGENTS.md guide, OpenAI Codex features, and the shared LinkedIn post by Deepak Dhingra.


Dive into Claude Code: In-Depth Research Analysis of Agent Harness Architecture

This deep research post analyses the paper Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems, cross-checks the companion repository, and translates the findings into practical architecture guidance for teams building autonomous coding agents.

Source visual from the LinkedIn thread

LinkedIn visual summarising Claude Code harness insights


Executive findings

| Finding | What it means in practice |
| --- | --- |
| Core loop is simple | The agentic heart is a while-loop: assemble context, call model, run tools, repeat. |
| Most complexity sits in the harness | The published analysis repeatedly highlights that orchestration systems, not raw model calls, dominate implementation complexity. |
| Safety is layered | Permission modes, deny-first checks, hooks, classifier support, and sandbox boundaries are combined as defence-in-depth. |
| Context is treated as a hard resource | Multi-stage compaction runs before model calls to preserve task continuity under long sessions. |
| Architecture is context-dependent | The OpenClaw contrast shows no universal blueprint: deployment context changes the right safety and orchestration choices. |

Architecture evidence from the paper figures

Seven-component system structure from the paper

Figure reading: the paper decomposes the system into seven components: user, interfaces, agent loop, permission system, tools, state and persistence, and execution environment. This supports the claim that production agent quality depends on integration quality across these boundaries.

Runtime turn flow from the paper

Figure reading: each turn routes through context assembly, model call, tool dispatch, permission gate, and execution feedback. This makes the permission and tool pipeline first-class runtime infrastructure, not side features.

Layered subsystem architecture from the paper

Figure reading: the five-layer view clarifies why agent products are difficult to reproduce by only cloning a prompt loop; surface, safety/action, memory, and runtime layers have cross-cutting interactions.

Deep analysis: what this paper adds technically

| Area | Paper insight | Engineering implication |
| --- | --- | --- |
| Design philosophy | Maps five human values to thirteen design principles. | Architecture decisions should be policy-backed, not only benchmark-backed. |
| Execution model | One shared query loop across interfaces. | Unifying execution paths lowers mode-specific drift and improves debuggability. |
| Permission posture | Deny-first with escalation and multi-mode trust spectrum. | Use explicit trust transitions instead of static global permissions. |
| Context engineering | Five-stage compaction before model calls. | Treat token budget as infrastructure capacity planning, not prompt formatting. |
| Extensibility | Multiple mechanisms (MCP, plugins, skills, hooks). | Not all extension surfaces should have equal context or safety cost. |
| Persistence | Append-oriented session state and resumability patterns. | Auditability and replayability should be built in from day one. |
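The context-engineering insight, treating token budget as capacity planning, becomes concrete once compaction is staged and deterministic. The sketch below is an illustrative two-stage pass, not the paper's five-stage pipeline; the four-characters-per-token heuristic and the stage ordering are assumptions.

```python
def rough_tokens(text):
    # crude proxy: roughly four characters per token; a real harness
    # would use its own tokenizer here
    return max(1, len(text) // 4)

def compact(messages, budget):
    """Two-stage compaction sketch: keep recent turns verbatim, collapse
    older turns to summary stubs, then drop stubs as a last resort.
    The newest message is always kept."""
    kept = list(messages)
    i = 0
    # Stage 1: stub out the oldest turns until the budget is met
    while sum(rough_tokens(m) for m in kept) > budget and i < len(kept) - 1:
        kept[i] = "[summary] " + kept[i][:40]
        i += 1
    # Stage 2: if still over budget, drop stubbed turns, oldest first
    while sum(rough_tokens(m) for m in kept) > budget and len(kept) > 1:
        kept.pop(0)
    return kept
```

A real harness would replace the `[summary]` stub with a model-written summary; the point is that compaction runs before every model call under an explicit budget, so context is planned capacity rather than an accident of prompt length.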

Claude Code vs OpenClaw: architectural contrast

| Dimension | Claude Code side | OpenClaw side | Why this matters |
| --- | --- | --- | --- |
| Runtime shape | Single coding-focused loop | Gateway-style control plane | Different product goals drive different system boundaries. |
| Safety granularity | Per-action safety evaluation and permission layers | More perimeter-style deployment assumptions | Risk control is environment-sensitive. |
| Context strategy | Compaction and context-window management | Gateway-wide capability registration model | Memory strategy follows deployment surface. |
| User workflow | Repository session depth | Multi-channel continuity | Interface scope changes persistence and orchestration needs. |
| Extensibility pressure | Coding-tool depth and local execution control | Channel breadth and service integration | Extension ecosystems need different governance defaults. |

Critical reading notes and limitations

  • The study is a source-level architectural analysis, not a controlled benchmark showing task win rates across multiple production datasets.
  • The paper reports design mappings and subsystem decomposition; teams still need domain-specific evaluation for code quality, latency, and security outcomes.
  • Comparative conclusions are strongest at architecture level and should not be over-interpreted as absolute performance rankings.
  • For adoption decisions, pair this analysis with your own offline replay tests and guarded live trials.

Practical blueprint for agent builders

| Priority | Implementation action |
| --- | --- |
| 1 | Separate model reasoning from deterministic enforcement and execution surfaces. |
| 2 | Adopt deny-first permissions with progressive trust modes and explicit escalation paths. |
| 3 | Build layered context compaction before scale testing long-horizon tasks. |
| 4 | Use extension tiers with clear context and risk budgets (hooks, skills, tool bridges). |
| 5 | Persist append-first execution traces to support debugging, audit, and replay. |
| 6 | Evaluate architecture quality on both productivity and human comprehension retention. |
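Priority 2, deny-first permissions with progressive trust, can be sketched as a small gate in front of every tool dispatch. The mode names, tool names, and escalation rule below are illustrative stand-ins, not Claude Code's actual permission tables.

```python
from enum import Enum

class Mode(Enum):
    # progressive trust modes, loosely modelled on a permission spectrum
    READ_ONLY = 0
    ASK = 1
    AUTO_EDIT = 2

# deny-first: nothing runs unless a rule explicitly allows it for the mode
ALLOW = {
    Mode.READ_ONLY: {"read_file", "grep"},
    Mode.ASK: {"read_file", "grep", "edit_file", "run_tests"},
    Mode.AUTO_EDIT: {"read_file", "grep", "edit_file", "run_tests", "git_commit"},
}
NEEDS_CONFIRM = {"edit_file", "git_commit"}  # escalate these to a human in ASK mode

def gate(action, mode, confirmed=False):
    """Return 'allow', 'confirm', or 'deny' for a requested tool action."""
    if action not in ALLOW[mode]:
        return "deny"      # default posture: deny anything unlisted
    if mode is Mode.ASK and action in NEEDS_CONFIRM and not confirmed:
        return "confirm"   # explicit escalation path to the operator
    return "allow"
```

The design point is that the gate is deterministic code outside the model: trust transitions happen by switching mode, never by the model talking its way past a static allowlist.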

Bottom line: this paper is important because it reframes agent quality: model capability is necessary, but production value is decided by harness architecture. If two teams use similar base models, the team with better permissions, context engineering, execution isolation, and persistence design will usually ship the safer and more reliable agent product.


Stop Chunking, Start Reasoning: Why Vectorless RAG Is Quietly Winning

Chunk-heavy pipelines can lose document context, while vectorless retrieval keeps structure intact and routes reasoning to the right section before answer generation.

Architecture comparison

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
flowchart LR
    A[Document] --> B[Chunking]
    B --> C[Embedding]
    C --> D[Vector DB]
    D --> E[Top K chunks]
    E --> F[Answer generation]

    G[Document] --> H[Structured index]
    H --> I[Query routing]
    I --> J[LLM reasoning]
    J --> K[Precise section]
    K --> L[Answer generation]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class A,B,E,F,G,L agent
    class C,D,H,I,K hook
    class J decision
```

Traditional vector RAG vs vectorless RAG

| Dimension | Traditional vector RAG | Vectorless RAG |
| --- | --- | --- |
| Core retrieval unit | Small chunks | Document structure and sections |
| Context handling | Context can fragment across chunks | Context remains intact |
| Infra requirement | Embeddings plus vector database | No embedding model required |
| Routing strategy | Similarity search to top K chunks | Hierarchical navigation plus reasoning |
| Claimed accuracy in shared visual | ~50 percent | 98.7 percent |

Where vectorless RAG fits best

  • Regulatory and filing-heavy workflows where section boundaries matter.
  • Legal contracts where exact clause retrieval is more important than semantic proximity.
  • Multi-page reports where preserving hierarchy improves auditability and trust.

Implementation checklist

| Step | Action |
| --- | --- |
| 1. Parse structure | Extract headings, sections, and nested references into a structured index. |
| 2. Route query | Use query intent to navigate the index before generating text. |
| 3. Ground answer | Generate from the selected section, then cite exact source segments. |
| 4. Evaluate | Track exact-match and section-level retrieval accuracy, not only semantic relevance. |
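The first two checklist steps fit in a few lines. This toy version substitutes keyword overlap for the LLM reasoning step in routing; the heading-based parser and the scoring rule are illustrative assumptions, not a production design.

```python
def build_index(doc):
    """Parse markdown-style headings into a flat section index (title -> body)."""
    index, title, lines = {}, None, []
    for line in doc.splitlines():
        if line.startswith("#"):
            if title is not None:
                index[title] = "\n".join(lines).strip()
            title, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if title is not None:
        index[title] = "\n".join(lines).strip()
    return index

def route(query, index):
    """Pick the section whose title best overlaps the query terms.
    A production system would let the LLM reason over the table of
    contents; bag-of-words overlap is a stand-in for that step."""
    terms = set(query.lower().split())
    best = max(index, key=lambda title: len(terms & set(title.lower().split())))
    return best, index[best]
```

Routing a query such as "termination notice period" over an indexed contract hands only the matched clause to the generator, so the answer cites an exact section instead of top-K chunk fragments.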

Bottom line: if your domain rewards precision over broad similarity, vectorless RAG can reduce infrastructure complexity while improving answer faithfulness. Benchmark numbers above are reported from the shared comparison visual and should be validated on your own corpus.

Source visual

LinkedIn infographic about Claude Code harness engineering and context management

gh skill: install, pin, and publish agent skills from GitHub repos

GitHub CLI now includes gh skill (preview): a small package-manager-style surface to discover, pin, install, update, and publish agent skills straight from repositories—skills that follow the open Agent Skills specification and can target multiple coding-agent hosts from one workflow.

What “agent skills” are here

Skills bundle portable instructions, scripts, and assets so an agent host knows how to carry out a repeatable task (documentation style, release checklist, internal API patterns). Installations land in the correct host-specific directory; you can scope to user or project and select the host explicitly when it is not the default.

Flow from repository to running agent

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
flowchart LR
    R[Skills repo on GitHub] --> G[gh skill]
    G --> M[Provenance in SKILL.md frontmatter]
    M --> H{Agent host}
    H --> C[Copilot]
    H --> D[Claude Code / Cursor / Codex / Gemini CLI / …]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class R agent
    class G hook
    class M decision
    class C hook
    class D hook
```

Prerequisites

Upgrade GitHub CLI to v2.90.0 or newer before the subcommands appear. The feature ships as public preview and may change without notice.

Commands you will use daily

| Command | Purpose |
| --- | --- |
| gh skill search <query> | Discover skills by keyword. |
| gh skill install OWNER/REPO | Browse a repository interactively and install chosen skills. |
| gh skill install OWNER/REPO skill-name | Install one named skill; append @tag or @commitsha for a fixed ref. |
| gh skill preview OWNER/REPO skill-name | Inspect content before it ever touches your machine; treat as mandatory for third-party repos. |
| gh skill update / --all | Refresh installed skills using provenance metadata and remote tree comparison. |
| gh skill publish / --fix / --dry-run | Validate skills for publishing; optional auto-fix of metadata. |
```shell
# Examples (from upstream documentation)
gh skill install github/awesome-copilot
gh skill install github/awesome-copilot documentation-writer@v1.2.0
gh skill install github/awesome-copilot documentation-writer@abc123def
gh skill install github/awesome-copilot documentation-writer \
  --agent claude-code --scope user
```

Pinning and provenance

Skills are executable policy: treat them like packages. --pin locks a skill to a tag or commit so broad update runs skip it until you deliberately bump the pin. Installs record the git tree SHA of the source directory; updates compare local and remote trees so you see real content drift, not cosmetic version bumps. Metadata is written into SKILL.md frontmatter so provenance travels if you copy the folder between machines or repos.

Publishing checklist for maintainers

| Control | Why it matters |
| --- | --- |
| Immutable releases | Tag-based installs stay byte-stable even if a repo is later compromised. |
| Secret scanning / code scanning | Publish path surfaces repo hygiene recommendations before skills go public. |
| Spec validation | gh skill publish checks against the public Agent Skills specification so hosts parse files predictably. |

Security posture (non-negotiable)

Upstream documentation is explicit: skills are not verified by GitHub and may contain prompt injection, concealed instructions, or hostile scripts. Use gh skill preview, read diffs like production code, and only install from organisations you trust—especially in CI images where a poisoned skill could exfiltrate tokens on the next agent run.

Supported agent hosts (install flag)

| Host | Example flag |
| --- | --- |
| GitHub Copilot | Default path; no extra --agent required for typical Copilot installs. |
| Claude Code | --agent claude-code |
| Cursor | --agent cursor |
| Codex | --agent codex |
| Gemini CLI | --agent gemini |
| Antigravity | --agent antigravity |

Alias: gh skills maps to the same command group—handy for muscle memory. Run gh skill --help after upgrading gh to see the exact subcommand set on your build.


Agents CLI: scaffolding, evals, and deploy for ADK on Google Cloud

Google’s Agents CLI packages the lifecycle around the open-source Agent Development Kit (ADK): scaffold an ADK Python project, wire tools and orchestration, run evaluation harnesses, and push builds to managed Google Cloud targets—whilst optional “skills” teach coding agents the same workflows end-to-end.

Where Agents CLI sits in the stack

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
flowchart TB
    subgraph dev [Developer machine]
        A[Coding agent with skills] --> B[agents-cli]
        B --> C[ADK Python project]
    end
    C --> D[Local run / Dev UI]
    C --> E[Eval sets + judges]
    E --> F{Ship}
    F --> G[Agent Runtime / Cloud Run / GKE]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class A agent
    class B hook
    class C agent
    class D hook
    class E decision
    class F decision
    class G agent
```

Install and prerequisites

| Requirement | Notes |
| --- | --- |
| Python | 3.11+ |
| uv | Recommended runner for uvx installs |
| Node.js | Required for skills installation path |
| Optional for deploy | Google Cloud SDK, Terraform |
| Platforms | macOS, Linux, Windows via WSL 2; native Windows not officially supported |
```shell
# One-shot setup (installs CLI + skills bundle for coding agents)
uvx google-agents-cli setup

# Alternatives from upstream docs
# pipx install google-agents-cli && agents-cli setup
# pip install google-agents-cli && agents-cli setup
# Skills only: npx skills add google/agents-cli
```

Existing gcloud application-default credentials are picked up automatically when present—useful for iterative deploy loops without pasting keys into the agent transcript.

Bundled skills (what the coding agent learns)

| Skill id | Scope |
| --- | --- |
| google-agents-cli-workflow | End-to-end lifecycle, model choice guardrails, code preservation rules |
| google-agents-cli-adk-code | ADK Python patterns: agents, tools, orchestration, callbacks, state |
| google-agents-cli-scaffold | Project create / enhance / upgrade templates |
| google-agents-cli-eval | Metrics, eval sets, trajectory scoring, LLM-as-judge configuration |
| google-agents-cli-deploy | Agent Runtime, Cloud Run, GKE, CI/CD, secrets handling |
| google-agents-cli-publish | Gemini Enterprise registration flow |
| google-agents-cli-observability | Cloud Trace, structured logging, third-party telemetry hooks |

Core commands

| Command | Purpose |
| --- | --- |
| agents-cli setup | Install CLI plus skills into supported coding agents |
| agents-cli scaffold … | Generate or mutate an ADK project tree |
| agents-cli eval run | Execute configured evaluation passes |
| agents-cli deploy | Ship to selected Google Cloud runtime |
| agents-cli publish gemini-enterprise | Surface the agent inside Gemini Enterprise |
| agents-cli login / --status | Auth against Google Cloud or AI Studio |

Relationship to ADK

ADK remains the framework: code-first agents, multi-agent graphs, rich tool surfaces (functions, OpenAPI tools, Google Cloud connectors), tracing, and “deploy anywhere” containers. Agents CLI is deliberately not a replacement for Gemini CLI, Claude Code, or Codex—it is the factory line those tools drive when they need opinionated Google Cloud paths for scaffolding, evaluation, and promotion to production.

Cloud vs local

| Phase | Cloud account? |
| --- | --- |
| Create, run locally, author eval assets | No; an AI Studio API key suffices for Gemini-backed ADK loops |
| Deploy, central observability, enterprise registry | Yes; project billing, IAM, and runtime choice (Agent Runtime, Cloud Run, GKE) |

Operational checklist

| Risk | Mitigation |
| --- | --- |
| Agent-generated code ownership | Enforce human review on agents-cli scaffold enhance diffs; pin dependency versions. |
| Eval gap before prod | Require agents-cli eval run in CI with frozen eval sets, not only ad-hoc LLM judging. |
| Secret sprawl | Use workload identity + Secret Manager patterns bundled in deploy skill rather than literals in prompts. |
| Token burn | Lean on skills for repetitive ADK API lookup instead of re-explaining the framework each session. |

For teams already standardising on ADK, Agents CLI is best read as glue automation plus teaching artefacts: it shortens the distance between a natural-language brief and a repo that is structured the way Google’s own agent engineers expect—provided you still treat production gates as human-owned.


MiMo-V2-Flash: Xiaomi’s open MoE bet on agents and long context

Xiaomi’s MiMo-V2-Flash is an open-weight mixture-of-experts language model pitched for reasoning, software engineering, and agentic tool use—released with permissive licensing and same-day open inference code so teams can self-host at throughput-oriented budgets.

Scale and routing

| Dimension | Publicly quoted profile |
| --- | --- |
| Parameter budget | On the order of 300B+ total parameters with roughly 15B active per token; typical of large sparse MoE stacks built for quality per flop. |
| Experts | Order of 250+ routed experts with a small subset fired each step; keeps memory bandwidth closer to the active footprint than a dense twin would demand. |
| Licensing | MIT open weights, lowering friction for commercial fine-tunes and downstream redistribution. |
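The quoted routing profile follows the standard sparse-MoE recipe: score all experts per token, activate only a handful. The gating maths below is the generic top-k softmax router, not Xiaomi's published implementation, and the dimensions are toy stand-ins.

```python
import math
import random

def topk_route(hidden, gate_w, k=8):
    """Toy MoE router: score every expert, keep the top-k, and renormalise
    their softmax weights so only k expert MLPs run for this token."""
    # logits[e] = dot(hidden, gate vector for expert e)
    logits = [sum(h * w for h, w in zip(hidden, col)) for col in gate_w]
    top = sorted(range(len(logits)), key=lambda e: logits[e])[-k:]
    peak = max(logits[e] for e in top)           # subtract max for stability
    exps = [math.exp(logits[e] - peak) for e in top]
    total = sum(exps)
    return top, [x / total for x in exps]

random.seed(0)
n_experts, d_model = 256, 64                     # toy stand-ins for the real shapes
gate_w = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]
hidden = [random.gauss(0, 1) for _ in range(d_model)]
experts, weights = topk_route(hidden, gate_w, k=8)
```

This is why the memory-bandwidth story tracks the active footprint: per token, only the k selected expert MLPs are evaluated, while the remaining experts contribute nothing.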

Architecture: hybrid attention for long jobs

The stack interleaves sliding-window attention with periodic global attention blocks—roughly a five-to-one cadence in public diagrams—so most layers attend locally for speed whilst global layers refresh cross-segment context. Reported training targets sit in the tens of thousands of tokens natively, with product messaging extending usable context toward six-figure token lengths for retrieval-heavy agents.

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
flowchart TB
    subgraph layer [Representative hybrid block]
        L1[Sliding-window layers] --> L2[Global attention layer]
    end
    T[Token stream] --> layer
    layer --> R[Router → expert MLPs]
    R --> O[Hidden state out]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class T agent
    class L1 hook
    class L2 decision
    class R hook
    class O agent
```

Post-training: multi-teacher on-policy distillation

The team highlights Multi-Teacher On-Policy Distillation (MOPD)—a recipe meant to dampen the classic post-training seesaw where gains on mathematics or coding can erode safety or vice versa. The idea is to blend specialised teacher signals whilst staying on the student’s own sampling distribution, preserving behaviours that matter for tool-grounded agents.

Serving story

| Topic | What to expect |
| --- | --- |
| Inference stack | Reference kernels and configs shipped for SGLang on launch day; signals that Xiaomi expects the model to run in vLLM-class servers, not only proprietary clouds. |
| Throughput optics | Community benchmarks on modern accelerators quote very high prefill tokens per second when multi-layer multi-token prediction is enabled; treat numbers as hardware-specific until you replicate on your cluster. |
| Marketplaces | Third-party API routers added the model quickly; useful for A/B tests before you commit GPU capital. |

Who should adopt first

MiMo-V2-Flash is aimed at teams that want frontier-shaped capability with open weights and a story centred on agentic coding and long-context retrieval. If you only need lightweight instruction models, the operational cost of hosting a 300B-class MoE will be overkill—benchmark on a slice of your real traces before replatforming.


Qwen3.6-27B: dense hybrid attention and thinking preservation

Alibaba’s Qwen team has shipped Qwen3.6-27B—the first dense open-weight entry in the Qwen 3.6 line—combining a hybrid linear-attention backbone, optional preservation of prior chain-of-thought across turns, and a 262K native context window stretchable into the million-token regime with YaRN.

Weights, licence, and runtimes

| Artifact | Detail |
| --- | --- |
| Hub names | Qwen/Qwen3.6-27B (BF16) and Qwen/Qwen3.6-27B-FP8 fine-grained quantisation (block size 128) |
| Licence | Apache 2.0 |
| Documented stacks | SGLang ≥0.5.10, vLLM ≥0.19.0, KTransformers, Hugging Face Transformers |

Layer geometry

The transformer stacks 64 layers with a repeating 3×(Gated DeltaNet → FFN) + 1×(Gated Attention → FFN) rhythm. Three quarters of sublayers use Gated DeltaNet linear attention (48 value heads / 16 QK heads in the public card), whilst every fourth sublayer uses conventional gated multi-head attention with a reduced KV head count to shrink cache footprint. Feed-forward blocks expand to an intermediate width of 17,408 dimensions.
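The stated geometry is easy to sanity-check: a repeating 3×(DeltaNet) + 1×(attention) rhythm over 64 layers puts exactly three quarters of sublayers on linear attention. A throwaway sketch (the layer labels are informal, not Qwen's module names):

```python
def layer_pattern(n_layers=64, linear_per_block=3):
    """Emit the repeating 3x(Gated DeltaNet) + 1x(gated attention) rhythm."""
    return [
        "attention" if i % (linear_per_block + 1) == linear_per_block else "deltanet"
        for i in range(n_layers)
    ]

layers = layer_pattern()
# three quarters linear attention, one quarter standard attention, as stated
assert layers.count("deltanet") == 48 and layers.count("attention") == 16
```

Only the 16 standard-attention sublayers keep a conventional KV cache, which is where the long-context memory savings come from.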

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
flowchart LR
    subgraph block [One macro-block]
        D1[DeltaNet] --> D2[DeltaNet]
        D2 --> D3[DeltaNet]
        D3 --> A[Gated attention]
        A --> F[FFN]
    end
    IN[Tokens + optional vision] --> block
    block --> MTP[Multi-token prediction head]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class IN agent
    class D1 hook
    class D2 hook
    class D3 hook
    class A decision
    class F hook
    class MTP agent
```

Context and “thinking preservation”

Native context is 262,144 tokens; YaRN scaling is advertised up to 1,010,000 tokens for experimental long-document jobs. For multi-turn agents, the release introduces thinking preservation—an API/template flag that keeps earlier chain-of-thought blocks in the visible history so the model does not pay to re-derive the same scratch work each tool round.

Reported benchmark snapshots

| Benchmark | Score cited in release materials |
| --- | --- |
| SWE-bench Verified | 77.2 |
| SWE-bench Pro | 53.5 (above a 397B-parameter MoE from the prior generation in the same table) |
| Terminal-Bench 2.0 | 59.3 |
| QwenWebBench | 1487 |
| NL2Repo | 36.2 |
| GPQA Diamond / AIME26 / LiveCodeBench v6 | 87.8 / 94.1 / 83.9 |

Deployment notes

Treat the FP8 checkpoint as the default path when VRAM is tight; validate perplexity and tool-call accuracy on your own eval harness because quantisation interacts badly with brittle JSON tool grammars. Pair the model with a sandbox that mirrors Terminal-Bench-style constraints if you plan to expose shell access—benchmark scores do not substitute for hardened ops reviews.


Open models: MiMo-V2-Flash scale meets Qwen3.6-27B agentic density

Two heavyweight open-model lines moved in the same news cycle: a Chinese phone OEM shipped a very large sparse “flash” language model aimed at reasoning, coding, and agentic workloads, whilst Alibaba’s Qwen team landed a dense 27B multimodal stack tuned for repository-scale agents—with both camps emphasising serving efficiency and day-zero inference tooling.

Two release archetypes

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
flowchart TB
    subgraph flash [Sparse flash LLM]
        A1[MoE backbone] --> A2[Hybrid attention]
        A2 --> A3[Post-train distill]
        A3 --> A4[Open weights + SGLang]
    end
    subgraph qwen [Dense hybrid LLM]
        B1[DeltaNet + full attention] --> B2[MTP speculative decode]
        B2 --> B3[Thinking preservation]
        B3 --> B4[HF + vLLM / SGLang]
    end

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class A1 agent
    class A2 hook
    class A3 decision
    class A4 hook
    class B1 hook
    class B2 decision
    class B3 agent
    class B4 hook
```

Sparse “flash” foundation model

| Dimension | Reported shape |
| --- | --- |
| Scale | Order of 300B+ total parameters with on the order of 15B active per token; MoE routing with hundreds of experts and a handful activated per step. |
| Attention | Hybrid layout mixing sliding-window blocks with periodic full-attention blocks; long-context training in the tens of thousands of tokens with extension into six-figure context in product messaging. |
| Training recipe | Multi-teacher on-policy distillation used to reduce the classic post-training trade-off between maths, coding, and safety. |
| Licensing & delivery | Open weights under a permissive licence; inference stacks published alongside launch for high-throughput prefill and multi-layer multi-token prediction decode paths. |
| Ecosystem | Same-day integration in major open inference engines and third-party API marketplaces; signals that vendors expect agent builders to adopt quickly. |

Qwen3.6-27B: dense hybrid for agents

| Dimension | Detail |
| --- | --- |
| Parameters & licence | 27B dense causal LM with vision encoder; Apache 2.0 open weights. |
| Weights on hub | BF16 and fine-grained FP8 (block size 128) variants with near-parity quality. |
| Layer pattern | 64 layers: repeating 3×(Gated DeltaNet → FFN) + 1×(Gated Attention → FFN) — three quarters linear attention, one quarter standard attention for KV-memory savings on long jobs. |
| Decoding | Multi-token prediction trained for speculative decoding at serve time. |
| Context | 262,144 tokens native; YaRN extension advertised up to 1,010,000 tokens — team guidance is to keep ≥128K when relying on extended “thinking” behaviour. |
| Thinking preservation | Optional template flag to retain prior chain-of-thought across turns — aimed at fewer redundant reasoning tokens and better KV reuse in tool loops. |
| Runtime | Documented compatibility floors include SGLang ≥0.5.10, vLLM ≥0.19.0, plus KTransformers and Transformers. |
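The 3:1 linear-to-full-attention layer pattern matters mostly for KV-cache memory: only the standard-attention layers need a growing KV cache, so a 64-layer model with 16 such layers caches roughly a quarter of what an all-attention model would. A back-of-envelope sketch — the KV head count and head dimension below are assumed for illustration, not taken from the model card:

```python
def kv_cache_bytes(attn_layers, tokens, kv_heads, head_dim, bytes_per_elem=2):
    # K and V tensors per attention layer: 2 * tokens * heads * head_dim
    return 2 * attn_layers * tokens * kv_heads * head_dim * bytes_per_elem

total_layers = 64
full_attn_layers = total_layers // 4      # 1 gated-attention block per 4 blocks
ctx = 262_144                             # native context length

dense_baseline = kv_cache_bytes(total_layers, ctx, kv_heads=8, head_dim=128)
hybrid = kv_cache_bytes(full_attn_layers, ctx, kv_heads=8, head_dim=128)
print(hybrid / dense_baseline)            # 0.25 — KV memory roughly quartered
```

The absolute byte counts depend on the assumed head geometry and dtype, but the 4× ratio follows directly from the published layer pattern.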

Reported benchmark highlights (Qwen3.6-27B)

| Benchmark | Reported score | Notes |
| --- | --- | --- |
| SWE-bench Verified | 77.2 | Community autonomous coding bar; team positions it near top proprietary coding models. |
| SWE-bench Pro | 53.5 | Above a 397B-parameter MoE from the prior generation on the same table. |
| Terminal-Bench 2.0 | 59.3 | Heavy sandbox runtime; score quoted on par with a flagship closed coding model in the same write-up. |
| QwenWebBench | 1487 | Internal bilingual web/front-end generation suite — large jump vs earlier 27B baselines. |
| NL2Repo | 36.2 | Repository-level generation metric. |
| Reasoning samples | GPQA Diamond 87.8; AIME26 94.1; LiveCodeBench v6 83.9 | Illustrative reasoning and code competition proxies. |

Why this pairing matters

One line doubles down on extreme MoE scale + throughput optics for frontier-style workloads; the other shows a mid-size dense hybrid can still punch above much larger MoE predecessors on agentic coding tables while staying deployable on commodity GPU farms. Together they reinforce that 2026 competition is as much about inference economics and tooling as about raw parameter counts.


Gemini Embedding 2: multimodal vectors for unified retrieval

Gemini Embedding 2 maps text, images, video, audio, and PDFs into one shared vector space so retrieval, clustering, and recommendations can run cross-modally without maintaining separate encoders per modality.

Flow from content to similarity search

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
flowchart LR
    A[Multimodal input] --> B[Embedding model]
    B --> C[Float vector]
    C --> D[Index / ANN]
    Q[Query embedding] --> D
    D --> R[Ranked matches]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class A agent
    class B hook
    class C decision
    class D hook
    class Q agent
    class R agent

Model identifiers

| Surface | Typical model string | Notes |
| --- | --- | --- |
| Gemini API | gemini-embedding-2-preview | Preview track; check current naming in your SDK. |
| Vertex AI | gemini-embedding-2 | Managed endpoint ID may differ by region — verify in console docs. |

Inputs and practical limits

| Modality | What to expect |
| --- | --- |
| Text | Longer context than the prior text-only embedding family — on the order of 8k tokens for a single embed request. |
| Images | Multiple still images per request (common cap around half a dozen); raster formats such as PNG and JPEG. |
| Video | Short clips (on the order of two minutes) in widely used container formats. |
| Audio | Native audio embedding without forcing an intermediate transcript. |
| Documents | Direct PDF ingestion for small multi-page documents in one call. |

Vector size and Matryoshka (MRL) truncation

Default output length is 3072 floats. The family is trained with Matryoshka Representation Learning: early prefix dimensions remain meaningful, so you can request a smaller output_dimensionality (or equivalent in your client) to cut storage and dot-product cost. Typical choices called out in documentation are 768, 1536, and 3072; supported range is roughly 128–3072.

Normalisation for cosine similarity

Full 3072-dimensional vectors are already L2-normalised. If you truncate to other sizes, apply the same normalisation yourself before comparing directions with cosine similarity.
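The two points above — Matryoshka prefixes stay meaningful, but truncated prefixes lose unit length — combine into one small post-processing step before cosine scoring. A minimal sketch with NumPy (the simulated vector stands in for a real API response):

```python
import numpy as np

def truncate_and_normalise(vec, dim):
    """Keep the first `dim` MRL dimensions and re-apply L2 normalisation.

    The full 3072-d output is already unit length, but a prefix of a
    unit vector generally is not, so renormalise before cosine similarity.
    """
    v = np.asarray(vec[:dim], dtype=np.float32)
    return v / np.linalg.norm(v)

# Simulate an already-normalised 3072-d embedding from the API.
full = np.random.default_rng(1).normal(size=3072)
full /= np.linalg.norm(full)

short = truncate_and_normalise(full, 768)   # common MRL size from the docs
print(round(float(np.linalg.norm(short)), 6))  # 1.0 — safe for cosine again
```

Apply the same `dim` and normalisation consistently across a collection; mixing truncated and full vectors in one index breaks similarity comparisons.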

Versus the earlier text-only embedding model

| Aspect | Prior text embedding (stable family) | Gemini Embedding 2 |
| --- | --- | --- |
| Modalities | Text in, dense vector out. | Text, image, video, audio, PDF in unified space. |
| Typical text token budget | Shorter (on the order of 2k tokens). | Larger (on the order of 8k tokens). |
| MRL sizing | Supported. | Supported with the same dimension trade-off mindset. |
| Best fit | Text-only RAG and classification. | Cross-modal search, mixed media catalogues, multimodal deduplication. |

Integration sketch

# Pseudocode — align field names with your official client (REST or SDK).
request = {
  "model": "gemini-embedding-2-preview",
  "contents": multimodal_parts,  # text + optional image/video/audio/pdf parts
  "config": {
    "output_dimensionality": 768,   # optional; omit for full 3072
    "task_type": "RETRIEVAL_DOCUMENT"  # optional hint where supported
  }
}
vector = embed(request).values

Operational summary

| Check | Action |
| --- | --- |
| Preview drift | Pin model version strings and re-embed corpora when Google promotes a stable ID. |
| Index schema | Store dimensionality and normalisation flag per collection; do not mix truncated and full vectors. |
| Latency cost | Large video or multi-image batches increase wall-clock time — batch asynchronously for backfills. |
| Evaluation | Benchmark recall@k on your own queries; public leaderboards do not replace domain-specific retrieval tests. |
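The recall@k check in the table above needs only a labelled set of (query, relevant document) pairs and your embedded corpus. A self-contained sketch using brute-force cosine search — NumPy stands in for a real ANN index, and the synthetic vectors stand in for actual API embeddings:

```python
import numpy as np

def recall_at_k(doc_vecs, query_vecs, relevant, k=5):
    """Fraction of queries whose relevant doc lands in the top-k matches.

    Assumes all vectors are L2-normalised, so a dot product equals
    cosine similarity.
    """
    scores = query_vecs @ doc_vecs.T               # [num_queries, num_docs]
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = [relevant[q] in topk[q] for q in range(len(query_vecs))]
    return sum(hits) / len(hits)

# Synthetic corpus: each query is a lightly perturbed copy of one doc.
rng = np.random.default_rng(2)
docs = rng.normal(size=(100, 64)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
queries = docs[:10] + 0.01 * rng.normal(size=(10, 64)).astype(np.float32)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
relevant = list(range(10))

print(recall_at_k(docs, queries, relevant, k=5))
```

Swap the synthetic arrays for vectors embedded from your own queries and documents; the metric code itself does not change when you move from brute force to an ANN backend, only the retrieval of `topk`.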

Use Gemini Embedding 2 when a single vector index must answer text-over-image, image-over-text, or document-plus-audio style queries; keep the prior text-only model when your pipeline is strictly linguistic and you want the smallest integration surface.