Categories
News

AI Agent Architecture Explained: The 7-Layer Loop Behind Modern Agent Systems

Across most modern agent frameworks, the implementation details differ but the operational loop is strikingly similar. The practical differences usually come from how deeply each layer is engineered, not from the marketing label of the framework.

AI agent architecture: seven-layer operating loop

flowchart LR
    P[Perceive\nInput, trigger, events]
    M[Remember\nShort-term + long-term memory]
    T[Think\nReasoning over context]
    PL[Plan\nTask decomposition]
    A[Act\nTool/API execution]
    O[Observe\nTracing, logs, metrics]

    P --> M --> T --> PL --> A --> O --> P

    G[Guardrails\nPolicy, approvals, HITL, filters]
    G -. governs .- P
    G -. governs .- M
    G -. governs .- T
    G -. governs .- PL
    G -. governs .- A
    G -. governs .- O

    classDef core fill:#E8F5E9,stroke:#2E7D32,stroke-width:1.5px,color:#1B5E20;
    classDef guard fill:#FFF3E0,stroke:#EF6C00,stroke-width:1.5px,color:#E65100;

    class P,M,T,PL,A,O core;
    class G guard;

What each layer does in practice

  • Perceive: receives user messages, API triggers, or system events.
  • Remember: keeps active state and retrieves persistent knowledge.
  • Think: reasons over input plus memory to choose the next move.
  • Plan: breaks larger goals into executable sub-steps.
  • Act: invokes tools, APIs, code execution, file/database operations.
  • Observe: captures traces, latency, cost, and outcome quality signals.
  • Guardrails (cross-cutting): enforces policy, permission boundaries, and human approval checkpoints.
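The loop above can be sketched in a few lines of Python. Everything here is illustrative (the function and variable names are ours, not any framework's API), and Think and Plan are collapsed into one step to keep the control flow visible:

```python
# Minimal, illustrative sketch of the seven-layer loop. Every name below is
# made up for this example; real frameworks implement each layer in depth.

def run_loop(event, memory, tools, allow, max_steps=3):
    trace = []                                    # Observe: collected signals
    for _ in range(max_steps):
        context = memory.get(event, [])           # Remember
        action = choose_action(event, context)    # Think + Plan (collapsed)
        if action is None:
            break                                 # nothing left to do
        if not allow(action):                     # Guardrails (cross-cutting)
            trace.append(("blocked", action))
            break
        name, arg = action
        result = tools[name](arg)                 # Act: tool execution
        trace.append(("ok", name, result))
        memory.setdefault(event, []).append(result)  # write back to memory
    return trace

def choose_action(event, context):
    # Toy "reasoning": act once, then stop once a result is in memory.
    return None if context else ("echo", event)

# run_loop("hello", {}, {"echo": str.upper}, allow=lambda a: True)
# -> [("ok", "echo", "HELLO")]
```

The point is not the toy logic but the shape: every side-effecting step passes a policy gate, and every step leaves a trace.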

Why this model is useful

Once you recognise the loop, framework selection becomes a trade-off analysis: memory quality, planning reliability, tooling depth, and observability maturity. Production performance usually depends on those layer depths rather than framework naming.

Reference

Original LinkedIn source


Claude Code Weekend Project: 5 Fast Upgrades with Skills, Hooks, MCP, Subagents, and CLAUDE.md

If you want a high-impact Claude Code weekend, focus on a lightweight operating layer rather than heavy architecture. These five upgrades can usually be shipped quickly and improve consistency, safety, and reuse across your repo.

The 5-step weekend upgrade path

flowchart TB
    A[1. Create Skill\nSKILL.md] --> B[2. Add Hook\nPre-commit guardrails]
    B --> C[3. Connect MCP\nExternal tools]
    C --> D[4. Build Subagent\nIsolated heavy tasks]
    D --> E[5. Write CLAUDE.md\nPersistent project memory]

    A:::step
    B:::step
    C:::step
    D:::step
    E:::step

    classDef step fill:#E8F1FF,stroke:#3B82F6,stroke-width:1.5px,color:#0F172A;

1) Create your first Skill

Start with one reusable routine your team repeats often: PR review checklist, deployment checklist, or test-writing standard. A single well-scoped skill reduces repeated prompting and makes outputs more consistent.

2) Add a pre-commit Hook

Hooks enforce non-negotiable rules even when chat context drifts. A simple secret-leak check is a strong first step.

# .claude/hooks/pre-commit.sh
# Block the commit when a staged file is named .env (simple secret-leak guard).
if git diff --cached --name-only | grep -qE '\.env$'; then
  echo "BLOCKED: .env file detected"
  exit 1
fi

3) Connect one MCP server

Pick one integration that removes frequent context switching. GitHub MCP is usually a fast win for reading issues, checking PR context, and tracking CI state from the same working flow.

4) Add one focused subagent

Use subagents for noisy or long-running tasks (test runs, log triage, dependency scans). Keep the main session clean by returning only concise summaries and action items.

5) Keep CLAUDE.md lean and current

  • How to run the project (commands, ports, env requirements)
  • Code conventions that actually matter in your repo
  • What not to touch (legacy/third-party or sensitive areas)
  • Current priorities and active workstreams
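As a concrete, purely illustrative example, a lean CLAUDE.md covering those four bullets might look like this (every path and command below is hypothetical):

```markdown
# CLAUDE.md (project memory)

## Run
- `npm run dev` serves on port 3000; copy `.env.example` to `.env.local` first

## Conventions
- TypeScript strict mode; colocate tests as `*.test.ts`

## Do not touch
- `vendor/` and `legacy/billing/` (third-party and sensitive)

## Current priorities
- Migrating auth to the new session service
```

Short and current beats long and stale: the file is read on every session, so every line should earn its context cost.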

When these five are in place, Claude Code behaves less like an ad hoc assistant and more like an operational engineering system with guardrails and reusable patterns.


Six Core Primitives Behind Most AI Agent Frameworks: A Practical Evaluation Guide

Many agent frameworks look different on the surface but rely on a similar set of building blocks. Treating these as shared primitives can make framework evaluation more practical and less marketing-driven.

Agent framework six primitives chart

Six primitives that show up across agent frameworks

  • Context window: the available working memory per inference.
  • System prompt: always-in-context instructions for identity, constraints, and behaviour.
  • Skills: domain procedures or knowledge loaded on demand.
  • Tool interface: external system access layer (often MCP-style integrations).
  • Memory: persistent state across sessions, frequently paired with retrieval.
  • Sub-agents: parallel or delegated threads to avoid blocking the main loop.
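One way to make the comparison concrete is to record each framework's support for the six primitives in a small structure and score it. This is our own evaluation sketch, not any framework's API:

```python
# Illustrative: the six primitives as a comparison record for framework
# evaluation. Field names are ours, chosen to mirror the list above.
from dataclasses import dataclass

@dataclass
class PrimitiveSupport:
    context_window_tokens: int      # working memory per inference
    system_prompt: bool             # always-in-context instructions
    skills_on_demand: bool          # procedures/knowledge loaded when needed
    tool_interface: str             # e.g. "MCP", "native", "plugins"
    persistent_memory: bool         # state across sessions
    sub_agents: bool                # delegated / parallel threads

def coverage(p: PrimitiveSupport) -> float:
    """Crude score: fraction of the boolean primitives supported."""
    flags = [p.system_prompt, p.skills_on_demand,
             p.persistent_memory, p.sub_agents]
    return sum(flags) / len(flags)

# coverage(PrimitiveSupport(128_000, True, True, "MCP", True, False))  # -> 0.75
```

A scorecard like this forces the "how is it implemented?" question per primitive instead of comparing marketing labels.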

How to evaluate frameworks with this lens

  • Check how each primitive is implemented, not just how it is named.
  • Measure observability and failure recovery around tool calls and memory writes.
  • Validate real-world usability: permissions, debugging flow, and maintainability.
  • Prioritise predictable execution and integration quality over branding.

Reference: https://www.linkedin.com/posts/most-agent-frameworks-are-just-renaming-the-share-7453461567142105088-LWYM/


GenericAgent Explained: Self-Evolving AI Agent Framework with Local System Control

GenericAgent is an open-source autonomous agent framework that focuses on self-evolving skills rather than large preloaded workflows. The project claims a compact core architecture, with a small agent loop plus a minimal set of atomic tools for browser, terminal, filesystem, keyboard/mouse, vision, and ADB-based mobile control.

GenericAgent self-evolving framework overview

Why GenericAgent is getting attention

  • Self-evolving skill tree: solved tasks are converted into reusable skills for later runs.
  • Minimal core: the project positions itself around a compact codebase and lightweight loop design.
  • Cross-model support: designed for multiple LLM providers via OpenAI-compatible endpoints.
  • Cross-platform aim: targets Windows, macOS, Linux, and Android (Termux/ADB workflows).

How the framework works

The main pattern is: execute a new task end-to-end, persist the successful path as a skill, and reuse that skill when a similar request appears. Over time, this can reduce repeated planning overhead and make the same instance more specialised for its owner’s workflows.

According to the project documentation, the design combines layered memory with a small toolset, and uses runtime extension through code execution when new capabilities are required.
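The execute-once / persist / reuse pattern can be sketched in a few lines. The `SkillStore` and its exact-match lookup below are our assumptions for illustration, not GenericAgent's actual implementation:

```python
# Sketch of the skill-accumulation loop: solve a new task end-to-end,
# persist the successful path, reuse it when a similar request reappears.

class SkillStore:
    def __init__(self):
        self._skills = {}                      # task signature -> step list

    def lookup(self, task: str):
        return self._skills.get(task.lower().strip())

    def save(self, task: str, steps: list):
        self._skills[task.lower().strip()] = steps

def handle(task: str, store: SkillStore, solve):
    """Reuse a stored skill when one matches; otherwise solve and persist."""
    cached = store.lookup(task)
    if cached is not None:
        return cached, "reused"                # skips planning overhead
    steps = solve(task)                        # full end-to-end execution
    store.save(task, steps)
    return steps, "solved"
```

A real system would match tasks semantically rather than by normalised string, but the cost profile is the same: planning is paid once per task family, not once per request.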

Practical takeaways for AI agent builders

  • Persistent skill accumulation may be a strong path for long-term personalisation.
  • Smaller core loops can improve maintainability if tool boundaries remain clear.
  • Token-efficiency claims are compelling, but should be benchmarked against your own workload and safety requirements.
  • System-level control requires strong guardrails, approvals, and secure execution policies in production environments.

Project page: github.com/lsdefine/GenericAgent. The repository also links a technical report on arXiv for deeper details on methodology and evaluation.


Claude Organic App Discovery: How AI App Distribution Is Changing in 2026

Claude connectors and app discovery are evolving quickly, and teams should focus on public product capabilities and documented integration behaviour when planning distribution strategy.

Publicly available updates on Claude connectors and discovery

  • Anthropic announced Integrations (remote MCP support) so Claude can connect to external apps and services, not just local desktop MCP servers.
  • Claude introduced a connectors directory for browsing and connecting supported tools.
  • Claude documentation states that directory connectors can appear as in-chat suggested connectors when relevant to a user task.
  • The directory documentation describes ranking as usage-based, similar to app-store style discovery.
  • Official help/docs also highlight security and permission controls, including individual authentication and policy/terms coverage for listed connectors.

Practical implications for teams

  • Focus on connector quality, reliability, and clear task fit.
  • Optimise descriptions/use-cases so users can quickly identify relevance.
  • Track user adoption and successful task completion to improve connector performance over time.
  • Review directory policy and terms before submitting connectors.

Public sources


5 Long-Running AI Agent Design Patterns for Production: State, Memory, Approvals, and Orchestration

Long-running AI agents are moving from short demos to production workloads that run for days. This article breaks down the five design patterns from the source PDF and explains how to apply them when building resilient, governable agent systems.

Agent design patterns for long-running AI agents cover

What this PDF covers

  • Checkpoint-and-Resume for failure recovery
  • Delegated Approval for human gates without losing execution state
  • Memory-Layered Context with identity and governance controls
  • Ambient Processing for event-driven autonomous work
  • Fleet Orchestration for specialist agents coordinated over long horizons

Pattern 1: Checkpoint-and-Resume

The core idea is simple: persist progress at stable boundaries, then resume from the latest checkpoint instead of restarting the full workflow. In practice, this reduces wasted compute and avoids repeating already completed tool calls.

  • Persist state after meaningful batches, not only at the end
  • Store enough context to re-enter the flow deterministically
  • Design retries to be idempotent so resume does not duplicate side effects
Checkpoint-and-resume pattern diagram

Pattern 2: Delegated Approval

For long tasks with compliance or business checkpoints, the agent should pause in place for human review. The key advantage is preserving full execution state while awaiting approval, so the system can restart quickly and continue the exact trajectory.

  • Pause execution at explicit review gates
  • Retain reasoning trail, tool history, and task context
  • Resume only after approve/revise decisions from reviewers
Delegated approval workflow diagram
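The "pause in place" idea can be modelled as a durable wait state: the task raises at a review gate with its state intact, and the exact same call resumes the trajectory once the decision arrives. All names below are illustrative:

```python
# Sketch: human approval as a durable wait state. Execution state lives in
# `task_state`; the run simply stops and restarts around the gate.

class PendingApproval(Exception):
    """Raised at a review gate; the task's state is left untouched."""

def run(task_state, approvals):
    while task_state["steps"]:
        step = task_state["steps"][0]
        if step["needs_approval"] and step["id"] not in approvals:
            raise PendingApproval(step["id"])   # wait for reviewer decision
        task_state["log"].append(step["id"])    # execute the step (stub)
        task_state["steps"].pop(0)
    return task_state["log"]
```

The reasoning trail, tool history, and remaining plan all survive the pause because nothing is discarded at the gate; approval just unblocks the next call to `run`.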

Pattern 3: Memory-Layered Context

Long-running systems need tiered memory instead of a single context bucket. Session memory supports the active task, while persistent memory captures longer-lived learnings. Governance layers decide what can be written, read, and propagated.

  • Separate short-lived session context from durable memory
  • Use agent identity and policy controls for memory access
  • Audit memory mutations to reduce drift and leakage
Memory layered context architecture
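A tiered-memory store with a governance gate might be sketched like this; the two-tier split, the policy callback, and the audit list are our assumptions for illustration:

```python
# Session memory vs durable memory, with a write-policy gate and an audit
# trail recording every durable mutation (and every denial).

class LayeredMemory:
    def __init__(self, can_persist):
        self.session = {}            # short-lived, per-task context
        self.durable = {}            # persistent, cross-session learnings
        self.audit = []              # governance log of durable writes
        self._can_persist = can_persist

    def remember(self, key, value, persist=False):
        self.session[key] = value
        if persist:
            if not self._can_persist(key):       # governance check
                self.audit.append(("denied", key))
                return False
            self.durable[key] = value
            self.audit.append(("written", key))
        return True

    def recall(self, key):
        # Active-task context wins; fall back to durable memory.
        return self.session.get(key, self.durable.get(key))
```

The useful property is that drift and leakage become auditable: anything that crossed into durable memory, or tried to, is in `audit`.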

Pattern 4: Ambient Processing

Not all agents wait for prompts. Ambient agents subscribe to event streams and process work continuously under policy controls. This model fits moderation, monitoring, enrichment, and routing pipelines where latency and consistency matter.

  • Trigger on events from queues or data streams
  • Apply central policy checks before downstream actions
  • Run in secure sandboxes for sustained autonomous operation
Ambient processing event-driven flow
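The event-driven shape reduces to "consume, gate, route". The queue, policy callback, and handler registry below are illustrative stand-ins for a real event bus:

```python
# Sketch of an ambient worker: drain events from a queue, apply a central
# policy check, then dispatch to a type-specific handler.
from queue import Empty, Queue

def drain(events: Queue, policy, handlers):
    """Process queued events until empty; policy runs before any action."""
    handled, rejected = [], []
    while True:
        try:
            event = events.get_nowait()
        except Empty:
            return handled, rejected
        if not policy(event):
            rejected.append(event)        # central policy gate
            continue
        handlers[event["type"]](event)    # downstream action
        handled.append(event)
```

In production this loop would run continuously inside a sandboxed runner; the structure (gate before handler, per-type routing) is what matters for moderation and enrichment pipelines.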

Pattern 5: Fleet Orchestration

Production deployments often involve a coordinator with multiple specialist agents. Each specialist can have different runtime windows, responsibilities, and policy constraints, while a central orchestrator handles delegation and consolidation.

  • Use a coordinator for task decomposition and hand-offs
  • Assign specialists by capability, duration, and risk profile
  • Route all inter-agent communication through a controlled gateway
Fleet orchestration multi-agent architecture
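The coordinator-plus-gateway idea can be sketched as one controlled send function that every delegation passes through. The registry and logging here are our assumptions, not a specific framework:

```python
# Sketch of fleet orchestration: a coordinator decomposes a goal into
# (capability, subtask) pairs and delegates each through a single gateway.

def make_gateway(specialists, log):
    """Single controlled channel: every inter-agent call is logged here."""
    def send(capability, subtask):
        log.append((capability, subtask))       # audit trail / policy hook
        return specialists[capability](subtask)
    return send

def coordinate(goal, send):
    # Naive decomposition: the goal is already a list of capability-tagged
    # subtasks; route each through the gateway and collect results.
    return [send(capability, subtask) for capability, subtask in goal]
```

Funnelling all inter-agent traffic through `send` is what makes per-specialist policy, rate limits, and auditing enforceable from one place.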

Original PDF

The full source PDF is embedded below for direct reading and reference.

https://mer.vin/wp-content/uploads/2026/04/128aae08-agent-design-patterns-long-running.pdf

Practical implementation checklist

  • Define explicit checkpoint boundaries and replay-safe tool calls
  • Model human approval as durable wait states
  • Implement memory governance with identity, policy, and audit logs
  • Adopt event-driven runners for ambient autonomous tasks
  • Use coordinator-specialist orchestration with strict gateway controls

Agent Runtime Providers: 9 Examples Across LLM and Cloud Sandbox Backends

PraisonAI’s managed agents let the agent loop run on one provider and tool execution run on another. A nine-example pack now covers every supported LLM backend and cloud sandbox, each in ~30 lines of agent-centric code.

Two-Layer Architecture

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
    U[User prompt] --> A[Agent]
    A --> L{LLM Provider}
    L -->|Anthropic| L1[Claude]
    L -->|OpenAI| L2[GPT-4o]
    L -->|Gemini| L3[Gemini 2.0]
    L -->|Ollama| L4[Self-hosted]
    A --> C{Cloud Sandbox}
    C -->|E2B| C1[E2B VM]
    C -->|Modal| C2[Modal Function]
    C -->|Fly.io| C3[Fly Machine]
    C -->|Daytona| C4[Daytona Workspace]
    C -->|Docker| C5[Local Container]
    C1 --> R[Tool Result]
    C2 --> R
    C3 --> R
    C4 --> R
    C5 --> R
    R --> A
    A --> O[Answer]

    classDef agent fill:#8B0000,color:#fff
    classDef hook fill:#189AB4,color:#fff
    classDef decision fill:#444,color:#fff

    class A,U,O agent
    class L1,L2,L3,L4,C1,C2,C3,C4,C5,R hook
    class L,C decision

LLM Providers — 4 Examples

The first four examples use the ManagedAgent(provider=...) factory. Anthropic is the only provider where the entire agent loop (LLM, tools, memory, sessions) runs on hosted infrastructure — the others run the loop locally and send LLM calls to the chosen cloud.

File | Provider | Full loop in cloud? | Model | Required env
runtime_anthropic.py | Anthropic | Yes | claude-3-5-sonnet-latest | ANTHROPIC_API_KEY + pip install anthropic
runtime_openai.py | OpenAI | No (LLM only) | gpt-4o-mini | OPENAI_API_KEY
runtime_gemini.py | Google | No (LLM only) | gemini/gemini-2.0-flash-exp | GEMINI_API_KEY
runtime_ollama.py | Self-hosted | No (LLM only) | llama3.2 | Ollama at localhost:11434

Anthropic — full remote loop

from praisonai import Agent, ManagedAgent, ManagedConfig

managed = ManagedAgent(
    provider="anthropic",
    config=ManagedConfig(
        model="claude-3-5-sonnet-latest",
        system="You are a concise coding assistant.",
        name="AnthropicRuntimeAgent",
    ),
)
agent = Agent(name="anthropic-runtime", backend=managed)

print(agent.start("Write a Python one-liner that sums 1..10."))
print(agent.start("Now change it to factorial of 5."))

info = managed.retrieve_session()
print(f"session={managed.session_id}  in={info['usage']['input_tokens']}  out={info['usage']['output_tokens']}")

OpenAI / Gemini / Ollama — swap the provider

from praisonai import Agent, ManagedAgent, LocalManagedConfig

managed = ManagedAgent(
    provider="gemini",  # or "openai", "ollama"
    config=LocalManagedConfig(
        model="gemini/gemini-2.0-flash-exp",
        system="You are a concise assistant.",
        name="GeminiRuntimeAgent",
    ),
)
agent = Agent(name="gemini-runtime", backend=managed)
print(agent.start("What does REST stand for?"))

Cloud Sandbox Providers — 5 Examples

The next five examples use SandboxedAgent(compute=...) — the agent loop runs locally but every tool / code execution is isolated in a cloud sandbox. Same API shape for every provider; swap one string to change the backend.

File | Cloud | compute= | What it gives you | Required env
runtime_e2b.py | E2B | "e2b" | Ephemeral VM, sub-second boot | E2B_API_KEY + pip install e2b
runtime_modal.py | Modal | "modal" | Serverless functions, per-second billing | modal token set
runtime_fly.py | Fly.io | "flyio" | Machines API, global regions, GPU option | FLY_API_TOKEN
runtime_daytona.py | Daytona | "daytona" | Cloud dev workspaces | DAYTONA_API_KEY
runtime_docker.py | Docker (local) | "docker" | Local container isolation for CI / dev | Docker daemon

Same pattern for every cloud

import asyncio
from praisonai import Agent
from praisonai.integrations import SandboxedAgent, SandboxedAgentConfig

async def main():
    sandboxed = SandboxedAgent(
        compute="flyio",  # or "e2b", "modal", "daytona", "docker"
        config=SandboxedAgentConfig(
            model="gpt-4o-mini",
            system="You are a concise coding assistant.",
            name="FlyioRuntimeAgent",
        ),
    )
    agent = Agent(name="fly-runtime", backend=sandboxed)

    info = await sandboxed.provision_compute(
        image="python:3.12-slim", cpu=1, memory_mb=512, idle_timeout_s=120,
    )
    print(f"Machine: {info.instance_id} ({info.status})")

    result = await sandboxed.execute_in_compute("python3 -c 'print(2 ** 20)'")
    print(f"Cloud compute: {result['stdout'].strip()}")

    print("Agent:", agent.start("What is 2**20? Just the number."))

    await sandboxed.shutdown_compute()

asyncio.run(main())

Graceful Auto-Skip — Runs in Any Environment

Every example checks for its required credentials and service availability before heavy imports, then exits 0 with a skip message when they’re missing. You can clone the repo and run all_runtimes.py on any machine without errors.

import os
if not os.getenv("OPENAI_API_KEY"):
    print("[skip] OPENAI_API_KEY not set.")
    raise SystemExit(0)
if not os.getenv("E2B_API_KEY"):
    print("[skip] E2B_API_KEY not set.")
    raise SystemExit(0)

Summary

Metric | Value
Total examples | 9 (4 LLM + 5 cloud)
Lines per example | 25 – 55
Cold run (all 9, no creds) | 1.3 s (every example auto-skips)
Core SDK regression | 289 / 289 agent tests pass
API surface | Change one string to swap cloud / LLM

Run It

# All 9 (skip mode, no creds):
python examples/python/managed-agents/provider/all_runtimes.py

# Run individual examples with real creds:
OPENAI_API_KEY=sk-... python .../runtime_openai.py
ANTHROPIC_API_KEY=sk-... python .../runtime_anthropic.py
GEMINI_API_KEY=... python .../runtime_gemini.py
OPENAI_API_KEY=sk-... E2B_API_KEY=... python .../runtime_e2b.py
OPENAI_API_KEY=sk-... FLY_API_TOKEN=... python .../runtime_fly.py

All examples live under examples/python/managed-agents/provider/ in the PraisonAI repo.


DeepSeek V4 Preview Explained: 1M Context Architecture, Benchmarks, Pricing, and Enterprise Adoption Guide

DeepSeek has released a preview of its V4 family. This article summarises the launch and key technical details from official sources.

DeepSeek V4 Preview launch visual from DeepSeek X announcement

What was announced (verified from DeepSeek launch sources)

  • DeepSeek-V4-Pro: 1.6T total parameters, 49B activated (MoE).
  • DeepSeek-V4-Flash: 284B total parameters, 13B activated (MoE).
  • Context window: up to 1M tokens for both models.
  • Launch posture: open-sourced preview, API availability on launch day, and web/app availability.
  • Technical report + open weights: both linked from the launch post and model collection.

Architecture and systems changes that matter

Across the model card and V4 report, DeepSeek attributes the long-context jump to a combined architecture-and-systems stack rather than one isolated trick.

Component | What it does | Claimed impact
CSA + HCA hybrid attention | Combines compressed sparse attention with heavily compressed attention | At 1M context: ~27% single-token FLOPs and ~10% KV cache vs DeepSeek-V3.2
mHC (Manifold-Constrained Hyper-Connections) | Strengthens residual pathways for deeper, stabler signal propagation | Improved stability while preserving model expressivity
Muon optimizer + FP4/FP8 strategy | Faster, stabler training with mixed precision and quantisation-aware paths | Lower cost to train/serve long-context MoE models
Post-training pipeline | Specialist training + on-policy distillation | Combines domain specialists into one unified general model

Model and API economics snapshot

DeepSeek API pricing currently positions Flash as the economical default and Pro as the higher-capability option.

Model | Context | Input (cache hit) / 1M | Input (cache miss) / 1M | Output / 1M
deepseek-v4-flash | 1M | $0.028 | $0.14 | $0.28
deepseek-v4-pro | 1M | $0.145 | $1.74 | $3.48
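To make the pricing concrete, here is a small back-of-envelope cost calculator using the listed per-1M-token rates (preview pricing, so re-check before relying on it; the function and example token counts are ours):

```python
# Per-request cost from the per-1M-token prices listed above.
PRICES = {  # (input cache hit, input cache miss, output), USD per 1M tokens
    "deepseek-v4-flash": (0.028, 0.14, 0.28),
    "deepseek-v4-pro": (0.145, 1.74, 3.48),
}

def request_cost(model, hit_tokens, miss_tokens, output_tokens):
    """Cost in USD for one request, given token counts per price bucket."""
    hit, miss, out = PRICES[model]
    return (hit_tokens * hit + miss_tokens * miss + output_tokens * out) / 1e6

# Example: 800k cached + 200k fresh input, 5k output on Flash
# request_cost("deepseek-v4-flash", 800_000, 200_000, 5_000)  # ~= $0.0518
```

Runs like this make the cache-hit discount tangible: at long contexts, prompt-cache behaviour dominates the bill.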

Compatibility note from DeepSeek docs: deepseek-chat and deepseek-reasoner map to V4-Flash modes and are planned for deprecation.

Performance snapshot from official artefacts

Area | Positive signal from official artefacts | Source note
Coding and agentic tasks | Strong published numbers across LiveCodeBench, SWE, Terminal Bench and Toolathlon slices | Benchmark conditions vary by reasoning mode and harness; see source tables for exact setup
Long context | 1M-context benchmarks and explicit KV/FLOP efficiency claims in report/model card | Results are from DeepSeek release artefacts and should be interpreted with stated evaluation settings
Reasoning mode controls | Flash/Pro with non-think, think-high, think-max paths | Reasoning effort mode and output length affect cost and latency

Adoption playbook (practical)

  • Step 1: Start with deepseek-v4-flash for broad traffic and cost control.
  • Step 2: Route hard tasks (complex coding, deep research, high-stakes workflows) to deepseek-v4-pro.
  • Step 3: Add evaluation gates for long-context faithfulness, tool-use reliability, and cost per resolved task.
  • Step 4: Re-check model pricing and deprecation notices frequently; V4 is in preview-phase cadence.

Minimal API switch example

from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role":"user","content":"Summarise this 300-page contract and flag risk clauses."}]
)
print(resp.choices[0].message.content)

API migration and deprecation details

DeepSeek V4 thread image with API migration and deprecation details
  • API migration detail (6/n): keep the same base_url and switch model IDs to deepseek-v4-pro or deepseek-v4-flash; OpenAI ChatCompletions and Anthropic-style APIs are both supported.
  • Thinking mode note (6/n): both models support dual modes (Thinking / Non-Thinking) with guidance at DeepSeek API docs.
  • Deprecation timeline (6/n): deepseek-chat and deepseek-reasoner are scheduled to retire after 24 Jul 2026, 15:59 UTC, currently mapped to V4-Flash modes.
  • Trust/sourcing note (7/n): DeepSeek explicitly asks users to rely on official DeepSeek channels for announcements.

Primary sources


OpenMythos Explained: Recurrent-Depth Transformer Architecture, Stability, and Parameter Efficiency

OpenMythos is an open-source PyTorch project that presents a reconstruction hypothesis for Claude Mythos, exploring whether recurrent-depth transformers can deliver stronger reasoning depth without linearly scaling parameter count.

OpenMythos recurrent-depth transformer architecture overview

Core architecture idea

The proposed layout is Prelude → Recurrent Block → Coda. The recurrent block is looped multiple times in one forward pass, so depth can be increased at inference time by running more loops.

h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
  • h_t: latent state at loop step t
  • e: encoded input from the Prelude, re-injected every loop
  • A, B: learned matrices controlling carry-over and input injection
  • Looping increases reasoning depth without adding a fresh full layer stack each time
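The recurrence and the halting behaviour can be demonstrated numerically. This sketch uses NumPy with a stand-in for the Transformer block; the stability constraint ρ(A) < 1 is what lets the iteration settle to a fixed point:

```python
# Numeric sketch of h_{t+1} = A·h_t + B·e + f(h_t, e) with adaptive halting.
# f() is a stand-in for the Transformer block; A, B, and the halting
# tolerance are illustrative, not OpenMythos's actual values.
import numpy as np

def loop(h0, e, A, B, f, tol=1e-6, max_depth=64):
    """Iterate the recurrent block, stopping once the state has converged."""
    h = h0
    for depth in range(1, max_depth + 1):
        h_next = A @ h + B @ e + f(h, e)
        if np.linalg.norm(h_next - h) < tol:   # adaptive halting
            return h_next, depth
        h = h_next
    return h, max_depth

# With rho(A) < 1 (e.g. A = 0.5·I) and a contractive f, the state converges,
# so "reasoning depth" becomes a runtime knob rather than a fixed layer count.
```

With f ≡ 0 and A = 0.5·I the fixed point is h = (I − A)⁻¹·B·e, which the loop reaches in a few dozen steps; harder inputs (less contractive dynamics) would simply consume more depth before halting.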

What makes this design interesting

Component | Role | Why it matters
Recurrent-depth looping | Reuses a shared block for multiple steps | Inference-time depth control instead of fixed-depth-only reasoning
MoE feed-forward routing | Activates sparse experts per token and per loop | Higher capacity efficiency with bounded active compute
LTI-style stability constraint (ρ(A) < 1) | Controls hidden-state growth | Reduces residual explosion risk in deep loops
Adaptive halting | Stops looping once a token has converged | Avoids overthinking and saves compute on easier positions
Depth-wise LoRA adapters | Adds small per-depth adaptations | Allows depth-specific behaviour with low parameter overhead

Practical takeaway

If this class of architecture holds up in broader experiments, it suggests a shift from “bigger model only” scaling to a hybrid of moderate parameters + adaptive inference depth. That could improve cost-efficiency for long-horizon reasoning workloads and agentic tasks.

Primary sources


GPT-5.5: All Images and Benchmark Tables (Dedicated Post)

Dedicated extraction post from OpenAI’s Introducing GPT-5.5 page, including every accessible article image and all benchmark tables.

Algebraic geometry surface intersection visual from OpenAI GPT-5.5 post

Topline benchmark snapshot

Columns: GPT-5.5 · GPT-5.4 · GPT-5.5 Pro · GPT-5.4 Pro · Claude Opus 4.7 · Gemini 3.1 Pro. Scores are listed in source order; not every model has a reported score for every metric.

  • Terminal-Bench 2.0: 82.7% · 75.1% · 69.4% · 68.5%
  • Expert-SWE (Internal): 73.1% · 68.5%
  • GDPval (wins or ties): 84.9% · 83.0% · 82.3% · 82.0% · 80.3% · 67.3%
  • OSWorld-Verified: 78.7% · 75.0% · 78.0%
  • Toolathlon: 55.6% · 54.6% · 48.8%
  • BrowseComp: 84.4% · 82.7% · 90.1% · 89.3% · 79.3% · 85.9%
  • FrontierMath Tier 1–3: 51.7% · 47.6% · 52.4% · 50.0% · 43.8% · 36.9%
  • FrontierMath Tier 4: 35.4% · 27.1% · 39.6% · 38.0% · 22.9% · 16.7%
  • CyberGym: 81.8% · 79.0% · 73.1%

Coding

Models compared: GPT-5.5 · GPT-5.4 · GPT-5.5 Pro · GPT-5.4 Pro · Claude Opus 4.7 · Gemini 3.1 Pro (scores in source order)

  • SWE-Bench Pro (Public) *: 58.6% · 57.7% · 64.3%* · 54.2%
  • Terminal-Bench 2.0: 82.7% · 75.1% · 69.4% · 68.5%
  • Expert-SWE (Internal): 73.1% · 68.5%

Professional

Models compared: GPT-5.5 · GPT-5.4 · GPT-5.5 Pro · GPT-5.4 Pro · Claude Opus 4.7 · Gemini 3.1 Pro (scores in source order)

  • GDPval (wins or ties): 84.9% · 83.0% · 82.3% · 82.0% · 80.3% · 67.3%
  • FinanceAgent v1.1: 60.0% · 56.0% · 61.5% · 64.4% · 59.7%
  • Investment Banking Modeling Tasks (Internal): 88.5% · 87.3% · 88.6% · 83.6%
  • OfficeQA Pro: 54.1% · 53.2% · 43.6% · 18.1%

Computer use and vision

Models compared: GPT-5.5 · GPT-5.4 · GPT-5.5 Pro · GPT-5.4 Pro · Claude Opus 4.7 · Gemini 3.1 Pro (scores in source order)

  • OSWorld-Verified: 78.7% · 75.0% · 78.0%
  • MMMU Pro (no tools): 81.2% · 81.2% · 80.5%
  • MMMU Pro (with tools): 83.2% · 82.1%

Tool use

Models compared: GPT-5.5 · GPT-5.4 · GPT-5.5 Pro · GPT-5.4 Pro · Claude Opus 4.7 · Gemini 3.1 Pro (scores in source order)

  • BrowseComp: 84.4% · 82.7% · 90.1% · 89.3% · 79.3% · 85.9%
  • MCP Atlas**: 75.3% · 70.6% · 79.1% · 78.2%
  • Toolathlon: 55.6% · 54.6% · 48.8%
  • Tau2-bench Telecom*** (original prompts): 98.0% · 92.8%

Academic

Models compared: GPT-5.5 · GPT-5.4 · GPT-5.5 Pro · GPT-5.4 Pro · Claude Opus 4.7 · Gemini 3.1 Pro (scores in source order)

  • GeneBench: 25.0% · 19.0% · 33.2% · 25.6%
  • FrontierMath Tier 1–3: 51.7% · 47.6% · 52.4% · 50.0% · 43.8% · 36.9%
  • FrontierMath Tier 4: 35.4% · 27.1% · 39.6% · 38.0% · 22.9% · 16.7%
  • BixBench: 80.5% · 74.0%
  • GPQA Diamond: 93.6% · 92.8% · 94.4% · 94.2% · 94.3%
  • Humanity’s Last Exam (no tools): 41.4% · 39.8% · 43.1% · 42.7% · 46.9% · 44.4%
  • Humanity’s Last Exam (with tools): 52.2% · 52.1% · 57.2% · 58.7% · 54.7% · 51.4%

Cybersecurity

Models compared: GPT-5.5 · GPT-5.4 · GPT-5.5 Pro · GPT-5.4 Pro · Claude Opus 4.7 · Gemini 3.1 Pro (scores in source order)

  • Capture-the-Flags challenge tasks (Internal)****: 88.1% · 83.7%
  • CyberGym: 81.8% · 79.0% · 73.1%

Long context

Models compared: GPT-5.5 · GPT-5.4 · GPT-5.5 Pro · GPT-5.4 Pro · Claude Opus 4.7 · Gemini 3.1 Pro (scores in source order)

  • Graphwalks BFS 256k f1: 73.7% · 62.5% · 76.9%
  • Graphwalks BFS 1mil f1: 45.4% · 9.4% · 41.2% (Opus 4.6)
  • Graphwalks parents 256k f1: 90.1% · 82.8% · 93.6%
  • Graphwalks parents 1mil f1: 58.5% · 44.4% · 72.0% (Opus 4.6)
  • OpenAI MRCR v2 8-needle 4K–8K: 98.1% · 97.3%
  • OpenAI MRCR v2 8-needle 8K–16K: 93.0% · 91.4%
  • OpenAI MRCR v2 8-needle 16K–32K: 96.5% · 97.2%
  • OpenAI MRCR v2 8-needle 32K–64K: 90.0% · 90.5%
  • OpenAI MRCR v2 8-needle 64K–128K: 83.1% · 86.0%
  • OpenAI MRCR v2 8-needle 128K–256K: 87.5% · 79.3% · 59.2%
  • OpenAI MRCR v2 8-needle 256K–512K: 81.5% · 57.5%
  • OpenAI MRCR v2 8-needle 512K–1M: 74.0% · 36.6% · 32.2%

Abstract reasoning

Models compared: GPT-5.5 · GPT-5.4 · GPT-5.5 Pro · GPT-5.4 Pro · Claude Opus 4.7 · Gemini 3.1 Pro (scores in source order)

  • ARC-AGI-1 (Verified): 95.0% · 93.7% · 94.5% · 93.5% · 98.0%
  • ARC-AGI-2 (Verified): 85.0% · 73.3% · 83.3% · 75.8% · 77.1%

Source

https://openai.com/index/introducing-gpt-5-5/