
GLM-5 Serving Reliability: KV Cache Race Fixes, HiCache Synchronization, and LayerSplit Throughput Gains

This article explains how Z.ai debugged GLM-5 serving issues that appeared only under high-concurrency, long-context coding-agent workloads, and what fixes restored correctness and improved throughput.

Core production reliability challenge

Scaling laws continue to increase model capability, but production reliability depends on solving infrastructure-level scaling problems, especially KV cache correctness and synchronization.

Speculative decoding anomaly signals

Reproducing anomalies under real load

  • Observed anomalies: garbled outputs, repetition, rare-character generation.
  • They appeared only under high-concurrency and long-context coding-agent traffic.
  • Speculative decoding metrics became detection signals: very low spec_accept_length and very high spec_accept_rate.
  • A practical online guardrail was added to terminate and retry suspicious generations based on these metrics; a sketch of the idea follows below.
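
A minimal Python sketch of such a guardrail. The metric names come from the post, but the thresholds and wiring are purely illustrative, not Z.ai's actual values:

from dataclasses import dataclass

@dataclass
class SpecDecodeMetrics:
    spec_accept_length: float  # mean accepted draft tokens per decode step
    spec_accept_rate: float    # fraction of drafted tokens accepted

def looks_anomalous(m: SpecDecodeMetrics,
                    min_len: float = 0.5,
                    max_rate: float = 0.98) -> bool:
    # The reported signature: very low accept length together with a very
    # high accept rate. Thresholds here are placeholders.
    return m.spec_accept_length < min_len and m.spec_accept_rate > max_rate

def guarded_generate(generate, request, max_retries: int = 2):
    # generate(request) must return (output, SpecDecodeMetrics).
    for _ in range(max_retries + 1):
        output, metrics = generate(request)
        if not looks_anomalous(metrics):
            return output
    raise RuntimeError("generation still anomalous after retries")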

KV Cache reuse race under PD disaggregation

Decode could abort a timed-out request and reclaim KV cache while Prefill-side writes were still in flight, creating cross-request cache corruption.

KV cache race under PD disaggregation

Fix: reclaim and reuse a request's KV cache only after Prefill confirms its writes are in a safe state (either never started or fully completed). The reported anomaly rate dropped from about 0.1% to below 0.03%.
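
One way to state that invariant in code. This is a simplified sketch with illustrative names; the real PD-disaggregation bookkeeping spans two services:

import threading
from enum import Enum, auto

class WriteState(Enum):
    NOT_STARTED = auto()
    IN_FLIGHT = auto()
    COMPLETED = auto()

class KVBlockPool:
    """Tracks Prefill write state per request; reclaims only when safe."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state: dict[str, WriteState] = {}

    def mark(self, request_id: str, state: WriteState) -> None:
        with self._lock:
            self._state[request_id] = state

    def try_reclaim(self, request_id: str) -> bool:
        # Safe only if Prefill writes never started or fully completed;
        # a timed-out request with writes in flight must defer reclamation.
        with self._lock:
            state = self._state.get(request_id, WriteState.NOT_STARTED)
            if state is WriteState.IN_FLIGHT:
                return False
            self._state.pop(request_id, None)
            return True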

HiCache read-before-ready issue and synchronization fix

Async cache loading improved throughput, but Forward could start before the indexer cache was ready, so kernels read incomplete cache state and produced abnormal outputs downstream.

HiCache synchronization fix pipeline

Fix: explicit synchronization barrier before launching the indexer kernel; this was submitted upstream to SGLang (PR #22811).
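
In PyTorch-style terms, the fix amounts to making the compute stream wait on the asynchronous load before the indexer kernel launches. A sketch of that pattern, not the actual SGLang patch:

import torch

load_stream = torch.cuda.Stream()
compute_stream = torch.cuda.current_stream()
load_done = torch.cuda.Event()

cache_gpu = torch.empty(1024, device="cuda")
cache_cpu = torch.randn(1024).pin_memory()

with torch.cuda.stream(load_stream):
    # Asynchronous host-to-device load of the indexer cache.
    cache_gpu.copy_(cache_cpu, non_blocking=True)
    load_done.record(load_stream)

# The barrier: the indexer kernel must not launch until the load lands.
compute_stream.wait_event(load_done)
indexer_out = cache_gpu * 2.0  # stand-in for the indexer kernel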

LayerSplit for long-context serving bottlenecks

After the correctness fixes, the next bottleneck was Prefill throughput and memory pressure. LayerSplit stores only a subset of layers' KV cache on each GPU and overlaps KV broadcast with indexer computation.

LayerSplit layer-wise KV storage
LayerSplit throughput improvement chart

Reported result: throughput improvement ranged from 10% to 132% as context length increased under high cache-hit conditions.
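
The overlap itself follows a standard two-stream pattern: enqueue per-layer KV broadcasts on a communication stream while the compute stream runs the indexer on layers whose KV has already arrived. A sketch under that assumption (run_indexer is a hypothetical stand-in, and torch.distributed must already be initialised):

import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()
compute_stream = torch.cuda.current_stream()

def prefill(kv_shards, run_indexer):
    # Enqueue every layer's broadcast on the communication stream; they
    # execute in order there while compute proceeds independently.
    ready = []
    for shard in kv_shards:
        with torch.cuda.stream(comm_stream):
            dist.broadcast(shard, src=0)
            ev = torch.cuda.Event()
            ev.record(comm_stream)
            ready.append(ev)
    # Compute waits per layer, so layer i's indexer overlaps with the
    # broadcasts of layers i+1, i+2, ...
    for shard, ev in zip(kv_shards, ready):
        compute_stream.wait_event(ev)
        run_indexer(shard)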

Reliability principles for scaled inference

For large-scale agent serving, throughput, latency, and availability are not enough; infrastructure must also preserve model-state correctness for every generation.

Source


AI Research Skills for Coding Agents: 98 Open-Source Workflows for Training to Deployment

AI Research Skills is an open-source skills library designed to help coding agents execute end-to-end AI research workflows, from ideation and training to evaluation and paper writing.

What the project provides

  • A broad skills catalogue across research orchestration and engineering domains.
  • Domain-specific skills for fine-tuning, distributed training, optimisation, inference, RAG, safety, and multimodal workflows.
  • An autoresearch orchestration layer to route tasks across specialised skills.

Installation and usage

# install the skills library
npx @orchestra-research/ai-research-skills
# list available skills
npx @orchestra-research/ai-research-skills list
# update installed skills
npx @orchestra-research/ai-research-skills update

The repository documents support for common coding agents and includes category-level installation options via Claude Code marketplace commands.

Why this matters for agentic AI research

  • Reduces setup friction by packaging framework-specific patterns into reusable skills.
  • Improves execution quality for research tasks that need more than generic code generation.
  • Standardises operational knowledge for training, evaluation, and deployment paths.

Source


CC Switch Desktop Manager: Unified Control for Claude Code, Codex and Gemini CLI

CC Switch is an open-source desktop control plane for managing multiple AI coding CLIs from one interface; the details below are drawn from the official project repository.

CC Switch desktop app main interface

What CC Switch solves

  • Unifies provider configuration across Claude Code, Codex, Gemini CLI, OpenCode, and OpenClaw.
  • Removes repeated manual edits across JSON, TOML, and environment files.
  • Centralises MCP servers, prompts, and skills management in one desktop app.

Key capabilities from the repository

  • Provider management: switch providers quickly with preset and custom configurations.
  • Unified MCP panel: manage MCP servers across supported apps with sync support.
  • Prompts and skills: manage prompt presets and install skills from GitHub repositories or ZIP packages.
  • Cross-platform desktop app: Windows, macOS, and Linux support, built with Tauri.
  • Data reliability: SQLite as the single source of truth (SSOT), plus atomic writes and backup rotation (the generic pattern is sketched after this list).
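
CC Switch implements this in Rust, but the write pattern itself is generic. A Python sketch of atomic replace plus backup rotation, for illustration only:

import os
import shutil
import tempfile

def atomic_write(path: str, data: bytes, backups: int = 3) -> None:
    # Rotate backups: cfg.bak2 -> cfg.bak3, cfg.bak1 -> cfg.bak2, cfg -> cfg.bak1
    for i in range(backups - 1, 0, -1):
        src = f"{path}.bak{i}"
        if os.path.exists(src):
            os.replace(src, f"{path}.bak{i + 1}")
    if os.path.exists(path):
        shutil.copy2(path, f"{path}.bak1")

    # Write to a temp file in the same directory, fsync, then atomically
    # rename so readers never observe a half-written config.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise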

Architecture snapshot

The repository documents a React + TypeScript frontend connected to a Rust/Tauri backend through IPC, with layered services for providers, MCP, proxy, sessions, and config management.

Source


AI Agent Harness Engineering: Why System Design Beats Model Size

Model quality matters, but production outcomes in AI coding systems are increasingly defined by harness quality: prompts, tools, policies, hooks, sandboxes, and verification loops.

Harness engineering overview image from LinkedIn post

Why harness engineering matters

  • An agent is not just the model; it is the model plus runtime scaffolding.
  • Most practical failures are configuration and orchestration failures, not raw model-weight failures.
  • Teams often get better results from a mid-tier model with a strong harness than a top-tier model with weak controls.

Core harness components for coding agents

  • Context layer: system rules, AGENTS.md or equivalent memory files, skill prompts.
  • Execution layer: tools, shell, and sandboxed runtimes with clear allow-lists.
  • Control layer: planning, subagent delegation, and step-wise verification.
  • Safety layer: hooks for lint/test gates and destructive command blocking.
  • Learning layer: ratchet principle where each recurring failure becomes an enforceable rule.
Diagram showing model plus harness components

Practical implementation pattern

  1. Start from desired behaviour (e.g., safe commits, complete tests, predictable output format).
  2. Map each behaviour to one harness element (prompt rule, hook, tool policy, or evaluator); a sketch follows after this list.
  3. Run short loops with automated checks and compact context for long tasks.
  4. Record repeated failures and codify them as constraints in the harness.
Mapping desired behaviours to harness design elements
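
As a concrete instance of step 2, a safety-layer rule that blocks destructive shell commands might look like the following. The hook interface here is hypothetical; real agent products each define their own:

import re

DESTRUCTIVE_PATTERNS = [
    r"\brm\s+-rf\s+/",           # recursive delete from the filesystem root
    r"\bgit\s+push\s+--force\b",  # force-push over shared history
    r"\bDROP\s+TABLE\b",          # raw destructive SQL
]

def pre_tool_use(command: str) -> tuple[bool, str]:
    # Called by the (hypothetical) agent runtime before executing any shell
    # command the model proposes; returns (allowed, reason).
    for pattern in DESTRUCTIVE_PATTERNS:
        if re.search(pattern, command, re.IGNORECASE):
            return False, f"blocked by harness rule: {pattern}"
    return True, "ok"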

Source


MCP vs Tool Calling vs Skills: When to Use Each Layer in LLM Agent Architecture

Tool Calling, MCP, and Skills are often grouped together, but they solve different problems at different layers. If you treat them as interchangeable, your architecture gets noisy fast.

Layer 1: Tool Calling (the primitive)

Tool Calling is the direct execution primitive: you define a schema, the model emits a structured call, and your runtime executes deterministic code.

  • Best fit: small toolsets inside one application.
  • Strength: precise control and tight runtime coupling.
  • Trade-off: prompt/context bloat as tool count grows.
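
A minimal sketch of the primitive with a hypothetical weather tool: the schema is what the model sees, and the dispatch table is the deterministic runtime half.

import json

TOOL_SCHEMAS = [{
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real API call

DISPATCH = {"get_weather": get_weather}

def execute_tool_call(raw_call: str) -> str:
    # The model emits a structured call such as
    # {"name": "get_weather", "arguments": {"city": "Berlin"}}
    call = json.loads(raw_call)
    return DISPATCH[call["name"]](**call["arguments"])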

Layer 2: MCP (the protocol contract)

MCP standardises how clients discover and use external capabilities from tool servers. It addresses cross-app and cross-client reuse much better than hardcoded in-app tools.

  • Best fit: team-wide integrations and reuse across multiple clients.
  • Strength: portability and discovery.
  • Trade-off: operational overhead (hosting, auth, lifecycle management).

Layer 3: Skills (the capability playbook)

Skills package instructions, scripts, examples, and conventions into reusable expertise. They are especially effective when how to execute matters as much as what to execute.

  • Best fit: repeatable multi-step workflows and team conventions.
  • Strength: progressive loading and reusable process knowledge.
  • Trade-off: quality depends heavily on disciplined authoring.

A practical selection rule

  1. Use Tool Calling for deterministic in-app execution.
  2. Add MCP when integrations must be shared across clients or teams.
  3. Use Skills to encode execution quality, standards, and workflow behaviour.

The strongest production systems usually combine all three layers instead of forcing every problem into one abstraction.

Recreated comparison visual

flowchart TB
    A[Tool Calling<br/>Primitive execution] --> D[Use when<br/>1-10 tools in one app]
    B[MCP<br/>Service contract protocol] --> E[Use when<br/>cross-app reuse and team integrations]
    C[Skills<br/>Capability playbook] --> F[Use when<br/>workflow quality and conventions matter]

    A -. layer .-> B
    B -. layer .-> C

    classDef col1 fill:#E3F2FD,stroke:#1E88E5,color:#0D47A1;
    classDef col2 fill:#E8F5E9,stroke:#43A047,color:#1B5E20;
    class A,B,C col1;
    class D,E,F col2;

Source

Original LinkedIn post


AI Agent Architecture Explained: The 7-Layer Loop Behind Modern Agent Systems

Across most modern agent frameworks, the implementation details differ but the operational loop is strikingly similar. The practical differences usually come from how deeply each layer is engineered, not from the marketing label of the framework.

AI agent architecture: seven-layer operating loop

flowchart LR
    P[Perceive<br/>Input, trigger, events]
    M[Remember<br/>Short-term + long-term memory]
    T[Think<br/>Reasoning over context]
    PL[Plan<br/>Task decomposition]
    A[Act<br/>Tool/API execution]
    O[Observe<br/>Tracing, logs, metrics]

    P --> M --> T --> PL --> A --> O --> P

    G[Guardrails<br/>Policy, approvals, HITL, filters]
    G -. governs .- P
    G -. governs .- M
    G -. governs .- T
    G -. governs .- PL
    G -. governs .- A
    G -. governs .- O

    classDef core fill:#E8F5E9,stroke:#2E7D32,stroke-width:1.5px,color:#1B5E20;
    classDef guard fill:#FFF3E0,stroke:#EF6C00,stroke-width:1.5px,color:#E65100;

    class P,M,T,PL,A,O core;
    class G guard;

What each layer does in practice

  • Perceive: receives user messages, API triggers, or system events.
  • Remember: keeps active state and retrieves persistent knowledge.
  • Think: reasons over input plus memory to choose the next move.
  • Plan: breaks larger goals into executable sub-steps.
  • Act: invokes tools, APIs, code execution, file/database operations.
  • Observe: captures traces, latency, cost, and outcome quality signals.
  • Guardrails (cross-cutting): enforces policy, permission boundaries, and human approval checkpoints.
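
Stated as code, the loop behind these layers is compact. A minimal sketch where every component is a caller-supplied stand-in for a real framework's implementation:

from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    tool: str
    args: dict

def run_agent(event: str,                                # Perceive: the trigger
              plan: Callable[[str, list], list[Step]],   # Remember + Think + Plan
              tools: dict[str, Callable[..., object]],   # Act
              allows: Callable[[Step], bool],            # Guardrails
              max_iters: int = 10) -> list:
    memory: list = []   # short-term memory across iterations
    trace: list = []    # Observe: outcome log
    for _ in range(max_iters):
        steps = plan(event, memory)
        if not steps:   # planner signals completion
            break
        for step in steps:
            if not allows(step):
                trace.append((step, "blocked by guardrail"))
                continue
            result = tools[step.tool](**step.args)
            memory.append((step, result))
            trace.append((step, result))
    return trace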

Why this model is useful

Once you recognise the loop, framework selection becomes a trade-off analysis: memory quality, planning reliability, tooling depth, and observability maturity. Production performance usually depends on those layer depths rather than framework naming.

Reference

Original LinkedIn source


Claude Code Weekend Project: 5 Fast Upgrades with Skills, Hooks, MCP, Subagents, and CLAUDE.md

If you want a high-impact Claude Code weekend, focus on a lightweight operating layer rather than heavy architecture. These five upgrades can usually be shipped quickly and improve consistency, safety, and reuse across your repo.

The 5-step weekend upgrade path

flowchart TB
    A[Step 1: Create Skill<br/>SKILL.md] --> B[Step 2: Add Hook<br/>Pre-commit guardrails]
    B --> C[Step 3: Connect MCP<br/>External tools]
    C --> D[Step 4: Build Subagent<br/>Isolated heavy tasks]
    D --> E[Step 5: Write CLAUDE.md<br/>Persistent project memory]

    A:::step
    B:::step
    C:::step
    D:::step
    E:::step

    classDef step fill:#E8F1FF,stroke:#3B82F6,stroke-width:1.5px,color:#0F172A;

1) Create your first Skill

Start with one reusable routine your team repeats often: PR review checklist, deployment checklist, or test-writing standard. A single well-scoped skill reduces repeated prompting and makes outputs more consistent.

2) Add a pre-commit Hook

Hooks enforce non-negotiable rules even when chat context drifts. A simple secret-leak check is a strong first step.

# .claude/hooks/pre-commit.sh
# Block any staged file ending in .env (simple secret-leak guard).
if git diff --cached --name-only | grep -qE '\.env$'; then
  echo "BLOCKED: .env file detected"
  exit 1
fi

3) Connect one MCP server

Pick one integration that removes frequent context switching. GitHub MCP is usually a fast win for reading issues, checking PR context, and tracking CI state from the same working flow.

4) Add one focused subagent

Use subagents for noisy or long-running tasks (test runs, log triage, dependency scans). Keep the main session clean by returning only concise summaries and action items.

5) Keep CLAUDE.md lean and current

  • How to run the project (commands, ports, env requirements)
  • Code conventions that actually matter in your repo
  • What not to touch (legacy/third-party or sensitive areas)
  • Current priorities and active workstreams
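
A lean skeleton along these lines; every entry below is a placeholder to replace with your project's specifics:

# CLAUDE.md

## Run the project
- npm install && npm run dev  (port 3000; needs DATABASE_URL set)

## Conventions that matter
- TypeScript strict mode; no `any` in src/
- Tests live next to the code as *.test.ts

## Do not touch
- vendor/ and migrations/legacy/ (third-party or frozen)

## Current priorities
- Billing refactor (branch: billing-v2)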

When these five are in place, Claude Code behaves less like an ad hoc assistant and more like an operational engineering system with guardrails and reusable patterns.


Six Core Primitives Behind Most AI Agent Frameworks: A Practical Evaluation Guide

Many agent frameworks look different on the surface but rely on a similar set of building blocks. Treating these as shared primitives can make framework evaluation more practical and less marketing-driven.

Agent framework six primitives chart

Six primitives that show up across agent frameworks

  • Context window: the available working memory per inference.
  • System prompt: always-in-context instructions for identity, constraints, and behaviour.
  • Skills: domain procedures or knowledge loaded on demand.
  • Tool interface: external system access layer (often MCP-style integrations).
  • Memory: persistent state across sessions, frequently paired with retrieval.
  • Sub-agents: parallel or delegated threads to avoid blocking the main loop.

How to evaluate frameworks with this lens

  • Check how each primitive is implemented, not just how it is named.
  • Measure observability and failure recovery around tool calls and memory writes.
  • Validate real-world usability: permissions, debugging flow, and maintainability.
  • Prioritise predictable execution and integration quality over branding.

Reference: https://www.linkedin.com/posts/most-agent-frameworks-are-just-renaming-the-share-7453461567142105088-LWYM/


GenericAgent Explained: Self-Evolving AI Agent Framework with Local System Control

GenericAgent is an open-source autonomous agent framework that focuses on self-evolving skills rather than large preloaded workflows. The project claims a compact core architecture, with a small agent loop plus a minimal set of atomic tools for browser, terminal, filesystem, keyboard/mouse, vision, and ADB-based mobile control.

GenericAgent self-evolving framework overview

Why GenericAgent is getting attention

  • Self-evolving skill tree: solved tasks are converted into reusable skills for later runs.
  • Minimal core: the project positions itself around a compact codebase and lightweight loop design.
  • Cross-model support: designed for multiple LLM providers via OpenAI-compatible endpoints.
  • Cross-platform aim: targets Windows, macOS, Linux, and Android (Termux/ADB workflows).

How the framework works

The main pattern is: execute a new task end-to-end, persist the successful path as a skill, and reuse that skill when a similar request appears. Over time, this can reduce repeated planning overhead and make the same instance more specialised for its owner’s workflows.
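
The pattern is easy to sketch with hypothetical storage and matching helpers. GenericAgent's actual implementation will differ, and real matching would likely use embeddings rather than keyword overlap:

import json
import pathlib

SKILLS_DIR = pathlib.Path("skills")

def save_skill(task: str, steps: list[str]) -> None:
    # Persist the successful path of a newly solved task as a reusable skill.
    SKILLS_DIR.mkdir(exist_ok=True)
    name = "".join(c if c.isalnum() else "_" for c in task.lower())[:60]
    (SKILLS_DIR / f"{name}.json").write_text(
        json.dumps({"task": task, "steps": steps}, indent=2))

def find_skill(task: str, min_score: float = 0.6):
    # Naive Jaccard match over task words; returns the best stored skill.
    words = set(task.lower().split())
    best, best_score = None, 0.0
    for path in SKILLS_DIR.glob("*.json"):
        skill = json.loads(path.read_text())
        other = set(skill["task"].lower().split())
        score = len(words & other) / max(len(words | other), 1)
        if score > best_score:
            best, best_score = skill, score
    return best if best_score >= min_score else None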

According to the project documentation, the design combines layered memory with a small toolset, and uses runtime extension through code execution when new capabilities are required.

Practical takeaways for AI agent builders

  • Persistent skill accumulation may be a strong path for long-term personalisation.
  • Smaller core loops can improve maintainability if tool boundaries remain clear.
  • Token-efficiency claims are compelling, but should be benchmarked against your own workload and safety requirements.
  • System-level control requires strong guardrails, approvals, and secure execution policies in production environments.

Project page: github.com/lsdefine/GenericAgent. The repository also links a technical report on arXiv for deeper details on methodology and evaluation.


Claude Organic App Discovery: How AI App Distribution Is Changing in 2026

Claude connectors and app discovery are evolving quickly, and teams should focus on public product capabilities and documented integration behaviour when planning distribution strategy.

Publicly available updates on Claude connectors and discovery

  • Anthropic announced Integrations (remote MCP support) so Claude can connect to external apps and services, not just local desktop MCP servers.
  • Claude introduced a connectors directory for browsing and connecting supported tools.
  • Claude documentation states that directory connectors can appear as in-chat suggested connectors when relevant to a user task.
  • The directory documentation describes ranking as usage-based, similar to app-store style discovery.
  • Official help/docs also highlight security and permission controls, including individual authentication and policy/terms coverage for listed connectors.

Practical implications for teams

  • Focus on connector quality, reliability, and clear task fit.
  • Optimise descriptions/use-cases so users can quickly identify relevance.
  • Track user adoption and successful task completion to improve connector performance over time.
  • Review directory policy and terms before submitting connectors.

Public sources