Harness Engineering
A coding agent is the model plus everything you build around it. Harness engineering treats that scaffolding as a real artifact — and tightens it every time the agent slips.
Agent = Model + Harness
A raw model isn't an agent. It becomes one once a harness wraps it with state, tool execution, feedback loops, and enforceable constraints. The harness is every line of code, config, and execution logic that isn't the model — and it dominates the behaviour you experience.
Agent = Model + Harness. If you're not the model, you're the harness.
HARNESS ENGINEERING — SYSTEM DIAGRAM
┌──────────────────────────────────────────────────────────────────┐
│ OPERATOR │
└──────────────────────────────┬───────────────────────────────────┘
│ goal
▼
┌──────────────────────────────────────────────────────────────────┐
│ HARNESS │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌────────┐ │
│ │ prompts │ │ tools │ │ context │ │ hooks │ │
│ │ AGENTS.md │ │ bash · MCP │ │ compaction │ │ guards │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ MODEL │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌────────┐ │
│ │ subagents │ │ sandbox │ │ memory │ │ obs │ │
│ │ planner/exec│ │ filesystem │ │ AGENTS.md │ │ traces │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └────────┘ │
└──────────────────────────────────────────────────────────────────┘
│
▼
SHIPPED OUTPUT
The "Skill Issue" Reframe
The agent does something dumb. The reflex is to blame the model and wait for the next version. Harness engineering rejects that default — failures are usually legible, and the next version of the harness encodes the lesson.
The pattern: the agent didn't know about a convention, so you add it to AGENTS.md. The agent ran a destructive command, so you add a hook that blocks it. The agent got lost in a 40-step task, so you split it into a planner and an executor. The agent kept "finishing" broken code, so you wire a typecheck back-pressure signal into the loop.
It's not a model problem. It's a configuration problem. — HumanLayer
The Terminal Bench data point
On Terminal Bench 2.0, Claude Opus 4.6 running inside Claude Code scored materially lower than the same model running in a custom harness. Viv Trivedy's team moved a coding agent from Top 30 to Top 5 by changing only the harness — same model, different scaffolding.
Models get post-trained against the harness they were trained inside. Moving them into a different harness — with better tools for your codebase, a tighter prompt, sharper back-pressure — can unlock capability the original harness was leaving on the floor. The gap between what today's models can do and what you see them doing is largely a harness gap.
Working Backwards from Behaviour
Start from the behaviour you want and derive the harness piece that delivers it. If you can't name the behaviour a component exists to deliver, it probably shouldn't be there.
| Behaviour | Harness Component | Why |
|---|---|---|
| Work with real data, durably | Filesystem + Git | Workspace, offload, versioning, branches |
| Write and execute code | Bash + code execution | General-purpose tool — agent builds tools on the fly |
| Safe execution + defaults | Sandboxes + bundled tooling | Isolated env, allow-listed commands, headless browser, test runners |
| Remember new knowledge | Memory files + web search + MCPs | AGENTS.md reload across sessions; bridge training cutoff |
| Stay coherent over long context | Compaction · tool offloading · skills | Fight context rot; progressive disclosure |
| Long-horizon execution | Ralph loops · planning · verification | Multi-session work, self-check, P/G/E splits |
The two underrated primitives
FILESYSTEM
Boring, foundational, underrated. Models can only operate directly on what fits in context. Without a filesystem, you're copy-pasting into a chat window. Add Git on top and you get versioning, rollback, and branch experiments for free. Most other primitives end up pointing at the filesystem for something.
BASH
Instead of pre-building a tool for every action, give the agent bash and let it build the tools it needs on the fly. Most tasks collapse to a few well-chosen CLI invocations. It's the difference between teaching someone to use one kitchen gadget and handing them the whole kitchen.
SANDBOX
Bash is only useful if it runs somewhere safe. Sandboxes give isolated, allow-listed environments, network isolation, and disposable runs. Good defaults matter: pre-installed runtimes, Git, test CLIs, headless browser. The model doesn't pick its execution environment — that's a harness call.
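Concretely, the allow-list half of that call can be sketched in a few lines. This is a hypothetical runner (`run_tool` and the `ALLOWED` set are illustrative names, not any real sandbox's API); production sandboxes add filesystem and network isolation underneath:

```python
import shlex
import subprocess

# A hypothetical policy: commands the harness permits the agent to run.
# Real sandboxes layer this on top of filesystem and network isolation.
ALLOWED = {"ls", "cat", "echo", "rg", "git", "python", "pytest"}

def run_tool(command: str, timeout: int = 30) -> str:
    """Execute an agent-issued shell command against an allow-list."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED:
        # Verbose refusal: this text goes straight back into the loop.
        return f"BLOCKED: {command!r} is not on the allow-list"
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return result.stdout + result.stderr
```

The refusal message is deliberately loud: a blocked call that fails silently teaches the model nothing.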
MEMORY
No way to edit weights in production, so context injection is the only path. AGENTS.md reloads every session — knowledge from one run carries to the next. Crude but effective continual learning. Web search and MCPs (Context7, etc.) bridge the training cutoff.
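A minimal sketch of that reload-and-ratchet cycle, assuming nothing beyond the filesystem (`load_system_prompt` and `record_lesson` are illustrative names):

```python
from pathlib import Path

AGENTS_MD = Path("AGENTS.md")

def load_system_prompt(base: str) -> str:
    """Every session starts by folding AGENTS.md into the system prompt."""
    rules = AGENTS_MD.read_text() if AGENTS_MD.exists() else ""
    return f"{base}\n\n{rules}"

def record_lesson(rule: str) -> None:
    """Ratchet: after a real failure, append one traceable rule."""
    existing = AGENTS_MD.read_text() if AGENTS_MD.exists() else ""
    if rule not in existing:               # keep it short; no duplicates
        AGENTS_MD.write_text(existing + f"- {rule}\n")

record_lesson("Never comment out tests; delete them or fix them.")
prompt = load_system_prompt("You are a coding agent.")
```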
Long-Horizon Execution
Today's models suffer from early stopping, poor decomposition, and incoherence as work stretches across context windows. The harness designs around all of that — Ralph loops, planning files, planner/generator/evaluator splits, full context resets.
The Ralph Loop
A hook intercepts the model's attempt to exit and re-injects the original prompt into a fresh context window, forcing the agent to continue against a completion goal. Each iteration starts clean but reads state from the previous one through the filesystem. A surprisingly simple trick for turning a single-session agent into a multi-session one — the kind of primitive you'd never derive from "just use a smarter model."
RALPH LOOP
┌────────────────────────────────┐
│ goal.md (completion criteria) │
└────────────────┬───────────────┘
│
▼
┌────────────────────────────────┐ ┌─────────────────────┐
│ fresh context ──▶ agent │ exit? │ hook re-injects │
│ │ ──yes──▶│ goal + state │──┐
│ reads state from filesystem │ └─────────────────────┘ │
│ writes progress to filesystem │ │
└────────────────┬───────────────┘ │
│ goal met? │
├─ no ◀───────────────────────────────────────────┘
│
▼ yes
DONE
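The loop above can be sketched in a few lines. `run_session` is a stub standing in for a real model call, and `goal_met` stands in for real verification; the shape of the loop, fresh context plus filesystem state, is the point:

```python
from pathlib import Path

GOAL = Path("goal.md")        # completion criteria, written once up front
STATE = Path("progress.md")   # the only memory that survives a context reset

def run_session(goal: str, state: str) -> str:
    """One fresh-context agent run. Stub: a real harness would invoke
    the model here; appending a line keeps the loop demonstrable."""
    return state + "step done\n"

def goal_met(state: str) -> bool:
    return state.count("step done") >= 3   # stand-in for real verification

def ralph_loop(max_iterations: int = 10) -> str:
    state = STATE.read_text() if STATE.exists() else ""
    for _ in range(max_iterations):
        if goal_met(state):
            break
        # The exit hook fires here: re-inject the original goal into a
        # brand-new context window, carrying only filesystem state forward.
        goal = GOAL.read_text() if GOAL.exists() else "finish the task"
        state = run_session(goal, state)
        STATE.write_text(state)            # progress persists via filesystem
    return state

final = ralph_loop()
```

Note the `max_iterations` budget: an unconditional loop plus a goal the agent can never satisfy is how you burn a weekend of tokens.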
Planner / Generator / Evaluator
Anthropic's long-running harness work is explicit: separating generation from evaluation into distinct agents outperforms self-evaluation, because agents reliably skew positive when grading their own work. It's GANs for prose. The related pattern is the sprint contract — generator and evaluator negotiate what "done" actually means before code gets written.
Writing down the done-condition before starting catches more scope drift than any prompt change.
PLANNER / GENERATOR / EVALUATOR
┌───────────┐ contract ┌───────────┐ submit ┌───────────┐
│ PLANNER │ ──────────────▶ │ GENERATOR │ ─────────────▶│ EVALUATOR │
│ decompose │ │ implement │ │ verify │
└───────────┘ └───────────┘ └─────┬─────┘
▲ │
│ reject + reasons │
└─────────────────────────────────────────────────────────┘
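A sketch of the split, with stubs in place of real model calls (the first attempt is deliberately buggy so the evaluator has something to reject; every function name here is illustrative):

```python
def planner(task: str) -> str:
    """Write the done-condition down before any code exists."""
    return f"DONE WHEN add(2, 3) returns 5  # contract for: {task}"

def generator(contract: str, feedback: str = "") -> str:
    """Stub generation; a real harness calls the model here. The first
    attempt is deliberately buggy so the evaluator has work to do."""
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"   # plausible-looking bug

def evaluator(contract: str, code: str) -> tuple[bool, str]:
    """A separate agent verifies against the contract; generators
    grading their own work skew positive."""
    scope: dict = {}
    exec(code, scope)
    ok = scope["add"](2, 3) == 5
    return ok, "" if ok else "add(2, 3) != 5; reread the contract"

def sprint(task: str, max_rounds: int = 3) -> str:
    contract = planner(task)
    feedback = ""
    for _ in range(max_rounds):
        code = generator(contract, feedback)
        ok, feedback = evaluator(contract, code)
        if ok:
            return code
    raise RuntimeError("contract not met within budget")
```

The reject-with-reasons edge in the diagram is the `feedback` string: rejection text flows back into the next generation, not into a void.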
Hooks — the enforcement layer
Hooks separate "I told the agent to do X" from "the system enforces X." A script runs at a specific lifecycle point: before tool call, after file edit, before commit, on session start. The right place for things the agent should never forget but often does — typecheck after edit, block rm -rf and git push --force, require approval before opening a PR.
Success is silent, failures are verbose. — HumanLayer
If typecheck passes, the agent hears nothing. If it fails, the error text gets injected into the loop and the agent self-corrects. Almost-free in the common case, directly actionable when something goes wrong.
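A hedged sketch of both hook types. Lifecycle names and payload shapes vary by harness; `pre_tool_hook` and `post_edit_hook` are illustrative, and `py_compile` stands in for a real typechecker:

```python
import re
import subprocess
import sys

# Patterns the agent should never run, however politely it asks.
BLOCKED = [r"\brm\s+-rf\b", r"git\s+push\s+--force"]

def pre_tool_hook(command: str) -> None:
    """Before every bash call. Blocking here is enforcement, not
    instruction: the agent cannot forget a hook."""
    for pattern in BLOCKED:
        if re.search(pattern, command):
            # Nonzero exit: the harness cancels the call and feeds
            # this message back into the model's context.
            sys.exit(f"blocked by hook: matches {pattern!r}")

def post_edit_hook(paths: list[str]) -> str:
    """After every file edit. Success is silent; failure text is
    verbose and lands back in the loop."""
    result = subprocess.run(
        [sys.executable, "-m", "py_compile", *paths],  # stand-in typechecker
        capture_output=True, text=True,
    )
    return "" if result.returncode == 0 else result.stderr
```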
AGENTS.md Discipline
The flat markdown rulebook at the root of your repo is the single highest-leverage configuration point — it lands in the system prompt every turn. Two hard-won rules: keep it short, and earn each line.
Every mistake becomes a rule
The most important habit in harness engineering is treating agent mistakes as permanent signals. Not one-off stories to laugh about, not bad runs to retry. Signals.
If the agent ships a PR with a commented-out test and you merge it by accident, that's an input. The next AGENTS.md says "never comment out tests; delete them or fix them." The next pre-commit hook greps for .skip( and xit( in the diff. The next reviewer subagent flags commented-out tests as a blocker.
Every line in a good AGENTS.md should be traceable back to a specific thing that went wrong.
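That pre-commit grep can be sketched as a diff scanner (`scan_diff` is an illustrative name; the patterns are the ones from the failure above):

```python
import re

# Each pattern traces back to a PR that slipped through review.
SUSPECT = [re.compile(p) for p in (r"\.skip\(", r"\bxit\(")]

def scan_diff(diff: str) -> list[str]:
    """Flag added lines that disable tests instead of fixing them."""
    findings = []
    for line in diff.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            for pattern in SUSPECT:
                if pattern.search(line):
                    findings.append(line)
                    break
    return findings

diff = "+ it.skip('flaky on CI', () => {})\n- it('works', () => {})"
assert scan_diff(diff)   # nonempty findings: the hook fails the commit
```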
Pilot's checklist, not style guide
- Keep it short. HumanLayer keeps theirs under 60 lines. Every line competes for attention; more rules make each rule matter less.
- Earn each line. Rules trace to a specific past failure or hard external constraint. Ratchet — don't brainstorm.
- Add only when you've seen real failure. Don't pre-write "principles." Wait for the regression, then encode it.
- Remove when redundant. When a more capable model no longer needs a rule, it's load-bearing for nothing; take it out.
Same discipline for tools
Each tool's name, description, and schema gets stamped into the prompt every request. Ten focused tools outperform fifty overlapping ones — the model can hold the menu in its head. Sloppy or malicious MCPs can prompt-inject your agent before you've typed anything; tool descriptions are trusted text the model will read.
Harness-as-a-Service
We're moving from building on LLM APIs (which give you a completion) to building on harness APIs (which give you a runtime). Claude Agent SDK, Codex SDK, OpenAI Agents SDK — all point in the same direction.
The default path shifts
The old default: build your own loop, wire your own tool-calling, handle your own conversation state, invent your own approval flow. The new default: pick a harness framework, configure it along the four pillars (system prompt, tools, context, subagents), and put the rest of your effort into domain-specific prompt and tool design.
That's what makes "skill issue" tractable. You're not rebuilding an agent from scratch every time something goes wrong — you're tuning a configuration surface that's already well-factored.
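What that configuration surface might look like, as a sketch. `HarnessConfig` is a hypothetical shape, not any particular SDK's API; real frameworks differ in names, not in structure:

```python
from dataclasses import dataclass, field

@dataclass
class HarnessConfig:
    """The four pillars as one configuration surface. Hypothetical
    shape; real SDKs differ in names, not in structure."""
    system_prompt: str
    tools: list = field(default_factory=list)       # small, focused menu
    context_policy: str = "compact-at-80-percent"   # when to summarise
    subagents: dict = field(default_factory=dict)   # role -> brief

config = HarnessConfig(
    system_prompt="You are a coding agent. AGENTS.md rules follow.",
    tools=["bash", "read_file", "edit_file", "web_search"],
    subagents={"planner": "decompose the task",
               "evaluator": "verify against the contract"},
)
```

The iteration loop then lives in the values, not the plumbing: tighten the prompt, prune the tool list, swap a subagent brief, and rerun.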
Good agent building is an exercise in iteration. You can't do iterations if you don't have a v0.1. — Viv Trivedy
Harnesses don't shrink, they move
The naive story: better models make harnesses obsolete. If the model can plan, no planner. If the model is coherent at long horizons, no context resets.
What actually happens: the ceiling moves with the model. Tasks that were unreachable are in play, and they have their own failure modes. The anxiety scaffolding goes away (Sonnet 4.5 wrapping up early as it approached its context limit — fixed in Opus 4.6), and in its place you need a multi-day memory policy, a harness coordinating three specialised agents, evaluators for design quality in generated UIs.
Every component in a harness encodes an assumption about what the model can't do on its own. — Anthropic Engineering
The model-harness training loop
Today's agent products are post-trained with harnesses in the loop. The model gets specifically better at the actions the harness designers think it should be good at: filesystem operations, bash, planning, subagent dispatch. That's why Opus 4.6 feels different inside Claude Code than in someone else's harness, and why changing a tool's logic sometimes causes strange regressions. A genuinely general model wouldn't care whether you used apply_patch or str_replace — but co-training creates overfitting.
MODEL ↔ HARNESS TRAINING LOOP
┌──────────────────┐ ┌──────────────────┐
│ primitive │ ──────▶ │ standardised │
│ found in harness │ │ in product │
└──────────────────┘ └─────────┬────────┘
▲ │
│ ▼
┌──────────────────┐ ┌──────────────────┐
│ next-gen model │ ◀────── │ used in next │
│ better at it │ │ training run │
└──────────────────┘ └──────────────────┘
Look at the top coding agents side by side — Claude Code, Cursor, Codex, Aider, Cline. They look more like each other than their underlying models do. The industry is slowly finding the load-bearing pieces of scaffolding that turn a generative model into something that can ship.
References
The canon — read these in order if you want the full picture.
VIV TRIVEDY
Anatomy of an Agent Harness — coined the term, derived the components from behaviour. Plus HaaS framing.
ANTHROPIC
Harness design for long-running apps — the cleanest public breakdown of long-horizon harness design. Compaction, context resets, P/G/E.
HUMANLAYER
Skill issue — harness engineering for coding agents — the configuration-not-weights reframe. AGENTS.md discipline.
ADDY OSMANI
Agent Harness Engineering — the synthesis post that pulled the threads together. Source for this guide.
SIMON WILLISON
Designing agentic loops — agent as "tools in a loop to achieve a goal." Bash-first thinking.
FAREED KHAN
Building Claude Code with harness engineering — annotated architecture diagram.
Deploy & Run
How this page itself was built — single-file HTML, deployed to Cloudflare Pages via wrangler CLI from Claude Desktop with mcp-server-commands.
# Create the Pages project (idempotent)
export CLOUDFLARE_ACCOUNT_ID=691fe25d377abac03627d6a88d3eeac9
wrangler pages project create harness-engineering-guide \
  --production-branch main

# Write index.html, deploy
mkdir -p /tmp/harness-engineering-guide
# ... write index.html ...
cd /tmp/harness-engineering-guide
wrangler pages deploy . \
  --project-name harness-engineering-guide \
  --branch main \
  --commit-dirty=true