Harness Engineering
A coding agent is the model plus everything you build around it. Harness engineering treats that scaffolding as a real artifact — and tightens it every time the agent slips.
Agent = Model + Harness
A raw model isn't an agent. It becomes one once a harness wraps it with state, tool execution, feedback loops, and enforceable constraints. The harness is every line of code, config, and execution logic that isn't the model — and it dominates the behaviour you experience.
Agent = Model + Harness. If you're not the model, you're the harness.
HARNESS ENGINEERING — SYSTEM DIAGRAM
┌──────────────────────────────────────────────────────────────────┐
│ OPERATOR │
└──────────────────────────────┬───────────────────────────────────┘
│ goal
▼
┌──────────────────────────────────────────────────────────────────┐
│ HARNESS │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌────────┐ │
│ │ prompts │ │ tools │ │ context │ │ hooks │ │
│ │ AGENTS.md │ │ bash · MCP │ │ compaction │ │ guards │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ MODEL │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌────────┐ │
│ │ subagents │ │ sandbox │ │ memory │ │ obs │ │
│ │ planner/exec│ │ filesystem │ │ AGENTS.md │ │ traces │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └────────┘ │
└──────────────────────────────────────────────────────────────────┘
│
▼
SHIPPED OUTPUT
The "Skill Issue" Reframe
The agent does something dumb. The reflex is to blame the model and wait for the next version. Harness engineering rejects that default — failures are usually legible, and the next version of the harness encodes the lesson.
The pattern: the agent didn't know about a convention, so you add it to AGENTS.md. The agent ran a destructive command, so you add a hook that blocks it. The agent got lost in a 40-step task, so you split it into a planner and an executor. The agent kept "finishing" broken code, so you wire a typecheck back-pressure signal into the loop.
It's not a model problem. It's a configuration problem. — HumanLayer
The Terminal Bench data point
On Terminal Bench 2.0, Claude Opus 4.6 running inside Claude Code scored materially lower than the same model running in a custom harness. Viv Trivedy's team moved a coding agent from Top 30 to Top 5 by changing only the harness — same model, different scaffolding.
Models get post-trained against the harness they were trained inside. Moving them into a different harness — with better tools for your codebase, a tighter prompt, sharper back-pressure — can unlock capability the original harness was leaving on the floor. The gap between what today's models can do and what you see them doing is largely a harness gap.
Working Backwards from Behaviour
Start from the behaviour you want and derive the harness piece that delivers it. If you can't name the behaviour a component exists to deliver, it probably shouldn't be there.
| Behaviour | Harness Component | Why |
|---|---|---|
| Work with real data, durably | Filesystem + Git | Workspace, offload, versioning, branches |
| Write and execute code | Bash + code execution | General-purpose tool — agent builds tools on the fly |
| Safe execution + defaults | Sandboxes + bundled tooling | Isolated env, allow-listed commands, headless browser, test runners |
| Remember new knowledge | Memory files + web search + MCPs | AGENTS.md reload across sessions; bridge training cutoff |
| Stay coherent over long context | Compaction · tool offloading · skills | Fight context rot; progressive disclosure |
| Long-horizon execution | Ralph loops · planning · verification | Multi-session work, self-check, P/G/E splits |
The two underrated primitives
FILESYSTEM
Boring, foundational, underrated. Models can only operate directly on what fits in context. Without a filesystem, you're copy-pasting into a chat window. Add Git on top and you get versioning, rollback, and branch experiments for free. Most other primitives end up pointing at the filesystem for something.
BASH
Instead of pre-building a tool for every action, give the agent bash and let it build the tools it needs on the fly. Most tasks collapse to a few well-chosen CLI invocations. It's the difference between teaching someone to use one kitchen gadget and handing them the whole kitchen.
SANDBOX
Bash is only useful if it runs somewhere safe. Sandboxes give isolated, allow-listed environments, network isolation, and disposable runs. Good defaults matter: pre-installed runtimes, Git, test CLIs, headless browser. The model doesn't pick its execution environment — that's a harness call.
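Concretely, the allow-list half of that call can be sketched in a few lines. This is a hypothetical runner (`run_tool` and the `ALLOWED` set are illustrative names, not any real sandbox's API); production sandboxes add filesystem and network isolation underneath:

```python
import shlex
import subprocess

# A hypothetical policy: commands the harness permits the agent to run.
# Real sandboxes layer this on top of filesystem and network isolation.
ALLOWED = {"ls", "cat", "echo", "rg", "git", "python", "pytest"}

def run_tool(command: str, timeout: int = 30) -> str:
    """Execute an agent-issued shell command against an allow-list."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED:
        # Verbose refusal: this text goes straight back into the loop.
        return f"BLOCKED: {command!r} is not on the allow-list"
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return result.stdout + result.stderr
```

The refusal message is deliberately loud: a blocked call that fails silently teaches the model nothing.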
MEMORY
No way to edit weights in production, so context injection is the only path. AGENTS.md reloads every session — knowledge from one run carries to the next. Crude but effective continual learning. Web search and MCPs (Context7, etc.) bridge the training cutoff.
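A minimal sketch of that reload-and-ratchet cycle, assuming nothing beyond the filesystem (`load_system_prompt` and `record_lesson` are illustrative names):

```python
from pathlib import Path

AGENTS_MD = Path("AGENTS.md")

def load_system_prompt(base: str) -> str:
    """Every session starts by folding AGENTS.md into the system prompt."""
    rules = AGENTS_MD.read_text() if AGENTS_MD.exists() else ""
    return f"{base}\n\n{rules}"

def record_lesson(rule: str) -> None:
    """Ratchet: after a real failure, append one traceable rule."""
    existing = AGENTS_MD.read_text() if AGENTS_MD.exists() else ""
    if rule not in existing:               # keep it short; no duplicates
        AGENTS_MD.write_text(existing + f"- {rule}\n")

record_lesson("Never comment out tests; delete them or fix them.")
prompt = load_system_prompt("You are a coding agent.")
```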
Long-Horizon Execution
Today's models suffer from early stopping, poor decomposition, and incoherence as work stretches across context windows. The harness designs around all of that — Ralph loops, planning files, planner/generator/evaluator splits, full context resets.
The Ralph Loop
A hook intercepts the model's attempt to exit and re-injects the original prompt into a fresh context window, forcing the agent to continue against a completion goal. Each iteration starts clean but reads state from the previous one through the filesystem. A surprisingly simple trick for turning a single-session agent into a multi-session one — the kind of primitive you'd never derive from "just use a smarter model."
RALPH LOOP
┌────────────────────────────────┐
│ goal.md (completion criteria) │
└────────────────┬───────────────┘
│
▼
┌────────────────────────────────┐ ┌─────────────────────┐
│ fresh context ──▶ agent │ exit? │ hook re-injects │
│ │ ──yes──▶│ goal + state │──┐
│ reads state from filesystem │ └─────────────────────┘ │
│ writes progress to filesystem │ │
└────────────────┬───────────────┘ │
│ goal met? │
├─ no ◀───────────────────────────────────────────┘
│
▼ yes
DONE
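The loop above can be sketched in a few lines. `run_session` is a stub standing in for a real model call, and `goal_met` stands in for real verification; the shape of the loop, fresh context plus filesystem state, is the point:

```python
from pathlib import Path

GOAL = Path("goal.md")        # completion criteria, written once up front
STATE = Path("progress.md")   # the only memory that survives a context reset

def run_session(goal: str, state: str) -> str:
    """One fresh-context agent run. Stub: a real harness would invoke
    the model here; appending a line keeps the loop demonstrable."""
    return state + "step done\n"

def goal_met(state: str) -> bool:
    return state.count("step done") >= 3   # stand-in for real verification

def ralph_loop(max_iterations: int = 10) -> str:
    state = STATE.read_text() if STATE.exists() else ""
    for _ in range(max_iterations):
        if goal_met(state):
            break
        # The exit hook fires here: re-inject the original goal into a
        # brand-new context window, carrying only filesystem state forward.
        goal = GOAL.read_text() if GOAL.exists() else "finish the task"
        state = run_session(goal, state)
        STATE.write_text(state)            # progress persists via filesystem
    return state

final = ralph_loop()
```

Note the `max_iterations` budget: an unconditional loop plus a goal the agent can never satisfy is how you burn a weekend of tokens.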
Planner / Generator / Evaluator
Anthropic's long-running harness work is explicit: separating generation from evaluation into distinct agents outperforms self-evaluation, because agents reliably skew positive when grading their own work. It's GANs for prose. The related pattern is the sprint contract — generator and evaluator negotiate what "done" actually means before code gets written.
Writing down the done-condition before starting catches more scope drift than any prompt change.
PLANNER / GENERATOR / EVALUATOR
┌───────────┐ contract ┌───────────┐ submit ┌───────────┐
│ PLANNER │ ──────────────▶ │ GENERATOR │ ─────────────▶│ EVALUATOR │
│ decompose │ │ implement │ │ verify │
└───────────┘ └───────────┘ └─────┬─────┘
▲ │
│ reject + reasons │
└─────────────────────────────────────────────────────────┘
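A sketch of the split, with stubs in place of real model calls (the first attempt is deliberately buggy so the evaluator has something to reject; every function name here is illustrative):

```python
def planner(task: str) -> str:
    """Write the done-condition down before any code exists."""
    return f"DONE WHEN add(2, 3) returns 5  # contract for: {task}"

def generator(contract: str, feedback: str = "") -> str:
    """Stub generation; a real harness calls the model here. The first
    attempt is deliberately buggy so the evaluator has work to do."""
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"   # plausible-looking bug

def evaluator(contract: str, code: str) -> tuple[bool, str]:
    """A separate agent verifies against the contract; generators
    grading their own work skew positive."""
    scope: dict = {}
    exec(code, scope)
    ok = scope["add"](2, 3) == 5
    return ok, "" if ok else "add(2, 3) != 5; reread the contract"

def sprint(task: str, max_rounds: int = 3) -> str:
    contract = planner(task)
    feedback = ""
    for _ in range(max_rounds):
        code = generator(contract, feedback)
        ok, feedback = evaluator(contract, code)
        if ok:
            return code
    raise RuntimeError("contract not met within budget")
```

The reject-with-reasons edge in the diagram is the `feedback` string: rejection text flows back into the next generation, not into a void.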
Hooks — the enforcement layer
Hooks separate "I told the agent to do X" from "the system enforces X." A script runs at a specific lifecycle point: before tool call, after file edit, before commit, on session start. The right place for things the agent should never forget but often does — typecheck after edit, block rm -rf and git push --force, require approval before opening a PR.
Success is silent, failures are verbose. — HumanLayer
If typecheck passes, the agent hears nothing. If it fails, the error text gets injected into the loop and the agent self-corrects. Almost-free in the common case, directly actionable when something goes wrong.
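A hedged sketch of both hook types. Lifecycle names and payload shapes vary by harness; `pre_tool_hook` and `post_edit_hook` are illustrative, and `py_compile` stands in for a real typechecker:

```python
import re
import subprocess
import sys

# Patterns the agent should never run, however politely it asks.
BLOCKED = [r"\brm\s+-rf\b", r"git\s+push\s+--force"]

def pre_tool_hook(command: str) -> None:
    """Before every bash call. Blocking here is enforcement, not
    instruction: the agent cannot forget a hook."""
    for pattern in BLOCKED:
        if re.search(pattern, command):
            # Nonzero exit: the harness cancels the call and feeds
            # this message back into the model's context.
            sys.exit(f"blocked by hook: matches {pattern!r}")

def post_edit_hook(paths: list[str]) -> str:
    """After every file edit. Success is silent; failure text is
    verbose and lands back in the loop."""
    result = subprocess.run(
        [sys.executable, "-m", "py_compile", *paths],  # stand-in typechecker
        capture_output=True, text=True,
    )
    return "" if result.returncode == 0 else result.stderr
```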
AGENTS.md Discipline
The flat markdown rulebook at the root of your repo is the single highest-leverage configuration point — it lands in the system prompt every turn. Two hard-won rules: keep it short, and earn each line.
Every mistake becomes a rule
The most important habit in harness engineering is treating agent mistakes as permanent signals. Not one-off stories to laugh about, not bad runs to retry. Signals.
If the agent ships a PR with a commented-out test and you merge it by accident, that's an input. The next AGENTS.md says "never comment out tests; delete them or fix them." The next pre-commit hook greps for .skip( and xit( in the diff. The next reviewer subagent flags commented-out tests as a blocker.
Every line in a good AGENTS.md should be traceable back to a specific thing that went wrong.
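That pre-commit grep can be sketched as a diff scanner (`scan_diff` is an illustrative name; the patterns are the ones from the failure above):

```python
import re

# Each pattern traces back to a PR that slipped through review.
SUSPECT = [re.compile(p) for p in (r"\.skip\(", r"\bxit\(")]

def scan_diff(diff: str) -> list[str]:
    """Flag added lines that disable tests instead of fixing them."""
    findings = []
    for line in diff.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            for pattern in SUSPECT:
                if pattern.search(line):
                    findings.append(line)
                    break
    return findings

diff = "+ it.skip('flaky on CI', () => {})\n- it('works', () => {})"
assert scan_diff(diff)   # nonempty findings: the hook fails the commit
```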
Pilot's checklist, not style guide
- Keep it short. HumanLayer keeps theirs under 60 lines. Every line competes for attention; more rules make each rule matter less.
- Earn each line. Rules trace to a specific past failure or hard external constraint. Ratchet — don't brainstorm.
- Add only when you've seen real failure. Don't pre-write "principles." Wait for the regression, then encode it.
- Remove when redundant. When a more capable model no longer needs a rule, it's load-bearing for nothing; take it out.
Same discipline for tools
Each tool's name, description, and schema gets stamped into the prompt every request. Ten focused tools outperform fifty overlapping ones — the model can hold the menu in its head. Sloppy or malicious MCPs can prompt-inject your agent before you've typed anything; tool descriptions are trusted text the model will read.
Harness-as-a-Service
We're moving from building on LLM APIs (which give you a completion) to building on harness APIs (which give you a runtime). Claude Agent SDK, Codex SDK, OpenAI Agents SDK — all point in the same direction.
The default path shifts
The old default: build your own loop, wire your own tool-calling, handle your own conversation state, invent your own approval flow. The new default: pick a harness framework, configure it along the four pillars (system prompt, tools, context, subagents), and put the rest of your effort into domain-specific prompt and tool design.
That's what makes "skill issue" tractable. You're not rebuilding an agent from scratch every time something goes wrong — you're tuning a configuration surface that's already well-factored.
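What that configuration surface might look like, as a sketch. `HarnessConfig` is a hypothetical shape, not any particular SDK's API; real frameworks differ in names, not in structure:

```python
from dataclasses import dataclass, field

@dataclass
class HarnessConfig:
    """The four pillars as one configuration surface. Hypothetical
    shape; real SDKs differ in names, not in structure."""
    system_prompt: str
    tools: list = field(default_factory=list)       # small, focused menu
    context_policy: str = "compact-at-80-percent"   # when to summarise
    subagents: dict = field(default_factory=dict)   # role -> brief

config = HarnessConfig(
    system_prompt="You are a coding agent. AGENTS.md rules follow.",
    tools=["bash", "read_file", "edit_file", "web_search"],
    subagents={"planner": "decompose the task",
               "evaluator": "verify against the contract"},
)
```

The iteration loop then lives in the values, not the plumbing: tighten the prompt, prune the tool list, swap a subagent brief, and rerun.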
Good agent building is an exercise in iteration. You can't do iterations if you don't have a v0.1. — Viv Trivedy
Harnesses don't shrink, they move
The naive story: better models make harnesses obsolete. If the model can plan, no planner. If the model is coherent at long horizons, no context resets.
What actually happens: the ceiling moves with the model. Tasks that were unreachable are in play, and they have their own failure modes. The anxiety scaffolding goes away (Sonnet 4.5 wrapping up early as it approached its context limit — fixed in Opus 4.6), and in its place you need a multi-day memory policy, a harness coordinating three specialised agents, evaluators for design quality in generated UIs.
Every component in a harness encodes an assumption about what the model can't do on its own. — Anthropic Engineering
The model-harness training loop
Today's agent products are post-trained with harnesses in the loop. The model gets specifically better at the actions the harness designers think it should be good at: filesystem operations, bash, planning, subagent dispatch. That's why Opus 4.6 feels different inside Claude Code than in someone else's harness, and why changing a tool's logic sometimes causes strange regressions. A genuinely general model wouldn't care whether you used apply_patch or str_replace — but co-training creates overfitting.
MODEL ↔ HARNESS TRAINING LOOP
┌──────────────────┐ ┌──────────────────┐
│ primitive │ ──────▶ │ standardised │
│ found in harness │ │ in product │
└──────────────────┘ └─────────┬────────┘
▲ │
│ ▼
┌──────────────────┐ ┌──────────────────┐
│ next-gen model │ ◀────── │ used in next │
│ better at it │ │ training run │
└──────────────────┘ └──────────────────┘
Look at the top coding agents side by side — Claude Code, Cursor, Codex, Aider, Cline. They look more like each other than their underlying models do. The industry is slowly finding the load-bearing pieces of scaffolding that turn a generative model into something that can ship.
References
The canon — read these in order if you want the full picture.
VIV TRIVEDY
Anatomy of an Agent Harness — coined the term, derived the components from behaviour. Plus HaaS framing.
ANTHROPIC
Harness design for long-running apps — the cleanest public breakdown of long-horizon harness design. Compaction, context resets, P/G/E.
HUMANLAYER
Skill issue — harness engineering for coding agents — the configuration-not-weights reframe. AGENTS.md discipline.
ADDY OSMANI
Agent Harness Engineering — the synthesis post that pulled the threads together. Source for this guide.
SIMON WILLISON
Designing agentic loops — agent as "tools in a loop to achieve a goal." Bash-first thinking.
FAREED KHAN
Building Claude Code with harness engineering — annotated architecture diagram.
Deploy & Run
How this page itself was built — single-file HTML, deployed to Cloudflare Pages via wrangler CLI from Claude Desktop with mcp-server-commands.
# Create the Pages project (idempotent)
export CLOUDFLARE_ACCOUNT_ID=691fe25d377abac03627d6a88d3eeac9
wrangler pages project create harness-engineering-guide \
  --production-branch main

# Write index.html, deploy
mkdir -p /tmp/harness-engineering-guide
# ... write index.html ...
cd /tmp/harness-engineering-guide
wrangler pages deploy . \
  --project-name harness-engineering-guide \
  --branch main \
  --commit-dirty=true