Architecture inside - Apr 2026

Carless Engines

Why the harness matters more than the model when you are building real AI products.

1. What matters more — the model or the harness?

Simple question. If you sat down today to build an AI product, which would you pick first — the model or the harness?

Most people pick the model. It makes sense. Models are what shows up in the news. SWE-bench, MMLU, GPQA, arena rankings. Claude 4.7 vs GPT-5.4 vs Gemini 3.1 Pro. New release every two weeks, +2 points on the benchmark, the forums explode.

The harness, by contrast, is invisible. Technical glue. Wrapper. “It’s just a prompt and some tool calls.”

This is the most common misunderstanding in the industry right now. Here’s why it’s wrong.

The chat era is already over. In 2023, everyone “used ChatGPT.” Open chat, ask a question, paste the answer. Chat was the interface, the model, and the product, all at once. It worked for simple tasks — rewrite an email, explain a concept, draft a list. But who actually extracted real value beyond the everyday use case? A pretty narrow slice: writers and marketers offloading first drafts, teachers, a handful of advanced technical users with multi-page prompts.

That’s it. Past that, you hit a ceiling. To make an LLM genuinely useful in production — to have it edit your codebase, draft a jurisdiction-specific legal document, or run a sales conversation — “asking the model” isn’t enough. You need an environment the model operates in. Context, tools, checks, state, permissions.

The choice paradox that nobody explains. OpenRouter exists. 300+ models behind one API. Cline, Aider, Forge — bring-your-own-model. You can plug any model into any tool. Freedom is maximal. And yet, concentration is increasing.

Metric                                         Number     Source
Engineering teams using AI daily               73%        Pragmatic Engineer, Feb 2026 (15K developers)
Senior devs naming Claude Code “most loved”    46%        Pragmatic Engineer (Cursor — 19%, Copilot — 9%)
Claude Code annualized run-rate at 9 months    $2.5B      Anthropic (fastest in product history)
Top-3 model gap on SWE-bench Verified          2–3 pts    Princeton SWE-bench

Maximum freedom of model choice. And specific products built around specific models keep winning. Why?

Because the model is a programming language. The harness is the product.

Nobody picks software based on what it’s written in. You can fanboy over Rust at an interview, but at work you’ll open Slack and VS Code. Not “an Electron app” or “a TypeScript app.” Just a product that solves the problem.

AI is exactly the same. And that’s what the rest of this piece is about — how that harness is built, why it matters more than it looks, and why the value layer there will keep growing for at least five more years.

2. What a harness is and why you need one

A naked model is an API call. Send text, receive text. No memory, no tools, no checks. If something breaks, you debug it. If a task takes five steps, you stitch them together yourself.

A harness is everything wrapped around the model. Tool orchestration, context management, safety, state, error recovery, observability.

Mitchell Hashimoto’s analogy, the one that made the term stick in February: the model is the engine. The harness is the chassis, the brakes, the steering, the dashboard. The engine doesn’t go anywhere on its own.

Why the term surfaced now. Models got good enough to be useful, but not reliable enough to leave alone. Without a harness, an agent in a loop will confidently make the same dumb mistake over and over. With a harness, it catches itself.

A harness is an operating system for an agent. Not “a wrapper around an API.”

3. Harness vs orchestrator — where the line is

These two terms get confused constantly. Let’s split them.

An orchestrator is a tool that conducts multiple components, agents, or services to complete a task. LangGraph, n8n, Temporal, AutoGen. The focus is coordination between nodes. It’s like the assembly line at a factory.

A harness is the environment around one agent (or one agentic system) that makes it reliable. Context, tools, safety, recovery. The focus is the agent’s survival on a long-running task. It’s like one worker’s station: tools at hand, safety gloves, a cart, a shift log.

They overlap. A modern advanced harness has a micro-orchestrator inside it. Claude Code calls them “swarms” — sub-agents coordinating through natural language. ForgeCode has three built-in agents (Forge, Sage, Muse), each with its own role. But the reverse isn’t true: bare LangGraph without context management or compaction isn’t a harness.

In April 2026, Anthropic published “Effective harnesses for long-running agents” — about how an agent should live across context window boundaries when a task runs for hours. Their answer: an initializer agent sets up the environment, follow-up agents pick up via claude-progress.txt plus git history. That’s harness thinking in its purest form. Not “how do we make the model smarter.” Rather: “how do we shape the environment so that an inconsistent model produces a consistent result.”
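
A minimal sketch of that handoff, assuming a plain-text progress file. The file format and the helper names here are my own; only the claude-progress.txt-plus-git-history pattern comes from the post:

    import subprocess
    from pathlib import Path

    PROGRESS = Path("claude-progress.txt")

    def initializer_done(summary: str, next_steps: list[str]) -> None:
        # The first agent records what it finished and what remains.
        lines = ["DONE:", summary, "", "NEXT:"] + [f"- {s}" for s in next_steps]
        PROGRESS.write_text("\n".join(lines))

    def followup_context(max_commits: int = 10) -> str:
        # A fresh agent rebuilds its working context from the progress
        # file plus recent git history instead of a saved transcript.
        log = subprocess.run(
            ["git", "log", f"-{max_commits}", "--oneline"],
            capture_output=True, text=True,
        ).stdout
        return f"Previous progress:\n{PROGRESS.read_text()}\n\nRecent commits:\n{log}"

The point of the pattern: no agent ever depends on the previous agent’s context window surviving. Everything that matters is externalized into files the next agent can read.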

4. What’s inside a harness — six layers

Now let’s look at what actually sits inside a modern harness. Six layers, from the model outward. I pieced this picture together from three sources: the dissected Claude Code leak, the OpenDev paper (arXiv 2603.05344), and LangChain’s posts from the past two months.

        ┌──────────────────────────────┐
        │  6. Observability            │  ← traces, evals, metrics
        │  ┌────────────────────────┐  │
        │  │  5. State & Sessions   │  │  ← worktrees, checkpoints
        │  │  ┌──────────────────┐  │  │
        │  │  │  4. Safety       │  │  │  ← permissions, sandbox
        │  │  │  ┌────────────┐  │  │  │
        │  │  │  │ 3. Memory  │  │  │  │  ← compaction, reminders
        │  │  │  │ ┌────────┐ │  │  │  │
        │  │  │  │ │2. Tools│ │  │  │  │  ← 19+ permission-gated
        │  │  │  │ │┌──────┐│ │  │  │  │
        │  │  │  │ ││  1.  ││ │  │  │  │  ← prompt = constructor
        │  │  │  │ ││ MODEL││ │  │  │  │
        │  │  │  │ │└──────┘│ │  │  │  │
        │  │  │  │ └────────┘ │  │  │  │
        │  │  │  └────────────┘  │  │  │
        │  │  └──────────────────┘  │  │
        │  └────────────────────────┘  │
        └──────────────────────────────┘

1. The system prompt. Not a single file — a constructor. Claude Code has dozens of conditional blocks that get assembled dynamically per request: identity, tool descriptions, environment info, git context, language preferences, skills, output style. When people share a “leaked system prompt,” that’s a simplification. There’s no real final prompt — it’s reassembled every turn.
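
A toy version of that constructor, assuming a simple env dict. The block names mirror the list above; everything else is illustrative, not Claude Code’s actual assembly logic:

    def build_system_prompt(env: dict) -> str:
        # Identity is always present; every other block is conditional.
        blocks = ["You are a coding agent working in the user's repository."]
        if env.get("tools"):
            blocks.append("Available tools:\n" + "\n".join(
                f"- {name}: {desc}" for name, desc in env["tools"].items()))
        if env.get("git_status"):
            blocks.append(f"Current git status:\n{env['git_status']}")
        if env.get("language", "en") != "en":
            blocks.append(f"Always respond in {env['language']}.")
        if env.get("output_style"):
            blocks.append(f"Output style: {env['output_style']}")
        return "\n\n".join(blocks)

    # Reassembled per request, so the same session can see different
    # prompts as the environment changes underneath it.
    print(build_system_prompt({
        "tools": {"read_file": "read a file from disk"},
        "git_status": "on branch main, 2 files modified",
    }))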

2. Tools. Claude Code has 19 of them. Each is sandboxed, with permission gates. Sub-agents (internally “swarms”) coordinate via natural language, not via a DAG. That’s a deliberate engineering choice: text-based coordination is easier to test and easier to explain to the model itself.
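
What a permission gate can look like in miniature. The policy table, the tool names, and the ask-by-default fallback are assumptions for illustration, not Claude Code’s actual scheme:

    from enum import Enum

    class Gate(Enum):
        ALLOW = "allow"
        ASK = "ask"
        DENY = "deny"

    # Per-tool policy; anything unlisted falls back to asking the user.
    POLICY = {"read_file": Gate.ALLOW, "bash": Gate.ASK, "web_fetch": Gate.DENY}

    def dispatch(name: str, args: dict, registry: dict) -> dict:
        gate = POLICY.get(name, Gate.ASK)
        if gate is Gate.DENY:
            return {"error": f"tool {name!r} is blocked by policy"}
        if gate is Gate.ASK and input(f"Allow {name}({args})? [y/N] ").lower() != "y":
            return {"error": "user declined"}
        return {"result": registry[name](**args)}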

3. Memory and context. The hardest layer. Includes compaction (auto-summarizing old messages as you approach the limit), system reminders (event-driven injections that fight attention decay), persistent memory (CLAUDE.md, AGENTS.md), tool result optimization (stripping noise from tool outputs). Claude Code uses a three-tier memory architecture, with MEMORY.md as a pointer index — about 150 chars per line — at its core.
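
A sketch of the compaction trigger. count_tokens and summarize are injected stand-ins for a tokenizer and a summarization model call, and the 80% threshold is illustrative:

    def maybe_compact(messages, budget, count_tokens, summarize, keep_tail=10):
        # Fire only when the transcript nears the window; fold the old
        # turns into one summary message and keep the recent tail verbatim.
        used = sum(count_tokens(m["content"]) for m in messages)
        if used < 0.8 * budget or len(messages) <= keep_tail:
            return messages
        head, tail = messages[:-keep_tail], messages[-keep_tail:]
        summary = summarize(head)  # one model call over the old turns
        return [{"role": "system",
                 "content": f"Summary of earlier conversation:\n{summary}"}] + tail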

4. Safety. Plan-Execute-Verify, permission popups, sandboxing. A delightful detail: Claude Code’s user-frustration detector is a regex. Patterns like wtf, this is broken, so frustrating get caught by regular expressions, and the next response’s tone is adjusted accordingly. No model call. This is the deliberate “cheapest tool that solves the problem” principle: a $0 regex beats a $0.01 LLM call when the accuracy is comparable.
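
Roughly what that detector can look like. The pattern list below is illustrative (built from the examples above), not the leaked one:

    import re

    FRUSTRATION = re.compile(
        r"wtf|this is broken|so frustrating|are you kidding",
        re.IGNORECASE,
    )

    def tone_hint(user_message: str) -> str | None:
        # Zero model calls: a match injects a one-line steering reminder
        # into the next turn's system context.
        if FRUSTRATION.search(user_message):
            return ("The user sounds frustrated. Acknowledge it briefly, "
                    "skip preamble, and focus on fixing the problem.")
        return None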

5. State and sessions. Git worktrees, conversation trees, checkpoints, the ability to resume from a prior point. pi.dev stores sessions as actual trees — you can rewind to any branch of the conversation and continue from there.
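
A minimal session tree in the same spirit. This is a sketch of the idea, not pi.dev’s implementation:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        message: dict
        parent: "Node | None" = None
        children: list["Node"] = field(default_factory=list)

    class SessionTree:
        def __init__(self):
            self.cursor = self.root = Node({"role": "system", "content": "start"})

        def append(self, message: dict) -> Node:
            node = Node(message, parent=self.cursor)
            self.cursor.children.append(node)
            self.cursor = node
            return node

        def rewind(self, node: Node) -> None:
            # Move the cursor back; the next append() opens a new branch
            # while the old branch stays intact.
            self.cursor = node

        def transcript(self) -> list[dict]:
            # Walk root-to-cursor: this path is what the model sees.
            path, n = [], self.cursor
            while n:
                path.append(n.message)
                n = n.parent
            return list(reversed(path))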

6. Observability. Telemetry, traces, token metrics, evals. LangSmith for the LangChain stack, in-house infrastructure at Anthropic. Without this layer, you can’t tell why your agent got worse this week.
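
The cheapest possible version of that layer: a tracing decorator that emits one JSON line per tool call. Real stacks ship these records to LangSmith or an in-house store; the fields here are a minimal assumption:

    import functools, json, time

    def traced(tool_fn):
        @functools.wraps(tool_fn)
        def wrapper(*args, **kwargs):
            t0 = time.monotonic()
            result = tool_fn(*args, **kwargs)
            print(json.dumps({
                "tool": tool_fn.__name__,
                "latency_ms": round((time.monotonic() - t0) * 1000, 1),
                "result_chars": len(str(result)),
            }))
            return result
        return wrapper

    @traced
    def read_file(path: str) -> str:
        return open(path).read()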

This is not “a prompt + a while-loop + an API call.” This is a full distributed system, with a dedicated subsystem in every layer. And this is what sits between “a naked model from OpenRouter” and “a product people pay $200/month for.”

The system prompt is reassembled every turn, tools sit behind permission gates, memory compresses on the fly, safety is sometimes solved by a $0 regex, state lives in a git worktree, and without observability you have no idea why your agent is worse this week than last.

While we’re on the subject — one of the most interesting open-source examples is ForgeCode. Rust, MIT, 6,000+ GitHub stars, multi-agent design (Forge writes code, Sage does read-only research, Muse generates plans), supports 300+ models.

Their headline number: 81.8% on TermBench 2.0, officially #1 in the world.

Plot twist. Except TermBench is ForgeCode’s own benchmark, on their own domain. On the independent SWE-bench Verified from Princeton, the gap collapses to 2.4 points. The harness design is genuinely good, no quibbles there. The numbers — many quibbles.

The takeaway in one line: vendor benchmarks prove nothing. Take 5–10 representative tasks from your own workflow, run them through 3 different harnesses with the same model — that gives you the answer.
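
Sketched as code, with run_one standing in for whatever each harness exposes (CLI, API, SDK); replace its body with your own runner:

    def run_one(harness: str, model: str, task: str) -> bool:
        raise NotImplementedError("invoke your harness here, return pass/fail")

    TASKS = [
        "fix the failing test in tests/test_parser.py",
        "add pagination to the /users endpoint",
        # ... 5-10 tasks representative of your actual work
    ]

    def compare(harnesses: list[str], model: str) -> dict[str, float]:
        # Same tasks, same model, different harnesses: whatever gap
        # remains is the harness, not the model.
        return {
            h: sum(run_one(h, model, t) for t in TASKS) / len(TASKS)
            for h in harnesses
        }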

5. Adaptive harnesses vs token efficiency

The most common objection to complex harnesses: adaptivity is expensive. Session trees, checkpoints, evaluation steps, elaborate system prompts — they all eat tokens before any work even starts. Multi-agent systems multiply the context by the number of agents. LangChain confirmed this experimentally: a “reasoning sandwich” (xhigh-high-xhigh) on gpt-5.2-codex added +13.7 points on TermBench, but burned 2× more tokens.

Sounds reasonable. And it’s an illusion, one created by poorly designed harnesses.

A well-designed adaptive harness uses adaptivity to save tokens:

Technique                                              Savings       Source
Adaptive context compaction (every 10–15 tool calls)   22.7%         OpenDev paper, on SWE-bench
Code execution via MCP instead of tool-by-tool         up to 98.7%   Anthropic experiment
Stripping noise from tool results                      39.9–59.7%    Augment Code analysis
Lazy tool discovery + dynamic skill loading            varies        OpenDev, ForgeCode
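
For a concrete feel of the third row, a sketch of tool-output stripping. The noise patterns are illustrative; tune them to your own tools:

    import re

    NOISE = [
        re.compile(r"^\s+at .+:\d+:\d+\)?$"),   # deep stack-trace frames
        re.compile(r"^npm WARN "),               # installer warnings
        re.compile(r"^Compiling .+ v[\d.]+"),    # build-system chatter
    ]

    def strip_tool_output(raw: str, head: int = 30, tail: int = 30) -> str:
        # Drop known noise, then keep only the head and tail of what's
        # left; the middle is replaced by a one-line marker.
        lines = [l for l in raw.splitlines()
                 if not any(p.match(l) for p in NOISE)]
        if len(lines) <= head + tail:
            return "\n".join(lines)
        omitted = len(lines) - head - tail
        return "\n".join(lines[:head]
                         + [f"... [{omitted} lines omitted] ..."]
                         + lines[-tail:])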

Where the trade-off is real. Multi-agent systems with isolated contexts lose amortization on repeated queries — every sub-agent boots fresh. Over-structured plan-execute-verify pipelines pay the full re-plan cost when a plan goes stale. Reflexion-style self-critique loops double the latency.

A different counter-example — pi.dev by Mario Zechner. His philosophy: “Adapt pi to your workflows, not the other way around.” “Primitives, not features.” Pi deliberately ships without sub-agents, plan mode, permission popups, background bash, or built-in to-dos. All of those are TypeScript extensions, installed on demand. A minimal system prompt = a clean prompt cache = fewer default tokens = more room for the user’s actual task. Adaptivity through minimalism, not through complexity.

The thesis. The “adaptivity vs token efficiency” trade-off is a myth. A competent adaptive harness throws things away, it doesn’t pile them on. Every token-saving technique you’ll find in the literature — compaction, lazy loading, code execution, RAG instead of raw context — is adaptivity.

What’s expensive isn’t adaptivity. What’s expensive is naive adaptivity. When a team piles on features as decoration without ever pruning the old ones. Don’t confuse complexity with competence.

6. Coding harnesses are just the start

Coding harnesses dominate the news cycle not because they’re the most important. They got there first for three reasons.

Clean feedback signal. Code either compiles or it doesn’t. Tests either pass or they don’t. That’s an ideal optimization signal for a harness. Other domains don’t have it — there’s no writing-bench, and certainly no medical-bench with that kind of clarity.

A paying audience. Developers will pay $100–200 a month for a tool that saves them an hour a day. That gives a startup unit economics on day one.

High value per successful operation. An hour of senior developer time saved is $50–150. If the harness saved an hour, it paid for itself the same day.

But behind coding harnesses, every other vertical is coming. This isn’t a forecast. It’s already happening.

Law. Harvey was launched in 2022 by a former O’Melveny & Myers litigator and a research scientist from Google DeepMind. By April 2026:

  • 700,000 agentic tasks executed daily
  • 50 million term extractions per week
  • Through self-improvement loops, agents went from 41% to 88% on complex legal tasks
  • $200M funding round in March 2026 at an $11B valuation (GIC + Sequoia)
  • A&O Shearman (formerly Allen & Overy) deployed Harvey to 4,000 staff across 43 jurisdictions, with 2,000 lawyers using ContractMatrix daily
  • 2–3 hours saved per week on routine work, contract review time cut by 30%

“A surplus of intelligence bottlenecked by judgment.” — Gabe Pereyra, Harvey, April 2026

Same logic as for programmers: value is migrating from execution to judgment.

The first casualties are visible too. In February 2026, Baker McKenzie announced layoffs of around 700 business services staff, partly citing AI. Per Crunchbase, 79% of all legal startup investment in the past year (about $2.2B) went specifically into AI categories. Forrester modeled, for LexisNexis, AI tooling in the in-house legal team of a hypothetical $10B company: external counsel volume drops 13%, time on internal legal queries drops 25%, paralegals save 50% on admin tasks.

In parallel — regulatory pressure. On February 10, 2026, a federal judge in New York ruled in United States v. Heppner: documents typed into a consumer AI tool are not protected by attorney-client privilege or the work-product doctrine. The implication: lawyers now need enterprise harnesses like Harvey, with private infrastructure, audit trails, and a compliance layer — because Claude.ai in a browser is now a legal liability for confidential material.

That’s the moment an industry transitions from “let’s try ChatGPT” to “we need a properly engineered harness with guarantees.” The same will happen in finance, biotech, customer ops, education. Anywhere knowledge work needs orchestration.

But the most interesting case is where the task isn’t “right/wrong” — it’s “beautiful/ugly.” In March 2026, Anthropic published “Harness design for long-running application development,” in which Prithvi Rajasekaran from their Labs team described how they pull Claude out of “safe, predictable, technically functional but visually unremarkable” layouts. The approach is pure harness engineering, not model fine-tuning:

  • Split the generator from the evaluator. A generator agent writes the design. A separate critic agent evaluates it. Tuning a skeptical evaluator is far more tractable than getting a generator to honestly criticize its own work.
  • Four grading criteria baked into both agents’ prompts: design quality (does it cohere as one thing?), originality (is there evidence of custom decisions or is it library defaults?), visual hierarchy, content fit.
  • A feedback loop with explicit signal. The generator iterates against specific criteria, not against vague “make it better.” (A minimal sketch of this loop follows the list.)
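
The sketch: generate and critique are assumed wrappers around two differently prompted model calls, and the 1–5 scoring scale is illustrative; only the generator/critic split and the four criteria come from the post:

    CRITERIA = ["design quality", "originality", "visual hierarchy", "content fit"]

    def design_loop(brief, generate, critique, rounds=3, bar=4):
        # The critic returns a score per criterion; the generator gets
        # targeted feedback on the weakest one, never "make it better".
        draft = generate(brief, feedback=None)
        for _ in range(rounds):
            scores = critique(draft, CRITERIA)  # e.g. {"originality": 2, ...}
            if min(scores.values()) >= bar:
                break
            weakest = min(scores, key=scores.get)
            draft = generate(brief,
                             feedback=f"Raise {weakest!r}: it scored "
                                      f"{scores[weakest]}/5. Be specific.")
        return draft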

The key insight from the post: “aesthetics can’t be fully reduced to a score, but they can be improved with grading criteria that encode design principles.” In other words — even when there’s no SWE-bench and no compiler, the harness still wins by decomposing taste into checkable signals.

And that closes the picture. Coding harnesses handle tasks with verifiable results (code works or it doesn’t). Legal harnesses handle tasks with regulated results (audit trails, citation verification). Design harnesses handle tasks with subjective results (taste). Three different problem types, the same engineering pattern: decompose the task, split the roles, build feedback between them.

And here’s the key point — look at what specifically shifts inside the profession. According to Clio’s 2025 report, 69% of paralegal tasks are work AI can do or significantly accelerate. But firms that deployed AI didn’t cut paralegal headcount. They expanded their throughput: the same number of people now handles 40% more matters. Faster client turnaround, higher satisfaction, less drudgery. Paralegals who learned to operate inside a harness earn $5–15K more per year than those still doing it by hand.

That is exactly the same pattern as for programmers. The narrow “code monkey” role — manually translating logic into syntax — is shrinking. The new role — an operator who can verify AI output, catch hallucinations, write working prompts, and understand the harness’s limits — is more expensive and in higher demand.

Harness engineering principles travel. Persistent context as rules-files. Citation verification as a tool layer. Compaction as mandatory context hygiene. Permission gates on every tool that touches the outside world. Audit trails for compliance. Eval infrastructure that catches regressions before prod.

Coding harnesses did the dirty work. Every adjacent industry inherits the patterns ready-made.

Closing

In 2023, the move was “ask ChatGPT.” In 2024, “give it skills.” In 2025, “vibe code.” In 2026, the right question reads:

What harness are you building, and what constraints does it actually enforce?

If you’re shipping a product on top of an LLM — your real competitor isn’t on the ML team. They’re on the team that knows how to assemble a context window, how to compress tool results without losing meaning, how to set up evals that catch regressions before prod, and how to keep the agent from making the same mistake twice in a loop.

Models are converging. The top three are 2–3 points apart on any reasonable benchmark. Which means product differentiation has moved to where it always lives in mature industries — into the engineering around the core.

A naked model is a naked engine. No chassis, no brakes, no dashboard. You can be as proud of the horsepower as you like. Nobody’s driving it anywhere.

This is the second piece in the series. The first — Unskilled Skills — on why skills so often turn into an antipattern.

If you’re interested in the vibe-coding bubble, the overheated market, and why the $4.7B valuation is a symptom rather than a cause — that’s coming as a separate long piece. Follow along.
