When the Harness Is a Mess, Restart

As I used an agent to build an inference engine from scratch, the project grew messier and the agent's performance degraded. The solution was always the same: stop, redefine the goal, open a new repo, and start over. The same model acted like a different entity. Researchers at Microsoft and Salesforce simulated more than 200,000 conversations and found the same thing: a model on the wrong path won't turn back.

Jiawei Guan · 5 min read

I mentioned earlier that I'd been using an agent to build an inference engine from scratch. By the third week, I realized something: this project was busted.

Not the "won't compile" kind of broken. Documentation was exploding. Experimental scripts from seven or eight directions were crammed into one repo, benchmark results scattered across a dozen markdown files. Every time the model started work, it spent half a day just figuring out where it had left off. I watched it flit between directions, scratching the surface of each before moving on. Give it more guidance and it obediently followed along; give it less and it just spun in place.

That wasn't its usual level. Two days earlier, it had spent twelve hours connecting a 78-layer pipeline from scratch.

Lost in the Chaos

I kept running into this same situation, and the conclusion was always the same: once the context turns into chaos, the model can't find its way out. I almost never saw it climb out on its own. Every time, I had to tear the whole thing down.

This isn't gut feel; someone measured it. Microsoft and Salesforce simulated over 200,000 multi-turn conversations last year, testing 15 models. The conclusion was right there in the abstract: once an LLM takes a wrong step in conversation, it gets lost and won't recover. Average performance dropped 39%. The breakdown is more interesting: capability only dropped 16%, but unreliability shot up 112%. The model didn't get much dumber; it became wildly unstable. On the same question, sometimes it answered great, sometimes terribly. That matches exactly what I saw: it's not that it couldn't do it, it just couldn't perform.
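To make the split between capability and unreliability concrete, here is a minimal Python sketch. It assumes percentile-based definitions: capability ("aptitude") as the 90th-percentile score over repeated runs of the same task, and unreliability as the gap between the 90th and 10th percentiles. That is my reading of how the paper measures them, not a copy of its code. The scores are invented for illustration and are not the paper's data.

```python
from statistics import mean


def percentile(scores: list[float], p: float) -> float:
    """Percentile with linear interpolation between neighbouring ranks; p is in [0, 100]."""
    if not scores:
        raise ValueError("need at least one score")
    ordered = sorted(scores)
    rank = (len(ordered) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarize(scores: list[float]) -> dict[str, float]:
    """Average, aptitude (best-case level) and unreliability (best-to-worst spread)."""
    p90 = percentile(scores, 90)
    p10 = percentile(scores, 10)
    return {"average": mean(scores), "aptitude": p90, "unreliability": p90 - p10}


# Hypothetical scores (0-100) for ten runs of the same task. Not data from the paper.
single_turn = [90, 88, 92, 85, 91, 89, 87, 93, 90, 86]
multi_turn = [85, 20, 80, 35, 88, 15, 70, 40, 82, 25]

for name, runs in (("single-turn", single_turn), ("multi-turn", multi_turn)):
    stats = summarize(runs)
    print(name, {key: round(value, 1) for key, value in stats.items()})
```

In this made-up sample the best runs barely move, while the spread between the best and worst runs widens sharply. That is the shape of the result described above: the model can still do the task, it just can't do it consistently.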

Chroma's Context Rot report landed another blow: even for a task as simple as fishing one sentence out of a pile of text, the longer the input, the more all 18 models' performance dropped. And the drop wasn't uniform. Context is a finite resource. Anthropic calls it the attention budget. Every extra token you stuff in burns part of it.

The messier your repo, the faster this budget burns.

After the Restart

The solution was always the same: stop, figure out what the goal actually is, open a new repo with fresh context, write down this one goal and the execution path clearly, and start over. Note: "this one." After a restart, you can basically only lock onto one direction. Running multiple lines in parallel is a luxury at this point.
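The restart brief can be as simple as one text file. The sketch below is one possible shape for it, not a prescribed format: the function name, the section titles, and the example goal, steps, and lessons are all hypothetical placeholders. The only rule it enforces is the one that matters here: exactly one goal.

```python
def render_restart_brief(goal: str, steps: list[str], lessons: list[str]) -> str:
    """Build the text of a brief for a fresh repo: one goal, an ordered path, carried-over lessons."""
    goal = goal.strip()
    if not goal or "\n" in goal:
        raise ValueError("state exactly one goal, on a single line")
    if not steps:
        raise ValueError("the execution path needs at least one step")

    lines = ["GOAL", goal, "", "EXECUTION PATH"]
    lines += [f"{number}. {step}" for number, step in enumerate(steps, start=1)]
    if lessons:
        lines += ["", "CARRIED OVER FROM THE PREVIOUS ATTEMPT"]
        lines += [f"- {lesson}" for lesson in lessons]
    return "\n".join(lines) + "\n"


# Placeholder content, only to show the shape of the output.
brief = render_restart_brief(
    goal="Get one model running end to end on the new engine",
    steps=["Load the weights", "Run a single forward pass", "Compare output against a reference"],
    lessons=["Keep benchmark results in one file", "Do not explore a second direction in this repo"],
)
print(brief)
```

Rejecting a multi-line goal is a crude check, but it forces the decision about which direction survives the restart to be made before the new repo exists.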

The change is staggering. The same model, spinning in a mess the day before, acts like a different species after the restart: digging deep, actively hunting for improvements, pushing forward with efficiency that isn't even on the same scale. My rough estimate is a three- to fivefold difference, and it's probably not even linear. In a chaotic environment it might never reach the goal. In a clean environment it's genuinely converging on the target.

The Microsoft paper's advice for users boils down to two things: if you have time, restart the session; before you do, have the model consolidate what it knows and carry it over. They tested it: merging information scattered across multiple turns into a single turn and re-feeding it restores performance to 95% of a single-turn session. Anthropic does the same thing with its multi-agent research system: when context is almost full, spin up a new agent with clean context and hand off properly.
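Here is a rough sketch of that consolidate-and-hand-off pattern. Everything in it is an assumption made for illustration: the word-count token estimate stands in for a real tokenizer, the 80% threshold is arbitrary, and the function names are hypothetical. In practice the model itself would write the summary; the hand-rolled deduplication below only shows where that summary ends up, as the first message of a clean session.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: count whitespace-separated words."""
    return len(text.split())


def needs_handoff(turns: list[str], budget: int, threshold: float = 0.8) -> bool:
    """True once the conversation has used at least `threshold` of the context budget."""
    used = sum(estimate_tokens(turn) for turn in turns)
    return used >= threshold * budget


def consolidate(goal: str, facts: list[str]) -> str:
    """Merge facts scattered across turns into a single opening message, dropping exact repeats."""
    seen: set[str] = set()
    unique: list[str] = []
    for fact in facts:
        key = fact.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(fact.strip())
    bullets = "\n".join(f"- {fact}" for fact in unique)
    return f"Goal: {goal}\n\nWhat we know so far:\n{bullets}\n"


# Hypothetical conversation state.
turns = ["try approach A", "approach A fails on long inputs", "try approach B instead"]
if needs_handoff(turns, budget=15):
    first_message = consolidate(
        goal="Make the pipeline handle long inputs",
        facts=["Approach A fails on long inputs", "approach A fails on long inputs", "Approach B is untested"],
    )
    print(first_message)  # this becomes turn one of the fresh session
```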

The industry calls this compaction, or context engineering, or whatever. The name doesn't matter. It all comes down to one thing: whatever environment you give the model, that's the performance you get.

Rewriting Is Suicide, Restarting Is Not

There's an old iron law in software engineering, written by Joel Spolsky over twenty years ago: never rewrite from scratch. Netscape decided to rewrite its browser, went three years without a major release, and handed the market to IE. This iron law ruled the industry for two decades.

But it rests on a premise: rebuilding is extremely expensive. A team spends three years rewriting, and competitors don't stop and wait.

Agents all but eliminate rebuilding costs. A new repo, a new context, a re-clarified goal: half a day's work. Take the conclusions and pitfalls from before, organize them into documents, and carry them over. When rebuilding drops from three years to half a day, the iron law flips: patching a rotten context is far more expensive than restarting.

So the chaos of exploration that looks wasted isn't actually wasted. The clear goal at restart grows directly out of that first round of flailing: which assumptions held, which directions turned out to be dead ends. Without that chaos, you couldn't have written the goal that clearly. Code can be thrown away; understanding is carried forward.

The Weight of the Harness

The word "harness" has burned its way from the evaluation community to the engineering community over the past two years. METR wrote scaffolding into its evaluation methodology back in 2023; this February, OpenAI published a piece on harness engineering that redefined an engineer's job as "designing the environment, expressing intent, and building feedback loops"; in April, Martin Fowler's site gave the most concise definition: the harness is everything in an agent except the model itself.

How heavy is that weight? The Terminal-Bench 2.0 leaderboard spells it out: the same Opus 4.6, run with different harnesses, scores anywhere from 58 to 76. That's a gap of 18 percentage points. Under the same harness, swapping GPT-5 for GPT-5.2 raises the score by 19 percentage points, almost exactly the same amount. The difference between a good and a bad harness is about the size of a full model generation.

Addy Osmani put it crudely but accurately: an average model with a good harness can beat a top-tier model with a bad harness.

My own version is even cruder: the environment you give the model is its ceiling.

Human Value Is Rising

There's another feeling that's been growing stronger these past few months: humans are becoming more valuable.

Deciding whether to tear it all down and restart is on the human. Converging from a pile of messy exploration results to "what is the real problem to solve" is also on the human. Too many times these past few months, I've watched an agent fail to find its way out of a fuzzy, complex problem. There's a line in that OpenAI blog post I strongly agree with: Humans steer. Agents execute.

So an agent isn't a wishing machine; it's a very strong employee. If you lead well, it can produce outstanding results; if you toss the work and walk away, it probably won't produce much. Models keep getting stronger; Fable 5's engineering capability took another step up, and every generation pushes the boundary you can cross further out. But the "leading" part: no one can do that for you yet.

Closing Words

When it's a mess, restart. These five words are the most valuable lesson I've accumulated from using agents on complex projects these past few months.

Models will keep getting stronger. But the environment is yours to provide, and the decision to tear it all down is yours to make.

The cheaper intelligence becomes, the more valuable clarity is.


References

  1. Laban et al., "LLMs Get Lost In Multi-Turn Conversation", arXiv:2505.06120, 2025-05, https://arxiv.org/abs/2505.06120
  2. Chroma, "Context Rot: How Increasing Input Tokens Impacts LLM Performance", 2025-07-14, https://research.trychroma.com/context-rot
  3. Anthropic, "Effective context engineering for AI agents", 2025-09-29, https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
  4. Anthropic, "How we built our multi-agent research system", 2025-06-13, https://www.anthropic.com/engineering/multi-agent-research-system
  5. Terminal-Bench 2.0 Leaderboard, https://www.tbench.ai/leaderboard/terminal-bench/2.0 (retrieved 2026-06-12)
  6. OpenAI, "Harness engineering: leveraging Codex in an agent-first world", 2026-02, https://openai.com/index/harness-engineering/
  7. Birgitta Böckeler, "Harness engineering for coding agent users", martinfowler.com, 2026-04-02, https://martinfowler.com/articles/harness-engineering.html
  8. Addy Osmani, "Agent Harness Engineering", O'Reilly Radar, 2026-05-15, https://www.oreilly.com/radar/agent-harness-engineering/
  9. Joel Spolsky, "Things You Should Never Do, Part I", 2000-04-06, https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/
  10. METR, "Evaluating Language-Model Agents on Realistic Autonomous Tasks", 2023-08, https://metr.org/blog/2023-08-01-new-report/
