When the Harness Gets Messy, Start Over

I mentioned in the previous post that I was using an agent to build an inference engine from scratch. By the third week, one thing became clear: the project had gone rotten.

Not the kind of mess where code won't run. Docs were multiplying. Experimental scripts from seven or eight different directions were all crammed into one repo, with benchmark results scattered across a dozen Markdown files. Every time the model started work, it burned half a day just figuring out where it had left off. I watched it jump between directions, scratching the surface of each before moving on. More guidance, and it tagged along obediently; less guidance, and it spun in place.

That wasn't its true capability. Just two days earlier, it had spent over ten hours stitching together a 78-layer pipeline from scratch.

Lost in the Chaos

I kept running into this same pattern, and the takeaway was always the same: once context gets messy, the model can't find its way out. I almost never saw it climb out of the mess on its own. Every time, I had to flip the table.

This isn't just my gut feeling; someone actually measured it. Last year Microsoft and Salesforce simulated over 200,000 multi-turn conversations across 15 models. The conclusion is right there in the abstract: once an LLM takes a wrong turn in a conversation, it gets lost and does not recover. Average performance drops 39%. Break it down and it gets more interesting: capability drops only 16%, but unreliability spikes 112%. The model didn't get much dumber; it became far less stable. The same question might get a great answer or a terrible one. That's exactly what I saw. It's not that the model can't do the work; it can't do it consistently.
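To make the split between capability and unreliability concrete, here is a minimal sketch. It assumes capability is read off as a high percentile of scores over repeated runs of the same task, and unreliability as the gap between a high and a low percentile. The 90th/10th cut-offs reflect my reading of the paper, and the scores are made up for illustration; none of this is data from the study.

```python
def percentile(scores, p):
    """Linearly interpolated p-th percentile (0 <= p <= 100) of a list of scores."""
    if not scores:
        raise ValueError("need at least one score")
    ordered = sorted(scores)
    rank = (len(ordered) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def aptitude(scores):
    """Best-case ability: the score on a good run (assumed: 90th percentile)."""
    return percentile(scores, 90)


def unreliability(scores):
    """Spread between good and bad runs (assumed: 90th minus 10th percentile)."""
    return percentile(scores, 90) - percentile(scores, 10)


# Made-up scores for ten repeated runs of the same task, for illustration only.
clean_runs = [82, 84, 85, 86, 88, 88, 89, 90, 91, 92]
messy_runs = [30, 45, 52, 60, 68, 74, 79, 83, 85, 88]

for label, runs in (("clean", clean_runs), ("messy", messy_runs)):
    print(label, round(aptitude(runs), 1), round(unreliability(runs), 1))
```

In the made-up numbers, the best runs in the messy setting are only a little worse than in the clean one, while the spread is several times wider. That is the shape of the result: the ceiling barely moves, the floor collapses.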

Chroma's Context Rot report lands another blow: even with a task as simple as fishing one sentence out of a pile of text, the longer the input, the worse all 18 models perform—and the decline isn't uniform. Context is a finite resource. Anthropic calls it the attention budget: every extra token eats into it.

The messier your repo, the faster that budget burns.
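A rough way to see how fast the budget burns is to estimate how many tokens the agent would spend just reading the notes lying around the repo. The sketch below is a back-of-the-envelope estimate, not a tokenizer: the four-characters-per-token ratio is a crude rule of thumb for English text, and the function names and file suffixes are my own choices for illustration.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # crude rule of thumb for English text; an assumption, not a tokenizer


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Rough token count derived from character length."""
    return len(text) // chars_per_token


def repo_reading_cost(root, suffixes=(".md", ".txt")):
    """Map each matching file under root to its estimated token count."""
    costs = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            costs[path.relative_to(root).as_posix()] = estimate_tokens(text)
    return costs


def budget_share(costs, context_window):
    """Fraction of the context window used if the agent reads every file once."""
    return sum(costs.values()) / context_window
```

Calling `repo_reading_cost(".")` at the repo root and passing the result to `budget_share` with your model's actual context window gives a single number to watch. When it climbs with every session, the mess is costing you context before any real work starts.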

After the Restart

Every time, the fix is the same: stop, think hard about what the goal actually is, open a new repo with fresh context, spell out that single goal and the execution path, and start over. After a restart, you can realistically commit to only one direction. Running multiple tracks in parallel is a luxury you no longer have.

The change is night and day. The same model that was spinning in the mess the day before acts like a different species after the restart: digging deep, proactively finding improvements, pushing forward at a completely different scale. My rough estimate is a three- to five-fold difference, and it's probably not linear. In chaos it might never hit the goal; in a clean environment it's actually converging on the target.

The advice from that Microsoft paper for users boils down to two things: if you have time, start a new session; before restarting, have the model consolidate everything it knows and bring it along. They tested it: merging info scattered across multiple turns into one and re-feeding it restores performance to 95% of a single-turn baseline. Anthropic does the same thing with its multi-agent research system: when context is nearly full, spin up a new agent with clean context and hand over properly.
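The consolidation step can be sketched mechanically. The version below is a simplification of my own: instead of asking the model to summarize, it keeps only what the user said, drops verbatim repeats, and folds everything into one fresh single-turn prompt. The `(role, text)` turn format and the output layout are assumptions for illustration, not the paper's implementation.

```python
def consolidate(turns):
    """Merge the user's scattered instructions into one single-turn prompt.

    `turns` is a list of (role, text) pairs. Only user turns are kept, so the
    model's earlier wrong guesses are not carried into the new session.
    """
    seen = set()
    requirements = []
    for role, text in turns:
        cleaned = " ".join(text.split())  # collapse stray whitespace
        if role != "user" or not cleaned:
            continue
        key = cleaned.lower()
        if key not in seen:  # drop repeats, keep first-occurrence order
            seen.add(key)
            requirements.append(cleaned)
    if not requirements:
        raise ValueError("no user instructions to consolidate")
    task, *details = requirements
    lines = [f"Task: {task}"]
    if details:
        lines.append("Requirements gathered so far:")
        lines.extend(f"- {item}" for item in details)
    return "\n".join(lines)
```

The design choice that matters is what gets left behind: the assistant's intermediate answers never make it into the new prompt, which is the whole point of restarting with clean context.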

The industry calls it compaction, context engineering, whatever; the name doesn't matter. The point is simple: the environment you give the model determines the performance you get back.

Rewriting Is Suicide; Restarting Is Not

There's an old iron law in software engineering, written by Joel Spolsky over twenty years ago: never rewrite from scratch. Netscape decided to rewrite its browser, went three years without shipping a major release, and handed the market to IE. This iron law ruled the industry for two decades.

But it rests on a single premise: rebuilding is prohibitively expensive. When a team spends three years rewriting, the competition doesn't wait.

Agents kill the cost of rebuilding. New repo, fresh context, re-clarifying the goal—half a day's work. The conclusions and pitfalls from before can be organized into documents and carried over. When rebuilding shrinks from three years to half a day, the iron law flips: patching a rotten context costs more than starting fresh.

So the chaotic exploration that looks like wasted time actually isn't. The clarity you bring to a restart is precisely what grew out of the first round of fumbling around: which assumptions held, which directions were dead ends. Without that chaos, you would have had no clarity to write down. Throw away the code; keep the knowledge.
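The knowledge worth carrying into the new repo fits a small, fixed shape: one goal, the execution path, what was confirmed, and which directions were dead ends. Here is a minimal sketch of such a handoff document. The `RestartBrief` class, its field names, and the rendered layout are hypothetical, just one way to force yourself to write a single goal.

```python
from dataclasses import dataclass, field


@dataclass
class RestartBrief:
    """Everything the fresh context gets from the old one (hypothetical layout)."""
    goal: str
    execution_path: list = field(default_factory=list)
    findings: list = field(default_factory=list)
    dead_ends: list = field(default_factory=list)

    def render(self):
        goal = self.goal.strip()
        if not goal or "\n" in goal:
            raise ValueError("state exactly one goal, on one line")
        sections = [
            ("Execution path", self.execution_path),
            ("What we confirmed", self.findings),
            ("Dead ends, do not revisit", self.dead_ends),
        ]
        lines = [f"Goal: {goal}"]
        for title, items in sections:
            if items:  # skip empty sections to keep the brief short
                lines.append("")
                lines.append(f"{title}:")
                lines.extend(f"{n}. {item}" for n, item in enumerate(items, 1))
        return "\n".join(lines) + "\n"
```

Writing `RestartBrief(goal=..., execution_path=[...], dead_ends=[...]).render()` to a single file at the root of the new repo gives the agent one place to start from. The check on `goal` is deliberate: if the goal doesn't fit on one line, the thinking isn't done yet.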

The Weight of the Harness

The word "harness" has spread from the eval community into engineering over the past two years. In 2023, METR wrote scaffolding into its evaluation methodology. This February, OpenAI published a piece on harness engineering, redefining an engineer's job as "designing the environment, expressing intent, and building feedback loops." In April, Martin Fowler's site gave the most concise definition: the harness is everything in an agent except the model itself.

The Terminal-Bench 2.0 leaderboard shows how much it matters: the same Opus 4.6 scores anywhere from 58 to 76 depending on the harness, a gap of 18 percentage points. Under the same harness, upgrading from GPT-5 to GPT-5.2 gains only about 19 percentage points. Harness design matters about as much as a full generation of models.

Addy Osmani put it crudely but accurately: an average model with a good harness can beat a top-tier model with a bad harness.

My own version is even cruder: the environment you give the model is its ceiling.

Human Value Is Rising

Another feeling that's grown stronger these past few months: humans are becoming more valuable.

Deciding whether to flip the table and restart is a human call. Distilling a pile of messy exploration results into "what is the real problem to solve" is also a human job. I've seen too many times these past months that agents can't find their way out of ambiguous, complex problems on their own. There's a line in that OpenAI blog post I strongly agree with: Humans steer. Agents execute.

So an agent isn't a wishing well; it's a very capable employee. Lead it well, and it can deliver outstanding results; toss it the work and walk away, and it probably won't produce much. Models keep getting stronger—Fable 5's engineering capability took another step up, and every generation pushes the boundary of what you can tackle. But the leading part—no one can do that for you yet.

Closing Thoughts

When it's a mess, restart. These five words are the most valuable lesson I've picked up over these months of using agents on complex projects.

Models will keep getting stronger. But the environment is yours to provide, and the decision to flip the table is yours to make.

The cheaper intelligence becomes, the more valuable clarity is.

References

Laban et al., "LLMs Get Lost In Multi-Turn Conversation", arXiv:2505.06120, 2025-05, https://arxiv.org/abs/2505.06120
Chroma, "Context Rot: How Increasing Input Tokens Impacts LLM Performance", 2025-07-14, https://research.trychroma.com/context-rot
Anthropic, "Effective context engineering for AI agents", 2025-09-29, https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Anthropic, "How we built our multi-agent research system", 2025-06-13, https://www.anthropic.com/engineering/multi-agent-research-system
Terminal-Bench 2.0 Leaderboard, https://www.tbench.ai/leaderboard/terminal-bench/2.0 (accessed 2026-06-12)
OpenAI, "Harness engineering: leveraging Codex in an agent-first world", 2026-02, https://openai.com/index/harness-engineering/
Birgitta Böckeler, "Harness engineering for coding agent users", martinfowler.com, 2026-04-02, https://martinfowler.com/articles/harness-engineering.html
Addy Osmani, "Agent Harness Engineering", O'Reilly Radar, 2026-05-15, https://www.oreilly.com/radar/agent-harness-engineering/
Joel Spolsky, "Things You Should Never Do, Part I", 2000-04-06, https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/
METR, "Evaluating Language-Model Agents on Realistic Autonomous Tasks", 2023-08, https://metr.org/blog/2023-08-01-new-report/