
Codex 5.5: Don't Trust the Version Number

Two weeks since GPT-5.5, codenamed SPUD, went live, I've more or less stopped opening Claude Code. What surprised me wasn't that it got a few percentage points stronger—it was that after several long-standing weaknesses were fixed all at once, the design philosophy of agents became clear for the first time.

Jiawei Guan · 7 min read

OpenAI quietly launched GPT-5.5 on April 23, codenamed SPUD (potato). The version number was conservative—5.4 to 5.5, a mere 0.1 bump. After two weeks of use, my gut feeling is that the actual leap is far wider than that. It is the first base model fully retrained since GPT-4.5, not a fine-tune on top of 5.4.

The change in my own behavior is the most direct proof. I was a heavy Claude Code user; now I barely open it. The tab is still there, used only in a handful of scenarios. The rest of the time, I'm in Codex 5.5.

The reason for switching isn't that it's 5% or 10% better. It's that several of Claude Code's most frustrating weaknesses were fixed in one go with 5.5.

1. The Moments That Made You Want to Hit "Deny" Are Gone

Anyone who's used Claude Code has felt this: when you first start, it asks for permission for everything. Can it read this file? Run this command? Connect to the internet? Every step requires a click.

Then people realized: they were basically clicking "Allow" every time. So they flipped the switch to dangerously bypass permissions and let everything through by default.

In a way, this reflected how mature the tool had become: you hadn't genuinely wanted to hit "Deny" in a long time. It doesn't go rogue, and at least it doesn't easily break things. But it also often struggled to get things right on the first try. Claude's hallmark is speed: it can show you something within minutes, but when you review the output, obvious problems are sitting right there; it simply missed them.

This "fast but inaccurate" experience is the biggest weakness of this generation of Claude Code.

What surprised me about 5.5 is that it fixed several long-standing weaknesses all at once.

The process started speaking human. Previously, Codex's intermediate steps read like an opaque dump of middleware output; even if you stared at them, you couldn't tell what was going on. After 5.5, it tells you what it is doing at each step and why. This doesn't sound complicated, but it makes a huge difference for long-running tasks.

No more dithering. Before, it felt like a task would try left, then try right, taking a huge detour before finding the right spot. Now its judgment is clearer; it knows when to accelerate. If it sees a task is waiting, it will go do something else in parallel and prepare what the next step needs ahead of time. I had never seen this behavior before, and it surprised me the first time. End-to-end execution efficiency took a step up because of it.
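The scheduling behavior described above can be sketched with plain async concurrency. This is my own illustrative toy, not Codex internals; all function names are assumptions:

```python
import asyncio

async def slow_build() -> str:
    # stands in for a long-running step the agent must wait on,
    # e.g. a compile or a test suite
    await asyncio.sleep(0.1)
    return "build-ok"

async def prefetch_next_step() -> str:
    # stands in for preparing what the next step needs,
    # e.g. reading the files the follow-up edit will touch
    return "context-ready"

async def run() -> tuple[str, str]:
    # Launch both at once: the prep work overlaps with the
    # build's wait time instead of running after it.
    return tuple(await asyncio.gather(slow_build(), prefetch_next_step()))

print(asyncio.run(run()))
```

The point is only the structure: waiting time on one step becomes working time on the next.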

Next-step suggestions suddenly became valuable. Codex has always had a habit: at the end of every task, it offers a next step. In the GPT-5.4 era, this suggestion was often filler. After 5.5, I found its proposals became quite reliable. Working continuously on one project for eight hours, I basically clicked "continue" every time. This means its ability to grasp "high-value directions" has come online.

The old compact problem is finally solved. I didn't use 5.4 deeply enough to be sensitive to compact issues. But after switching over in earnest, long sessions inevitably require context compression. Codex's compression is much more stable than Claude Code's. Most information survives the squeeze; with Claude Code, a single compact could wipe out half of what you needed.

This has been studied and compared. Codex hands the entire conversation to the model to rewrite into a structured "handoff summary" carrying metadata and tool states; Claude Code uses a hierarchical human-readable summary, which is legible to the naked eye but inherently lossy. Both approaches have their merits, but in actual long sessions, my sense is that Codex loses less information.

The experience in the 5.4 era, where compact often interrupted due to network issues, is also completely fixed. Sessions can run continuously for an entire day with almost no attention needed.
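The two compaction styles compared above can be caricatured in a few lines. This is purely illustrative; the field names and structure are my assumptions, not either tool's actual format:

```python
def structured_handoff(messages: list[dict]) -> dict:
    """Codex-style: rewrite the whole conversation into a structured
    summary that explicitly carries metadata and tool state."""
    return {
        "goal": messages[0]["content"],
        "tool_state": {m["tool"]: m["result"]
                       for m in messages if m.get("tool")},
        "open_threads": [m["content"] for m in messages
                         if m.get("unresolved")],
    }

def prose_summary(messages: list[dict]) -> str:
    """Claude-Code-style: a human-readable summary. Legible to the
    naked eye, but exact tool results can get paraphrased away."""
    return " ".join(m["content"] for m in messages[-3:])

history = [
    {"content": "Optimize the matmul kernel"},
    {"tool": "bench", "result": "baseline 41ms", "content": "ran benchmark"},
    {"content": "tiling change pending", "unresolved": True},
]

compact = structured_handoff(history)
print(compact["tool_state"])  # the exact benchmark number survives as a field
```

In the structured version, machine-relevant state has a reserved slot; in the prose version, it survives only if the summarizer happens to mention it.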

2. The Shape Has Changed

Back in 5.4, research capability was already quite strong. Once 5.5 fixed the weaknesses, its entire shape changed.

I can now have it complete complex tasks with a very high degree of automation—provided the task can be verified end-to-end. For example, performance tuning at the operator layer. These scenarios have fine granularity, fast iteration, and A/B comparability. 5.5 runs them very professionally. It proactively designs queries, runs improvements, and then verifies whether the next step actually got better. The whole chain pushes forward on its own. Without a human present, it can run dozens of rounds continuously.

It's now hard to trip it up inside these dense loops. The numbers OpenAI published support this intuition: 82.7% on Terminal-Bench 2.0, 51.7% on FrontierMath tiers 1–3, and 95% on ARC-1 reasoning tasks. Feedback from researchers is even more direct: people are starting to let it run experimental variants overnight and coming back in the morning to see completed sweep dashboards.

It's far from perfect, of course. I've noticed two clear ceilings.

The first is big-picture systemic thinking. Ask it to look at the global picture from an overall architectural level, and its field of view is limited; the solutions it gives often stay local.

The second is innovation and creativity. If you completely hand it off, the solution it gives will most likely be by-the-book. You have to throw your own ideas at it first and let it expand and evaluate along your line of thought.

But even with these shortcomings, the whole thing already feels different. You used to assume that if you didn't watch it, it would drift off course. Now you find that after running for several hours, looking back, the vast majority of what it did was reasonable. On the main thread of writing code, it spontaneously prioritizes verification and cautiously advances step by step, rather than charging forward recklessly.

Seeing this side of it suddenly made one thing clear to me: the design philosophy of agents has actually already solidified.

3. "Assume the Next Step Will Fail" Is a Philosophy

From Claude Code to Codex, the design of this class of coding agents is built on the same assumption: the next step might fail.

The entire engineering effort is constructed around this assumption. How do I do this step well given that the next step might fail? If it really fails, do I have a path to roll back or bypass? Tool call didn't succeed? Try a different angle. What was written doesn't match expectations? Go back and re-verify.
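The questions above translate almost directly into a harness shape. A minimal sketch, with all names illustrative and the "verify" and "fallback" hooks standing in for real deterministic infrastructure:

```python
def run_step(action, verify, fallback, max_retries: int = 2):
    """Run one step under the assumption it may fail: retry on errors,
    re-verify the result, and fall back to another path if needed."""
    for _ in range(max_retries + 1):
        try:
            result = action()
        except Exception:
            continue                 # tool call didn't succeed: try again
        if verify(result):
            return result            # step holds up: advance
        # result doesn't match expectations: go back and retry
    return fallback()                # roll back or bypass

# toy demo: an action that fails once before succeeding
attempts = {"n": 0}
def flaky_write():
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise RuntimeError("transient tool error")
    return "patch applied"

out = run_step(flaky_write, verify=lambda r: "applied" in r,
               fallback=lambda: "reverted to last known-good state")
print(out)
```

Note how little of this is the model: the retry counter, the verifier, and the rollback path are all deterministic scaffolding wrapped around a single unreliable call.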

It sounds simple, but this is what 98% of the code in Claude Code's engineering does. Permission gates, context management, tool routing, error recovery logic—most of it is deterministic engineering infrastructure. The model itself is just one component in this harness.

What 5.5 showed me is that this philosophy is starting to bear fruit. In its execution, "verify first, then advance" has become a spontaneous choice. It cares about making sure every step holds up, not about how fast it can sprint.

This is the so-called "slow is fast."

Our previous expectation of agents was always "Hey, charge straight through and give me a result in half an hour." Then you look back and it's riddled with errors; the time spent fixing things is 3× or 5× the original. It feels good in the short term, but in the long run it's actually the slowest approach.

Once every small step is steady, the thing to do is figure out how to make it fast. Fast mode, parallel subtasks, 24/7 uninterrupted operation—these multipliers only work if the underlying steps don't go wrong.

4. Why I Say This Is the Eve of an Explosion

Stringing together the facts above:

  • It can already surpass a considerable portion of domain experts.
  • It can work tirelessly for hours on end.
  • It can do fairly complex research-grade work, as long as the task can be verified end-to-end.
  • Its cost hasn't risen significantly. For the same Codex task, 5.5 uses even fewer tokens than 5.4.

What does this mean? It means the explosion will happen first in domains that are fine-grained, verifiable, and iterable: operator optimization, performance tuning, research in specific subdomains, automated workflows. Add a tireless colleague whose reasoning is near top-expert level and who can design its own experiments, and the speed of progress in these areas will be startling.

Beyond that line—big-picture judgment for large systems, original ideas, cross-domain synthesis—the human role remains vital. But with every minor version iteration, the model shifts a little further in that direction.

Three months ago, I was having dinner with a friend who posed a question: coding agents are so strong now, but have you actually seen people without an engineering background ship things with them?

At the time I said, "It'll happen gradually." Now it's already happening. A yoga studio owner with zero programming experience used Lovable to build their own booking system in two hours—login, scheduling, payments, all included—and launched it that same week. Examples like this are multiplying. Inside NVIDIA, over ten thousand employees use Codex for daily work, spanning engineering, legal, finance, and operations. Codex is no longer just a tool for engineers.

I believe the next step will go even further. These non-engineering people will start shipping real research results. They'll first appear in those fine-grained, verifiable, iterable subdomains.

5.4 showed me capability gains. 5.5 showed me the path. Agents evolving under this design philosophy will genuinely transform the shape of knowledge production.

If the current pace of development isn't interrupted, the knowledge explosion will arrive faster than many expect.

A leader from China's Ministry of Industry and Information Technology (MIIT) once asked me which directions are worth watching. I said at the time to keep an eye on the GPT-5 line; its capability trajectory is different from Claude Code and from the earlier generation of lightweight agents, and the types of problems it can solve are different too. After 5.5 came out, I confirmed this judgment again. Two different evolutionary paths need to be viewed separately.

Next, there will definitely be a large wave of interesting products and results shipped to the market. We'll see in three months.

