The day Anthropic launched Claude Opus 4.8, I was pretty excited.
I'd always thought Opus had solid engineering skills. Macro analysis and intent alignment were its strong suits. A few points in the 4.8 release notes caught my eye, so I tried it out on several complex tasks I had going.
The results were one disappointment after another. Here are a few examples.
Looks Promising Out of the Gate, Then Goes Off the Rails
Right around then, my Codex credits had run out. Since 4.8 looked solid on paper, I put it on a research task using that flashy feature called ultracode, a dynamic workflow that supposedly auto-orchestrates ultra-long-horizon tasks.
It started out looking solid. It ran a bunch of checks and got everything set up. Seemed reliable. That's Claude Code in a nutshell: the start always gives you hope.
Then I let it run for about a full day and night.
The task was to optimize a performance metric. The baseline was terrible: throughput sat around 0.1 tokens per second. After optimizing for a while, it bumped the number from 0.1 to 0.15 and started celebrating: "Look, I improved it by 50%! I did all this work! What a huge achievement!" It kept pointing back to that basic initial setup to claim credit, writing pages of self-congratulatory fluff.
The problem is that in the 0.1 to 0.15 range, measuring progress as a ratio is the wrong frame to begin with.
When performance is that low, the whole direction is wrong, and you have to look at absolute values. What does 0.15 tokens/s actually mean? That's how you see how far off it is. Celebrating a "50% improvement" in a setup that fundamentally can't run is like celebrating that you bailed two buckets from a sinking ship.
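To make the ratio-versus-absolute point concrete, here is a minimal sketch. The 0.1 and 0.15 tokens/s figures are from the run above; the 20 tokens/s target is purely an assumption I picked for illustration, and `progress_report` is a made-up helper, not anything the tool produced.

```python
def progress_report(baseline: float, current: float, target: float) -> dict:
    """Contrast relative improvement with absolute distance to a usable target."""
    relative_gain = (current - baseline) / baseline           # the "50%" headline number
    gap_closed = (current - baseline) / (target - baseline)   # share of the real gap closed
    shortfall = target / current                              # how many times too slow it still is
    return {
        "relative_gain": relative_gain,
        "gap_closed": gap_closed,
        "shortfall": shortfall,
    }


# 0.1 and 0.15 tokens/s come from the run; 20.0 is an assumed, hypothetical target.
report = progress_report(baseline=0.1, current=0.15, target=20.0)
print(f"relative gain: {report['relative_gain']:.0%}")
print(f"gap closed:    {report['gap_closed']:.2%}")
print(f"still short:   {report['shortfall']:.0f}x")
```

Under that assumed target, the same change that reads as roughly a 50% gain closes about a quarter of one percent of the gap and leaves throughput more than a hundred times too slow. The exact target doesn't matter much. Any plausible one tells the same story.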
Looking back, the pile of documentation it recorded, those so-called results, pretty much all had to be tossed. The direction was wrong. I had to freeze everything and start over.
This wasn't even the most surprising part. I'd always had the impression that Claude's macro capabilities were fine, good at analysis and alignment, but concrete execution, especially on research tasks, tended to go sideways. The real red flag was the engineering task that followed.
A Pure Engineering Problem, and It Was Clowning Around
It was a cloud service. A probe was reporting errors, and I asked it to diagnose the issue and fix it while it was at it. Nothing complicated.
A couple of things it did left me baffled.
First, in the middle of the analysis it randomly started counting. It had the machine echo a string of numbers, maybe a dozen in total, while I stared at the screen. I still have no idea what that was about. Pure token burning.
The second was even more absurd. The probe's purpose was straightforward: workers send periodic heartbeats returning OK to tell the platform "I'm alive and working." If the heartbeat is abnormal, take the worker offline first and don't assign it tasks. It noticed that the heartbeats had a cost and asked whether I wanted to cut that cost. I said sure.
It changed the probe to runtime --version, which just returns a version number.
I laughed out loud. This isn't cutting costs. It's destroying the design intent entirely. A version number only proves the thing is installed. It has nothing to do with whether it can actually work. Effectively, it's saying "it's installed, so go ahead and assign tasks." A model supposedly strong in engineering and alignment pulled this on a problem where the intent was crystal clear.
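Here is a rough sketch of why those two probes are not interchangeable. Everything in it is hypothetical: the `Worker` class, its `healthy` flag, and the probe functions are stand-ins I made up for illustration, not the actual service's code.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical worker; `healthy` simulates whether it can actually do work."""
    name: str
    version: str
    healthy: bool

    def report_version(self) -> str:
        # Similar in spirit to `runtime --version`: succeeds whenever the thing is installed.
        return self.version

    def heartbeat(self) -> str:
        # Goes through the real work path, so it fails when the worker is wedged.
        if not self.healthy:
            raise RuntimeError(f"{self.name}: cannot process tasks")
        return "OK"


def version_probe(worker: Worker) -> bool:
    """Passes for any installed worker, healthy or not."""
    return bool(worker.report_version())


def heartbeat_probe(worker: Worker) -> bool:
    """Passes only if the worker answers OK through its work path."""
    try:
        return worker.heartbeat() == "OK"
    except RuntimeError:
        return False


def schedulable(workers, probe):
    """Names of the workers the platform would keep assigning tasks to."""
    return [w.name for w in workers if probe(w)]


workers = [Worker("a", "1.2.3", healthy=True), Worker("b", "1.2.3", healthy=False)]
print(schedulable(workers, version_probe))    # the wedged worker still gets tasks
print(schedulable(workers, heartbeat_probe))  # the wedged worker is taken offline
```

If the heartbeat really is too expensive, the changes that keep the intent are things like a longer interval or a lighter check on the same work path. Swapping in a different signal is not one of them.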
I said forget it, let's revert to the old setup.
During the revert, something else happened. When looking for a fix, it told me "here are three options below," asking me to pick. But it never listed the three options. It just jumped straight to "I recommend option one." I scrolled back several times. The three options didn't exist at all.
At this point I was pretty certain: this model genuinely can't be counted on for this kind of task.
So I Moved All That Work to GPT 5.5
After these incidents, the direct result was simple: for research and engineering tasks, I no longer trust Claude.
Trust builds slowly and collapses fast. Earlier I thought it was still usable, but the more I used it, the more I realized asking it to do something was probably a waste of time. Not just a matter of burning a few tokens, but having to redo everything in the end. Now I don't even consider it for these tasks. Everything goes to GPT 5.5.
GPT 5.5 is genuinely strong. In my own coding and research work, it's clearly a notch above every other model I've tried.
This shows up most directly in my account setup: I'm now running 7 Codex accounts, maxing out weekly quotas on all of them, and it's still not enough. On the Claude side, I'm down to one account barely hanging on. Anthropic keeps banning my accounts, so I bought a spare just to hold onto. I never seriously used Google's.
Seven to one. That's more honest than any benchmark.
But This Isn't Just a Claude Problem
At this point I need to walk that back a bit. If this were just bashing Claude, the post wouldn't be worth reading.
Step back, and you see the top models have always taken turns in the spotlight.
Six months ago Gemini was riding high. Everyone was talking about how strong Google was. The last two generations have both felt underwhelming. Hardly anyone mentions them now. Then everyone flocked to Claude, thinking it was the strongest. But look at 4.7 to 4.8. It's been a real letdown. This round, OpenAI has clearly gotten back on its feet. GPT 5.5 is ridiculously strong.
No single company can stay on top forever. This isn't unique to the AI industry.
Chip Companies' Fate, Replayed by Model Makers at Fast-Forward
I was chatting with a friend about chips the other day. He told me how brutal strategy is in that business. Bet wrong on one generation of chips, and you might be finished.
NVIDIA nearly died. Its first chip, the NV1, bet on forward texture mapping and quadrilaterals, but the industry standardized on triangles through Microsoft's DirectX. The wrong direction meant no one wanted the product, and the company shrank from about 100 people to 40.
What saved them was Sega. Sega had commissioned NVIDIA to build the graphics chip for its console. Later both sides realized the direction was wrong, but Sega's Shoichiro Irimajiri still converted that roughly $5 million contract payment into an investment in NVIDIA. Jensen Huang later said this gave them "six months to live," just enough to survive until the RIVA 128 turned things around.
AMD was the same story. The 2011 Bulldozer architecture was a strategic mistake. Single-core performance was awful, and the company was badly wounded. By July 2015, its stock had crashed to under $2 a share. It also licensed its x86 technology to a Chinese joint venture for $293 million, largely to stop the bleeding. It didn't recover until the Zen architecture arrived in 2017.
Both are now dominant players. But looking back, no one can guarantee they'll have the last laugh. Chip cycles are long and capital-intensive. One generation every three to five years, and one wrong move means massive pressure, or even getting knocked out entirely.
Model companies move much faster. It doesn't take a whole generation. Maybe just a few model iterations, a roughly six-month window of consistently missing the mark, and a company can be pushed off the table.
Chinese Model Makers Have Already Completed a Full Rotation
Chinese model companies have already played out this entire cycle.
The first to break out and gain recognition as a top-tier player was Zhipu. In 2022, its GLM-130B was the only large model from Asia selected for Stanford's HELM evaluation, and ChatGLM was among the first open-source models in China. For a while, it was unrivaled.
Then it fell behind. By late 2024, its flagship GLM-4-Plus had been overtaken by DeepSeek-V3 and Tongyi Qianwen on public benchmarks like SuperCLUE, dropping out of the top tier. At the time, a lot of people were surprised.
Then on January 20, 2025, DeepSeek-R1 burst onto the scene. Six days later its app hit #1 on the download charts in the US and 51 other countries. On January 27, it directly tanked NVIDIA's stock, wiping out $589 billion in market cap in a single day, the largest single-day loss for any one stock in US market history. During that stretch, I felt like a bunch of model vendors were on the verge of collapse.
But Zhipu didn't leave the table. In the second half of 2025, it changed tactics, going fully open-source while narrowing its focus. It released GLM-4.5 and GLM-4.6 back to back, and its reputation clearly recovered. In January 2026, it even listed on the Hong Kong Stock Exchange. From falling behind to bouncing back, the key was staying at the table.
As a Side Note: Top Models Are Also Specializing
Beyond taking turns at the top, there's now another dynamic: division of labor.
For hardcore coding agents and research tasks, OpenAI is in a league of its own, clearly ahead of everyone else. Claude's original strong suit was precisely this area, but after several generations it failed to meet expectations and started regressing. Its strengths have instead shifted to white-collar work: writing, finance and legal, daily tasks, document research, that sort of thing. To be fair, 4.8 is a bit better than 4.7. It sounds more natural, its style moved back toward 4.6, execution is a bit more accurate, and it's actually pretty decent at writing.
Further toward the fringe you have Doubao and its ilk. Everyone knows it doesn't do serious work, but its emotional value is maxed out and its user base is terrifyingly large. There's a saying that "Claude is the American version of Doubao." I thought it was kind of funny at first, but thinking about it, it points to a different kind of divergence: some models are just good at chatting with you and giving you emotional value. That's also a category of demand.
So "which one is the strongest" isn't even the question anymore. It depends on what job you need done.
Final Thoughts: Don't Bet on Who's Strongest, Bet on Who's Still at the Table
Coming full circle, my conclusion is actually pretty optimistic.
The lead changes hands. Don't fixate on "whoever's strong now will stay strong," and don't write a company off just because it's temporarily stumbling. As long as you're still at the table, you have a chance to turn things around. Zhipu turned it around. NVIDIA and AMD both turned it around back in the day.
The real danger is leaving the table.
Anthropic has indeed been stuck these past few generations. My use of Claude is now basically limited to white-collar tasks. It's reportedly cooking up its next-gen flagship. If that comes out at the same level, things will really get dicey. It's not that any single model is bad, but in this six-month-cycle rhythm, missing the mark for several generations in a row is exactly how you get squeezed off the table.
But as long as it's still at the table, I'm not too worried. That's just how it works at the table: everyone takes turns.
References
- Crucible Moments: Nvidia — Sequoia Capital (Jensen Huang recounts the NV1 wrong bet, Sega's $5 million, and "six months to live")
- NVIDIA CEO Jensen Huang — Acquired Podcast
- AMD's Chinese joint venture Tianjin Haiguang (including the $293 million licensing fee and 2019 Entity List) — Tom's Hardware
- AMD–Chinese joint venture — Wikipedia
- How Lisa Su brought AMD back from the brink — CNN Business
- GLM-130B: An Open Bilingual Pre-trained Model (ICLR 2023, the only Asian model selected for Stanford HELM)
- Zhipu AI open-sources GLM-4.5, performance comparable to the latest Claude and DeepSeek models — The Batch (DeepLearning.AI)
- HK$57.9 billion market cap! Zhipu, the world's first large-model stock, lists in Hong Kong (02513.HK) — Securities Times
- DeepSeek-R1 Release (2025/01/20 official release page) — DeepSeek API Docs
- Nvidia's $589 Billion DeepSeek Plunge Is Largest in Market History — Bloomberg
- DeepSeek displaces ChatGPT as the App Store's top app — TechCrunch
