These days, I'm using a lot of models daily. One thing I've realized is that every model has a different temperament. Even within the same company, the style changes completely with a new generation. Figuring out who's good at what, and where each tends to fail, has become a skill in its own right.
Opus 4.6—Reckless, but Effective
The one I use most is still Claude Opus 4.6.
It's good at engineering, highly efficient in execution, and often gives solid approaches to complex problems. When it hits bugs, it diagnoses them quickly, saving considerable time compared to other models. When doing architecture design, it actively asks about your intentions: instead of diving straight into coding, it first lays out half a dozen dimensions for you to evaluate before getting to work. In terms of alignment with humans, it does the best among all models I've used.
The downsides are also obvious.
It has no aesthetic sense for UI. The web pages it produces are ugly and stiff, and when you ask it to improve them, it doesn't know where to start. This model is quite amusing.
Another issue is recklessness. It charges ahead before fully understanding things. You test it—wrong. You tell it it's wrong, it looks back and realizes it missed this or that, fixes it, tests again—still wrong. This back-and-forth happens frequently. It appears fast, but sometimes it's hollow speed; it gets overly optimistic thinking it's done, wrapping up without proper checks.
But the ecosystem is well-built. Claude Code plus Chrome extensions, PowerPoint plugins, Excel plugins—one account covers many scenarios. It's still the first choice for automation.
GPT 5.4—Can Crack Hard Problems, but Doesn't Listen Well
I used Codex 5.3 before—it was mediocre, good at catching bugs, decent at review. 5.4 is stylistically very different from 5.3.
My impression is that it's not necessarily efficient, but comprehensive. Tasks that Opus fails at once, twice, or three times, 5.4 can nail in one go. Of course, it takes longer, constantly testing, checking, and revising itself, grinding through it slowly.
Human alignment is weaker. I ran it at maximum reasoning effort to write architecture design docs; it finished without asking many questions, and the result was quite different from what I had in mind. Later, when I had it write code according to the docs, it veered off track again: locally neat, but globally off-direction. At one point it created some inexplicable module and kept polishing it. I stopped it directly.
Many people have told me that 5.4 always wants to innovate, solving problems in weird ways. You give it SOPs and guidance; it doesn't necessarily follow them.
But computer use surprised me. I had it help configure a WeChat customer service backend—no Chrome extensions, pure browser automation. I just scanned a QR code to let it into the backend; it clicked through a bunch of configurations, found errors, went back to fix them, back and forth for about thirty minutes, and got everything configured. I have no idea how it figured it out.
There's another interesting play. I have Opus write something, then have 5.4 critique and review it. 5.4 is sharp in its criticism. Then I throw 5.4's comments back to Opus, crank it to Max effort, and let it handle it. Every time Opus responds with something like "this comment is extremely sharp, precise, and valid," then obediently makes the changes. Haha, they actually respect each other.
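That write-critique-revise loop can be sketched as a small orchestration function. The model calls here are deliberately abstract callables, not any real SDK; wire `author` and `critic` to whichever clients you actually use (Anthropic, OpenAI, etc.). The "LGTM" sign-off convention is my own assumption, not something either model emits natively.

```python
# Sketch of the cross-model review loop: one model drafts,
# another critiques, and the critique is fed back to the author.
# `author` and `critic` are placeholders: prompt -> response text.

def review_loop(author, critic, task, max_rounds=3):
    """Draft with `author`, review with `critic`, revise until
    the critic signs off (assumed convention: "LGTM") or the
    round budget runs out."""
    draft = author(f"Write: {task}")
    for _ in range(max_rounds):
        critique = critic(f"Review this critically:\n{draft}")
        if "LGTM" in critique:  # assumed sign-off convention
            break
        # Hand the critique back to the author for a revision pass.
        draft = author(
            f"Revise the draft below to address the review.\n"
            f"Draft:\n{draft}\n\nReview:\n{critique}"
        )
    return draft
```

Keeping the two roles as plain callables makes the loop easy to test with fakes before spending real tokens, and lets you swap which model plays which role.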
So what's 5.4 good for? Tackling hard problems that other models can't crack, then doing final checks and review. It can also give good advice on deep thinking, but letting it lead the direction tends toward over-engineering.
Gemini 3.1 Pro—Good Eye, Clumsy Hands
Google's Gemini 3.1 Pro. First off, this model has way too many barriers to entry—all kinds of verification and obstacles. Why is using a model such a hassle?
Finally got it working. In frontend design, it has its own ideas. Recently I used it to update my official website; it looked at it and found that my theme was "weightlessness," but the original square animation was falling with normal gravity, contradicting the theme. It proactively suggested changing it to a weightless effect, and the result was quite interesting. I'd shown the same page to other models before, and none of them suggested this.
Here's the problem—errors after the changes. Fix again, still errors. Fix again, still errors. It can tell what's ugly, but after fixing it, it can't repair its own bugs. Throw the same error at Opus, it investigates and fixes it in one go. Gemini tells you "all good, take a look," and you look—still issues. That's pretty much how it goes.
GLM 5—Strong Start, Weak Stamina
Among Chinese models, Zhipu's GLM 5 feels most like early Opus 4.5. Straightforward, gets right to work, precise and efficient.
But around step twenty, things change. It starts not following the initial instructions, and its global tracking ability falls significantly behind Opus. After step twenty, it might wander onto some weird branch and keep expanding there. Opus 4.6 can execute continuously for two or three hours without drifting much; GLM starts drifting at step twenty. Within twenty steps, the difference is small; beyond twenty steps, the difference is huge.
MiniMax 2.5—Cheap and Fast, but Overcomplicates Everything
MiniMax feels pretty good—cheap and fast. But it's ridiculously over-engineered—you give it a simple problem, and it produces an extremely complex solution. You look at it and think, this was clearly just a few lines of code, how did it become this?
Instruction following is okay, but it gives up too easily. Push it a bit and it says "can't figure this out, try a different approach." A cost-effective alternative for daily use.
Kimi K2.5—Smooth Talker, Clumsy Worker
I also use Kimi K2.5 quite a bit. Fast, at one hundred tokens per second, good frontend, native image understanding.
Lots of small issues, though. When it executes, things keep going wrong: frequent syntax errors. What's worse is that it's particularly good at sweet-talking. What it says sounds reasonable, plausible but specious, and doesn't match actual software behavior. You read its response and think "yeah, that makes sense," but when you run it, that's not how it works.
However, when writing articles or giving feedback, its language is conversational and interesting to read. A very amusing model, but if you're relying on it for work, you need to verify multiple times.
So, in Conclusion
No single model can do everything. Switching models sometimes works better than tuning prompts. Opus for main work, 5.4 for tough problems and review, Gemini occasionally consulted for UI. Chinese models offer good value for money, comfortable for short tasks. Figuring out each model's temperament is far more effective than stubbornly sticking to one model.
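The division of labor above amounts to a tiny dispatch table. This is just my own illustration of the takeaway; the model names are informal labels from this post, not real API identifiers.

```python
# Illustrative dispatch table for the routing described above.
# Model names are labels, not actual API model strings.
ROUTING = {
    "main_work":    "opus-4.6",
    "hard_problem": "gpt-5.4",
    "review":       "gpt-5.4",
    "ui":           "gemini-3.1-pro",
    "short_task":   "glm-5",  # or another budget model for quick jobs
}

def pick_model(task_kind: str) -> str:
    # Unrecognized task kinds fall back to the daily driver.
    return ROUTING.get(task_kind, "opus-4.6")
```

The point is less the table itself than the habit it encodes: pick the model for the task, rather than tuning prompts to force one model into every role.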