These days I use a lot of models daily. One thing I've noticed: every model has a different personality. Even within the same company, the style can change completely from one generation to the next. Figuring out who excels at what, and where each tends to fail, has itself become part of the skill of collaborating with them.
Opus 4.6 — Reckless, But Gets the Job Done
The one I use most is still Claude Opus 4.6.
It's strong at engineering, executes efficiently, and often comes up with solid approaches to complex problems. It's quick to diagnose bugs, saving a lot of time compared to other models. When doing architecture design, it actively asks about your intent—it doesn't just start coding, but first digs into six different dimensions for you to judge before getting its hands dirty. When it comes to aligning with human intent, it does better than any other model I've used.
The downsides are just as obvious.
It has zero aesthetic sense for UI. The web pages it builds are ugly and stiff, and when you ask it to fix them, it doesn't know which direction to go. This model is pretty funny.
Another problem is recklessness. It charges ahead before fully understanding things. You test it—wrong. You tell it it's wrong, it looks back and realizes it missed this or that, fixes it, tests again—still wrong. This back-and-forth happens a lot. It looks fast, but sometimes that speed is hollow; it gets overly optimistic that it's done, wraps up without checking, and calls it a day.
But the ecosystem is well-built. Claude Code plus Chrome extensions, PowerPoint plugins, and Excel plugins—one account covers a lot of ground. It's still the first choice for automation.
GPT 5.4 — Can Chew Through Hard Problems, But Doesn't Listen Well
I used Codex 5.3 before. It was mediocre overall, though good at catching bugs and decent at code review. 5.4 is stylistically very different from 5.3.
My impression is that it's not necessarily efficient, but it's comprehensive. Tasks that Opus fails at once, twice, three times—5.4 can nail in one go. Of course, it takes longer, grinding away through constant self-testing, checking, and revising.
Human alignment is weaker. I asked it to write an architecture design doc at maximum effort; it finished without asking many questions. One look, and it was very different from what I had in mind. Later, I had it write code according to the doc—partway through it drifted again. The local parts were neat, but the overall direction was off. Suddenly it conjured up some bizarre module and kept polishing it. I killed the session.
A lot of people have told me the same thing: 5.4 always wants to innovate, solving problems in weird ways. You give it SOPs and guidance, and it doesn't necessarily follow them.
But its computer use shocked me. I had it help configure a WeChat customer service backend—no Chrome extension, pure browser automation. I just scanned a QR code to let it into the backend, and it clicked through a bunch of configurations, found something wrong, went back to fix it, back and forth for about thirty minutes, and got it all set up. I still don't know how it figured it out.
There's another fun workflow. I have Opus write something, then have 5.4 find flaws and review it. 5.4 is sharp with criticism. Then I throw 5.4's comments back at Opus, turn on Max effort, and let it handle it. Every time, Opus says something like "this comment is extremely sharp, precise, and makes sense," then fixes it obediently. Haha, they actually respect each other.
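This write→review→revise loop is easy to wire up as plain orchestration code. The sketch below is hypothetical: `ask_opus` and `ask_gpt` are stand-in stubs for whatever API or CLI you actually call (they are not real client functions), and only the control flow reflects the workflow described above.

```python
# Hypothetical sketch of the write -> review -> revise loop.
# ask_opus / ask_gpt are stubs standing in for real model calls;
# swap in your actual client code where the comments indicate.

def ask_opus(prompt: str) -> str:
    # Stub: in practice, call your Claude client here.
    return f"[Opus draft for: {prompt[:60]}]"

def ask_gpt(prompt: str) -> str:
    # Stub: in practice, call your GPT/Codex client here.
    return f"[GPT review of: {prompt[:40]}]"

def review_loop(task: str, rounds: int = 2) -> str:
    draft = ask_opus(task)                      # 1. Opus writes the first draft
    for _ in range(rounds):
        critique = ask_gpt(                     # 2. 5.4 plays critic
            f"Find flaws in this draft:\n{draft}")
        draft = ask_opus(                       # 3. Opus revises, max effort on
            f"Revise the draft to address this review:\n"
            f"{critique}\n\nDraft:\n{draft}")
    return draft

result = review_loop("Write a design doc for the cache layer")
print(result)
```

The point of the structure is the division of labor: the critic model only ever sees a draft to attack, and the writer model only ever sees a critique to address, so each stays in the role it's good at.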
So what is 5.4 good for? Gnawing through the hard problems other models can't crack, then doing final checks and review. It can also give good advice when you need deep thinking, but letting it lead the direction tends to produce over-engineering.
Gemini 3.1 Pro — Good Eye, Clumsy Hands
Google's Gemini 3.1 Pro. First off, this model has way too many hurdles to use—all kinds of verifications and roadblocks. Why does using a model have to be such a pain?
I finally got it working. It has its own ideas on frontend design. Recently I used it to update my official website. It looked at it and realized my theme was "weightlessness," but the original square animation was falling down with normal gravity, which contradicted the theme. It proactively suggested changing it to a weightless effect, and the result was pretty interesting. I'd shown other models the same page before, and none of them suggested anything like this.
Then the problem came—errors after the changes. Fix again, still errors. Fix again, more errors. It can spot what's ugly, but after changing it, it can't fix its own bugs. Throw the same error at Opus, and it investigates and fixes it in one go. Gemini tells you "all good, take a look," and you look—still broken. That's more or less how it goes.
GLM 5 — Strong Start, Weak Finish
Among domestic models, Zhipu's GLM 5 feels most like early Opus 4.5. Straightforward and no-nonsense: it gets straight to work, precise and efficient.
But around step twenty, things change. It starts deviating from the original instructions, and its global tracking ability falls noticeably behind Opus. After step twenty, it might run off into a weird branch and keep expanding there. Opus 4.6 can execute continuously for two or three hours without drifting much; GLM starts floating by step twenty. Within twenty steps, the difference is small. Beyond twenty steps, it's huge.
MiniMax 2.5 — Cheap and Fast, But Overcomplicates Simple Things
MiniMax feels pretty good—cheap and fast. But it over-engineers to a ridiculous degree—you give it a simple problem, and it produces an extremely complex solution. You look at it and think, this was clearly just a few lines of code, how did it end up like this?
Instruction following is decent, but it gives up too easily. Push it a bit and it says "I can't figure this out, try a different approach." A cost-effective alternative for daily use.
Kimi K2.5 — Smooth Talker, Clumsy Worker
I also use Kimi K2.5 quite a bit. It's fast—around a hundred tokens per second—with a nice frontend and native image understanding.
It has a lot of small issues. During execution, nothing quite goes right—frequent syntax errors. What's worse, it's great at sweet-talking: its answers sound reasonable and plausible, but they don't match how the software actually behaves. You read its response and think "hmm, that makes sense," then you run it and realize that's not how it works at all.
But when it writes articles or gives feedback, the language is conversational and fun to read. A very interesting model, but if you're relying on it for actual work, you need to verify everything extra carefully.
So What?
No single model can do everything. Sometimes swapping models works better than tuning a prompt. Opus is the main workhorse, 5.4 chews through hard problems and does review, Gemini is occasionally useful for UI opinions. Domestic models offer great value for short tasks. Understanding each model's personality is far more effective than stubbornly grinding away with just one.
