Yesterday around midnight, Anthropic dropped Opus 4.7. I was already in bed, but once the news broke I gave up on sleep, got up, and installed it to try it out.
Hands-on Impressions of 4.7
There were no miracle moments of "something I couldn't solve before suddenly working on the first try." I just didn't have a hard problem on hand that could serve as a test case, so I couldn't verify that. But after using it in my daily workflow for a day, a few things stood out.
Instruction following is noticeably better than 4.6. Requirements in docs, instructions in prompts—4.7 is more willing to follow them to the letter. Anthropic's official announcement put it as "takes the instructions literally." You can feel it after switching over and using it for a while.
Cases of "haha, it's done" followed by code that doesn't run are also much rarer. 4.6 moves fast, but the speed is often hollow: it declares completion over-optimistically, and the code still fails when you run it. 4.7 tends to verify its own output before delivering. Its release notes mention this: "devise ways to verify its own outputs before reporting back."
There's another small change I didn't expect. I had a medium-sized project on hand that previously took many back-and-forth rounds with 4.6 plus Codex to converge. I switched to 4.7 and asked it to go through it: "see if there were any pitfalls back then." It identified several fairly serious issues that no one had realized before, and started fixing them. This improvement in "retrospective" capability feels quite noticeable even in a short time.
As a side note, 4.7 is not the fabled Mythos from the rumor mill. Anthropic has kept Mythos inside the Project Glasswing consortium over safety concerns, restricting access to members such as Amazon, Apple, Google, and Microsoft. According to official statements, 4.7 is the "less broadly capable" version, most likely a smaller, safer branch pruned from the same capability tree. The exact training details haven't been disclosed; we can only speculate.
But the pace makes me a bit nostalgic. Opus 4.6 was released on February 5 and 4.7 dropped on April 16: just over two months (70 days) for a new generation. That's faster than before. The last time I had that "holy shit, another generation already?" feeling was when GPT-4 arrived in March 2023, only a few months after ChatGPT (GPT-3.5) came out at the end of 2022.
So one habit of judgment needs adjusting: don't define problem boundaries by what the current model can do. Problems you struggle with for hours today, thinking they're impossible, might just work with the same approach three months from now when the next generation drops. This doesn't mean you can just lie back and wait for models to improve. It just means that when you're stuck, you can shelve it for a while, let it try a few more times—there's no need to write something off permanently just because it doesn't work today.
OpenClaude: First Time Contributing Code to an External Open-Source Project
A while back, the Claude Code source code leaked. People in my social circle quickly started wondering: the CLI's prompt interface and workflow are quite smooth—could we hack it to support multiple models?
I initially wanted to build it myself. I had Codex scan the Claude Code source and asked it, "How much work to abstract this layer into a multi-provider interface?" After assessing it, it told me there was a lot to change and asked if I was sure. I thought about it and decided: forget it, not worth the hassle.
Then two days later someone in my circle open-sourced exactly this project, called OpenClaude, built on top of that leaked source with multi-model support. I downloaded and installed it immediately.
It had a pile of issues right out of the gate.
First, the model effort was hardcoded to high, with no xhigh option. I use Codex on xhigh daily, so that was a blocker from the start. Second, agent calls would fail; I have a proxy environment here, and it would crash on every call. Third, it would constantly hang when opening a worktree, and fast mode couldn't be enabled either. With Codex's xhigh effort being slow to think, you'd wait ten-plus minutes each time. People on my team who tried it were immediately discouraged.
So I had Codex fix them one by one. Model selection became a three-tier provider → model → effort picker whose choice is persisted; the agent proxy handling was patched; and a switch was added to enable fast mode with xhigh. Once local testing passed, I told it to submit a PR. It organized everything and bundled all the changes into a single PR.
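To make the three-tier idea concrete, here is a minimal Python sketch of a provider → model → effort selection that is validated tier by tier and persisted as JSON. Everything in it is hypothetical: `Selection`, `CATALOG`, the provider and model names, and the settings file are illustrations of the shape of the fix, not OpenClaude's actual code or config format.

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path

# Hypothetical catalog: provider -> model -> allowed effort levels.
# These names are placeholders, not OpenClaude's real identifiers.
CATALOG = {
    "provider-a": {"model-large": ["low", "medium", "high", "xhigh"]},
    "provider-b": {"model-small": ["low", "medium", "high"]},
}


@dataclass(frozen=True)
class Selection:
    provider: str
    model: str
    effort: str

    def validate(self) -> None:
        """Check each tier against the tier above it."""
        models = CATALOG.get(self.provider)
        if models is None:
            raise ValueError(f"unknown provider: {self.provider}")
        efforts = models.get(self.model)
        if efforts is None:
            raise ValueError(f"unknown model for {self.provider}: {self.model}")
        if self.effort not in efforts:
            raise ValueError(f"{self.model} does not offer effort {self.effort!r}")


def save(selection: Selection, path: Path) -> None:
    """Persist a valid selection so the next session can restore it."""
    selection.validate()
    path.write_text(json.dumps(asdict(selection), indent=2))


def load(path: Path, default: Selection) -> Selection:
    """Return the persisted selection, or the default if missing or invalid."""
    try:
        selection = Selection(**json.loads(path.read_text()))
        selection.validate()
        return selection
    except (OSError, ValueError, TypeError):
        return default


if __name__ == "__main__":
    settings = Path(tempfile.mkdtemp()) / "selection.json"
    default = Selection("provider-b", "model-small", "high")
    save(Selection("provider-a", "model-large", "xhigh"), settings)
    print(load(settings, default))
```

The point of validating on load is that a stale file (say, an effort level a model no longer offers) falls back to a default instead of crashing at startup.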
The PR got rejected. First, a few CI checks went red. The next day I got another email—an AI PR review from the maintainer's side. It had that typical AI style, pretty verbose, but the points were correct: "The changes are fine, but you touched dozens of files in one PR. This is unreviewable; split it up and resubmit."
I realized that was fair. That evening I opened Codex on my phone and told it to split the previous PR into smaller ones based on those comments. It quickly broke it into four and submitted them.
A day later I got another email. Three had been merged into main, and one was still being wrapped up because of a CI conflict.
This is the first time my code has gone into an external open-source project. People submit PRs to my own projects too, but that's home turf. This time something that doesn't belong to me or my team accepted my changes. It feels pretty remarkable.
Honestly, I didn't make many technical judgment calls throughout this entire pipeline. I set the direction and reported the problems, but every line of code was written by the agent. The PR descriptions and splitting strategy were also reorganized by it according to the review comments. All I did was say yes, no, or keep going.
Strix Halo: A Non-Specialist Pushing Prefill to Near DGX Spark Levels
Another parallel track was performance optimization on the AMD Ryzen AI Max+ 395 (codename Strix Halo) box.
Some background first. Our team works on KTransformers, a SOSP'25 paper project developed jointly by Tsinghua University's MADSys Lab and Approaching.AI that specializes in CPU/GPU heterogeneous inference for MoE models. My initial idea was simple: adapt KTransformers to the 395 machine.
One look upstream and I hit a wall. KTransformers' KT branch runs on SGLang, and SGLang has barely been adapted for Strix Halo. Even if I got the CPU operator side working, plugging it into an SGLang that already performs poorly on this machine would still yield bad results overall. This path was a dead end.
So I changed tack. I had Codex install vLLM, llama.cpp, and SGLang, run a benchmark, and see where the baseline was first.
The model I chose was Qwen3-30B-A3B at its native BF16 precision: an MoE model with 30.5B total parameters and 3.3B active, a comfortable fit for Strix Halo's 128GB of unified memory. llama.cpp loads GGUF rather than Safetensors, so I had Codex convert the checkpoint first.
The initial baseline came back with vLLM performing best: prefill under 1,000 tps (tokens per second) and decode in the low teens. llama.cpp was second: prefill at 300 to 400 tps, noticeably behind vLLM but stable. SGLang couldn't even run properly; after a few patches it barely started, and it still performed worse than llama.cpp.
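The "fits in 128GB" claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below counts weights only, using the parameter count quoted above and the fact that BF16 stores each weight in 2 bytes. It ignores the KV cache, activations, and whatever share of unified memory the OS keeps for itself, so treat the headroom as an optimistic estimate.

```python
# Back-of-envelope weight footprint for a BF16 checkpoint.
TOTAL_PARAMS = 30.5e9        # total parameters, as quoted in the text
BYTES_PER_PARAM_BF16 = 2     # BF16 stores each weight in 16 bits
UNIFIED_MEMORY_GB = 128      # nominal capacity; usable GPU share is lower in practice


def weight_footprint_gb(params: float, bytes_per_param: int) -> float:
    """Weights only, in decimal gigabytes; excludes KV cache and activations."""
    return params * bytes_per_param / 1e9


weights_gb = weight_footprint_gb(TOTAL_PARAMS, BYTES_PER_PARAM_BF16)
headroom_gb = UNIFIED_MEMORY_GB - weights_gb
print(f"weights ~ {weights_gb:.0f} GB, headroom ~ {headroom_gb:.0f} GB")
```

Roughly 61 GB of weights leaves about half the machine for everything else, which is why BF16 without quantization is workable here at all.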
By normal reasoning this is when you pack up and go home—there wasn't much expectation anyway. But I had some free time and tokens to spare then, so I told Codex to keep optimizing SGLang: run experiments yourself, report back when done, and I'll pick one of two directions for the next round.
That's how it started. Two or three days, over 200 experiment rounds, with very little intervention from me—mostly just "continue." I didn't even look at which branch it chose each round; a lot of the technical details were beyond me anyway.
The final numbers surprised even me. Prefill hit about 1,300 tps, well above the original vLLM baseline. Decode stabilized around 20 tps; for this BF16 30B MoE on this machine, memory bandwidth basically dictates that ceiling, so I never expected a breakthrough there. More importantly, prefill was now very close to the NVIDIA DGX Spark (GB10 plus 128GB unified memory, that $4,000 small chassis). Originally the gap between Strix Halo and Spark on this model was roughly 4x to 8x. Now prefill is basically neck and neck.
My colleagues' first reaction to the results was "That extreme? Are the outputs actually correct?" We ran another round of correctness verification, and the numbers checked out.
What actually shook me was the process. That 1,300 tps is just the surface. I have no technical background in AI infrastructure. Strix Halo wasn't even a machine I had planned to optimize; I picked it up because the other machine I was originally targeting went offline recently, and I had free time. After two or three days, the result could actually approach the level of hand-tuning by top researchers. Neither the time nor the cost was extravagant.
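The bandwidth argument for decode can be sketched as a one-line bound: if every generated token has to stream all active weights from memory once, throughput cannot exceed bandwidth divided by bytes per token. The peak bandwidth below is my assumption, not a figure from the experiments: a 256-bit LPDDR5X-8000 interface, which works out to 256 GB/s. The bound also ignores KV cache reads, attention, and routing overhead, so it is an idealized ceiling rather than a prediction.

```python
# Rough upper bound on single-stream decode throughput when weight reads dominate.
ACTIVE_PARAMS = 3.3e9          # active parameters per token, as quoted earlier
BYTES_PER_PARAM_BF16 = 2
# ASSUMPTION: 256-bit LPDDR5X-8000 -> 8000e6 transfers/s * 32 bytes = 256 GB/s peak.
PEAK_BANDWIDTH_BYTES_PER_S = 8000e6 * (256 // 8)


def decode_ceiling_tps(bandwidth: float, active_params: float, bytes_per_param: int) -> float:
    """Tokens/s if each token streams every active weight exactly once."""
    return bandwidth / (active_params * bytes_per_param)


ceiling = decode_ceiling_tps(PEAK_BANDWIDTH_BYTES_PER_S, ACTIVE_PARAMS, BYTES_PER_PARAM_BF16)
observed = 20.0  # approximate decode rate reported in the text
print(f"ceiling ~ {ceiling:.1f} tps, observed/ceiling ~ {observed / ceiling:.0%}")
```

Under these assumptions the idealized ceiling is a bit under 40 tps, so 20 tps sits at roughly half of theoretical peak. Since sustained bandwidth is typically well below the peak figure and the bound leaves out several costs, that is plausibly consistent with decode being bandwidth-bound, though the sketch alone does not prove it.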
Getting to this point, I'm seriously considering a question I would never have dared to think about before: whether to clean this up and submit a PR to SGLang. As a non-specialist, contributing code to a mainstream inference engine. Right now the work hasn't been split up clearly enough, and polishing it to maturity will take some effort. But the fact that this idea can even be seriously considered is itself pretty novel.
Parallelization Becomes Possible
I still think human-in-the-loop is necessary. I agree with a colleague on this: humans have been filtered by natural selection, with real survival pressure, so humans setting goals and making trade-offs is relatively reliable. Agents have no survival pressure. If they go way off track or define goals poorly, they don't feel that "I'm wasting my life" anxiety—they just keep going. Often there's no absolute right or wrong between A and B, just a trade-off, and that trade-off has to be made by a human.
So having agents run purely autonomously without humans, operating a complex project long-term, isn't feasible yet.
But with agents this capable, the amount of time required to be "in the loop" drops significantly. You don't need to watch over the whole process anymore—just show up at key nodes, make judgments, and give direction.
Moreover, all of these tasks have long stretches where human participation isn't possible. An SGLang benchmark experiment runs for ten-plus minutes to half an hour—I can't help at all. For that OpenClaude PR, waiting for CI to finish or for the maintainer's AI review to come back takes anywhere from hours to a full day. With 4.7 reviewing an old project for issues, it reads code and makes judgments on its own; I can't help even sitting right there.
These waiting windows used to be naturally wasted time. The change now is that if the waiting windows of several tasks can be staggered, I can use the freed-up time to push another task forward. While SGLang is running experiments, I look at OpenClaude's PR review; while OpenClaude is waiting on CI, 4.7 has just finished reading code and is waiting for my call. Several tasks move forward in interleaved fashion, and none of them seems particularly delayed.
Too much parallelization isn't good either—it creates cross-talk. But with two to three tasks, the progress speed of each suffers little loss. Focusing on one project for eight straight hours is actually inferior to spending eight hours rotating among three projects—the windows that truly require deep engagement add up to only two or three hours total; the rest of the time you would have been waiting anyway.
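The eight-hour comparison can be checked with a toy model. In the Python sketch below, each task is a series of rounds: a few minutes of human attention followed by an unattended agent or CI run. The numbers (5 minutes of attention, 15 minutes of waiting, 24 rounds) are made up to match the rough proportions described above, and the scheduler is a simple "serve whichever task became ready first" rule, not a claim about how anyone actually plans their day.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cycles: int      # number of (human step, agent wait) rounds
    human_min: int   # minutes of attention per round
    agent_min: int   # minutes the agent or CI runs unattended per round


def focused_minutes(tasks: list[Task]) -> int:
    """One task at a time; the human idles while the agent runs."""
    return sum(t.cycles * (t.human_min + t.agent_min) for t in tasks)


def interleaved_minutes(tasks: list[Task]) -> int:
    """One human rotates: always serve the task that became ready earliest."""
    ready_at = {t.name: 0 for t in tasks}        # when each task next needs the human
    remaining = {t.name: t.cycles for t in tasks}
    by_name = {t.name: t for t in tasks}
    clock = 0    # when the human is next free
    finish = 0   # when the last agent run completes
    while any(remaining.values()):
        name = min((n for n in remaining if remaining[n] > 0), key=lambda n: ready_at[n])
        task = by_name[name]
        start = max(clock, ready_at[name])       # wait if nothing is ready yet
        clock = start + task.human_min
        ready_at[name] = clock + task.agent_min
        remaining[name] -= 1
        finish = max(finish, ready_at[name])
    return finish


if __name__ == "__main__":
    # Illustrative numbers only: 24 rounds of 5 min attention + 15 min waiting.
    three = [Task(n, cycles=24, human_min=5, agent_min=15) for n in ("A", "B", "C")]
    print("one project, focused:", focused_minutes(three[:1]), "min")
    print("three projects, rotated:", interleaved_minutes(three), "min")
    print("three projects, one after another:", focused_minutes(three), "min")
```

With these made-up numbers, one project done with full focus takes 480 minutes, of which only 120 are real attention. Three rotated projects finish in 490 minutes, versus 1,440 if done back to back. The model is silent on context-switching cost, which is exactly the cross-talk that makes more than two or three tasks a bad idea.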
This logic has always existed between people. Someone on the team waiting for another person's review would just do something else in the meantime. Now "the other person" has been replaced by an agent; the waiting cycles have gotten shorter, but the logic of interleaving windows is the same.
So lately I've been adjusting an old default. People used to say "focus on one thing and do it well." I don't oppose that principle now, but its premise is that this one thing occupies most of your time. If an agent compresses the time you truly need to invest heavily in one thing to one-third, then all those curiosities and ideas you've been suppressing can actually be let out.
The Strix Halo effort itself is an example. Our team had discussed it many times, but no one ever seriously pursued it because the ROI assessment was mediocre—not worth it. It was only because I had a block of free time recently and picked it up out of curiosity that different results emerged.
Over the past year I considered "focusing on one thing" to be a kind of default virtue. I still think focus matters, but "only do one thing" is no longer its best expression. If several directions you're genuinely interested in can stay open simultaneously, total output might actually be higher.
The precondition is that you are genuinely interested in each one. Parallelizing three tasks you don't care about still won't be efficient—that's just burning time.
References
- Introducing Claude Opus 4.7. Anthropic. Released April 16, 2026; emphasizes literal instruction following, long-task consistency, and output self-verification.
- Introducing Claude Opus 4.6. Anthropic. Released February 5, 2026; just over two months (70 days) before 4.7.
- Claude Mythos Preview. red.anthropic.com. Available only to the Project Glasswing consortium (Amazon, Apple, Google, Microsoft, Nvidia, etc.); official acknowledgment that 4.7 is "less broadly capable."
- GPT-4. Wikipedia. Released March 14, 2023, about three and a half months after ChatGPT (GPT-3.5).
- OpenClaude. A community multi-model agent framework rebuilt from leaked Claude Code source; several PRs mentioned in this article have been merged into main.
- kvcache-ai/ktransformers. GitHub. CPU/GPU heterogeneous MoE inference framework by Tsinghua MADSys Lab and Approaching.AI.
- KTransformers @ SOSP'25. Original paper; prefill 4.62–19.74x, decode 1.25–4.09x speedup.
- Accelerating Hybrid Inference in SGLang with KTransformers CPU Kernels. LMSYS. Integration notes for SGLang and KTransformers.
- Qwen3-30B-A3B. Hugging Face. 30.5B total parameters, 3.3B active, 128 experts with top-8 selected; suitable for heterogeneous inference with large memory and small VRAM.
- AMD Ryzen AI Max+ 395. AMD. 16-core Zen 5, 128GB LPDDR5X-8000, Radeon 8060S iGPU.
- Strix Halo LLM Performance Tracking. llm-tracker.info. Community benchmark roundup; baseline status of vLLM/llama.cpp/SGLang on the 395.
- NVIDIA DGX Spark. NVIDIA. GB10 Grace Blackwell + 128GB unified memory; launched October 15, 2025.
