I've been using Codex's /goal for weeks, and my token consumption has climbed another notch. Claude Code added the feature in version 2.1.139 on May 12, shipping it straight to stable rather than as an experimental feature. I had a few tasks that Codex never quite managed to finish, so I moved them over to try.
The contrast was stark. Same paradigm, nearly identical loop, yet the two models produced completely different results.
I'm writing this partly to think it through, partly because it's worth sharing. /goal isn't so much a feature as a new way of working. The form looks identical, but when the model's personality differs, the practical reality is entirely different.
1. Codex: Heads Down, No Questions, Never Quits
Let me start with Codex as a baseline.
Codex CLI's /goal appeared as an experimental feature in 0.128.0. I've been using it since then and wrote about it previously. The real shift has been mental: I actually started believing that "letting the agent run" really works.
It doesn't interrupt you. When running /goal, Codex almost never calls subagents; it works inline unless I explicitly tell it to delegate. Compaction works better than I expected. After compressing, it picks up from the previous round without major information loss, and doesn't suddenly get dumber as it pushes forward. Most importantly, it's stubborn. It almost never tells me a goal is unachievable. Even when it hits a wall, it tries another angle, then another, until the token budget runs out. I've tested this repeatedly. I've left three independent /goal sessions running overnight; in the morning, most are still on track.
The context window is a genuine weak spot. Under GPT-5.5, Codex defaults to a 400K-token context window. OpenAI balanced pricing and throughput there, while the API offers the full 1M. Claude Code defaults to 1M. But even with only 400K, Codex runs remarkably stably under /goal.
2. Claude Code: Beautiful Opening, Then What?
On May 12, Anthropic dropped /goal, Agent View, /bg, /loop, and /batch all at once. My first thought was "finally." Codex had been iterating on this for several versions; Claude Code felt a bit slow to catch up.
I moved the tasks Codex couldn't crack over to Claude Code and started /goal.
It started strong. Claude Code immediately spun up subagents, laid out plans, and orchestrated context. It looked far more ambitious than Codex. My expectations immediately rose. With an opening like this, it should outperform Codex.
But as it ran, issues cropped up.
The first thing that made me frown: it kept popping up to ask me to make choices. This is usually one of Claude Code's likable traits. Faced with a judgment call, it doesn't just plow ahead. It stops to align with you, asking which of directions A, B, or C you prefer. And the questions are usually on point. But under /goal, this is a bug, not a feature. The whole point of /goal is "you set the goal, I run myself, don't interfere." The model should own every intermediate judgment. When it pops out with questions, those hours of freed-up time are immediately lost. If you step away, it just sits there waiting for you to come back.
More surprisingly, it proactively tells me it can't achieve the goal. Then it actually fails the goal. Sometimes after just a few dozen minutes. The reason is usually that the task seems too large for the session, or that there are fundamental blockers. When I tell it to continue, it reluctantly pushes forward a bit, then does it again.
Third: it gets dumber after compaction. A 1M context window sounds huge, but Anthropic themselves have admitted that performance degrades over long runs. The compression step is worse. After each compaction, Claude Code often seems to have forgotten everything that came before. The original plan, the pitfalls already encountered, and the original context all have to be pieced back together. Codex doesn't suffer from compaction nearly as badly.
These three issues combined make long-horizon tasks unstable in Claude Code's /goal.
3. It's Not Just My Impression
At first I thought it was my usage. Then I looked around and realized Opus 4.7's laziness was already common knowledge.
Opus 4.7 was released on April 16. Within 48 hours, a Reddit thread titled "Opus 4.7 is not an upgrade but a serious regression" got over 2,300 upvotes. AMD's AI director publicly complained that Claude Code had become "dumber and lazier." Screenshots were everywhere. Someone posted a conversation where Claude itself replied, "I was acting lazily."
Anthropic later published a postmortem, admitting that on April 16 they had added a "reduce verbosity" instruction to the system prompt. This instruction, along with a few other changes, dragged down coding quality. On April 20 they rolled it back. But my sense is that after the rollback, Opus 4.7's laziness only eased slightly and never fully recovered. I suspect RL training had already internalized the tendency, and that can't be fixed by tweaking a system prompt.
In extended continuous operation like /goal, this laziness gets amplified. A lazy model might get away with it on short tasks. Put it on a long task, and it will find all sorts of seemingly reasonable excuses to fail itself.
4. These Past Few Months, We've Been Doing the Same Thing
/goal didn't appear out of nowhere. It's the culmination of months of exploration.
Before the Lunar New Year, I was already tinkering with something similar. At the time, I was doing stability testing for AIMA (our model management platform). The core idea was to have AI simulate real users running tests repeatedly to improve stability. The most naive attempt was using the terminal's built-in task mechanism. I set up 10 tasks, each running for a long time.
This path died quickly. Each task was still in the same session, and models don't hold up well in long sessions. Within a few rounds, things destabilized, and no amount of prompt tuning could save it.
Next I looked at a two-layer architecture. At the time, Kilo Code was pushing a feature called Orchestrator Mode, previously known as Boomerang Tasks, inherited from Roo Code. The logic was sound: an outer orchestrator manages tasks, delegates each subtask to an independent subagent running in its own context, then collects the results.
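The two-layer shape described above can be sketched roughly as follows. This is a minimal illustration, not Kilo Code's or Roo Code's actual implementation: `Subtask`, `SubagentResult`, `orchestrate`, and `fake_subagent` are hypothetical names, and the subagent is a plain function standing in for a model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Subtask:
    name: str
    instructions: str


@dataclass
class SubagentResult:
    name: str
    summary: str


def orchestrate(
    subtasks: list[Subtask],
    run_subagent: Callable[[list[str]], str],
) -> list[SubagentResult]:
    """Delegate each subtask to a subagent that starts from an isolated context."""
    results: list[SubagentResult] = []
    for task in subtasks:
        # Each subagent gets a brand-new context holding only its own
        # instructions, none of the orchestrator's or its siblings' history.
        fresh_context = [task.instructions]
        summary = run_subagent(fresh_context)
        # Only the returned summary flows back up to the orchestrator.
        results.append(SubagentResult(task.name, summary))
    return results


def fake_subagent(context: list[str]) -> str:
    # Stand-in for a real model call (an assumption for this sketch).
    return f"done: {context[0]} (context size {len(context)})"


if __name__ == "__main__":
    tasks = [Subtask("tests", "write tests"), Subtask("docs", "update docs")]
    for result in orchestrate(tasks, fake_subagent):
        print(result.name, "->", result.summary)
```

The sketch only works if the outer layer actually calls `run_subagent` rather than doing the work itself, which is the step that depends on the model's own behavior.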
I tried a round with several cost-effective models available at the time. Zhipu performed slightly better, able to push through long tasks for a while. MiniMax was more comical. It started writing code at the orchestrator layer itself and never delegated, so the two-layer architecture simply failed on it. I thought about this for a while afterwards. It didn't seem like a harness adaptation issue. More likely, the model itself lacked the sense that it's the lead and should delegate.
In February, Claude Code shipped Agent Teams alongside Opus 4.6. It was experimental, requiring the CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS environment variable to enable. One session acts as team lead, dispatching other subagents to complete tasks in fresh context windows. This was essentially Kilo's architecture as an official implementation. I was genuinely impressed when I tried it. Long tasks could run for two or three hours without crashing.
But after one compaction, the team lead side fell apart. Previously dispatched subagents couldn't be found, so it would redeploy a new round, task lists got misaligned, and tokens burned fast. The two-layer architecture itself suffered information decay. Context shuttled back and forth between layers, losing a bit each time.
Then came Ralph, full name Ralph Wiggum. Australian developer Geoffrey Huntley built it at the end of 2025. The logic was so simple it was almost suspicious: a bash while-true loop, repeatedly feeding the same prompt file to an agent until the goal is achieved. I tried to test its tmux version at the time, hit some snags, and shelved it.
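The loop is small enough to sketch in a few lines. Huntley's original is a bash loop; what follows is a Python rendering of the same idea, not his script. The names `ralph_loop`, `run_agent`, and `is_done` are hypothetical, and the `max_iterations` cap is a safety addition for this sketch.

```python
from typing import Callable


def ralph_loop(
    prompt: str,
    run_agent: Callable[[str], str],
    is_done: Callable[[str], bool],
    max_iterations: int = 100,
) -> int:
    """Feed the same prompt to the agent until is_done reports success.

    Returns the number of iterations used; raises if the cap is reached.
    """
    for iteration in range(1, max_iterations + 1):
        # Every pass sends the identical prompt text.
        output = run_agent(prompt)
        if is_done(output):
            return iteration
    raise RuntimeError(f"goal not reached after {max_iterations} iterations")


if __name__ == "__main__":
    # Fake agent (an assumption): pretends to finish on its third run.
    calls = {"n": 0}

    def fake_agent(prompt: str) -> str:
        calls["n"] += 1
        return "DONE" if calls["n"] >= 3 else "still working"

    used = ralph_loop("contents of the prompt file", fake_agent, lambda out: "DONE" in out)
    print(f"finished after {used} iterations")
```

In real use, `run_agent` would shell out to an agent CLI and `is_done` would check something external such as a passing test suite; both are left abstract here.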
Ralph caught on extremely fast. It's the most direct inspiration for the /goal product line. Today, Anthropic has absorbed Ralph as an official Claude Code plugin, parked under plugins/ralph-wiggum/ in the repo. Kilo Code's Orchestrator Mode, conversely, has been officially marked deprecated. The reason given: "the main agent can now delegate directly to subagents, so a dedicated orchestrator is no longer needed."
Hand-rolled terminal tasks, to Kilo Orchestrator, to Claude Code Agent Teams, to Ralph going viral, to Codex shipping /goal, to Claude Code shipping /goal, to Ralph being absorbed and Kilo Orchestrator deprecated. The evolutionary thread of these past few months is clear.
5. Codex Dives In; Claude Code Keeps Looking Up
Back to the models themselves.
After running both, I have a fairly solid judgment. Codex is the "local" faction. Claude Code is the "global" faction.
Math has a concept called local optima. The optimization space is like a valley. Starting from one point and walking downhill, you might end up in a local minimum, but over the next ridge there's a deeper valley. I've watched Codex fall into these local optima repeatedly during /goal. It polishes one direction, does this and that, circles back, thinks it's moving forward, but is actually treading water. Its heads-down approach is usually a strength. In these moments it becomes a weakness.
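A toy example makes the local-optimum picture concrete. This says nothing about how Codex searches internally; it is just plain gradient descent on an arbitrary function with two valleys, chosen for illustration. The names `f`, `grad_f`, and `descend` are hypothetical.

```python
def f(x: float) -> float:
    # Arbitrary toy landscape with two valleys (chosen for this sketch).
    return x**4 - 3 * x**2 + x


def grad_f(x: float) -> float:
    # Derivative of f with respect to x.
    return 4 * x**3 - 6 * x + 1


def descend(x: float, learning_rate: float = 0.01, steps: int = 2000) -> float:
    """Plain gradient descent: always step downhill from the current point."""
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x


if __name__ == "__main__":
    for start in (2.0, -2.0):
        x = descend(start)
        print(f"start={start:+.1f} -> x={x:.3f}, f(x)={f(x):.3f}")
```

Starting from 2.0, the walk settles in the shallower valley near x ≈ 1.13. Starting from -2.0, it reaches the deeper one near x ≈ -1.30. Every step in the first run is a genuine improvement, yet it never crosses the ridge between the two valleys.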
Claude Code is different. It performs large-span reflection and validation, proactively asking whether its current direction is right. I've repeatedly seen it jump out of what looked like a converging direction, saying "wait, the root of this problem might not be here, I need to reconsider," and then actually find a better path.
This global view is Claude Code's strength. For complex tasks lasting one to two hours and requiring judgment, reflection, and cross-module coordination, I still think Claude Code outperforms Codex.
But this global view doesn't buy endurance. It can't run long under /goal, and can't deliver stable 24-hour unattended output. An imperfect analogy: Codex is an intern who can grind for 12 hours straight, occasionally drifting off course. Claude Code is a senior engineer with good judgment, but he needs to check in every 40 minutes, or decides after 30 minutes that this is too hard and he's out. Which is better suited for /goal? The answer is obvious.
6. Form Converges, Training Diverges
After running this comparison, I have an additional read on where coding agents are heading in the coming months. Harnesses are rapidly converging, but model personality differences will become increasingly prominent.
Boris Cherny (creator of Claude Code) has been saying that in the future, a harness might be just 100 lines of code. I believe this even more now. Once the /goal paradigm converges, the outer structure of coding agents will get thinner and thinner. A loop, a set of tools, a goal. That's enough.
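To show how thin "a loop, a set of tools, a goal" can be, here is a minimal sketch. It is not the design of Claude Code, Codex, or any real harness: `run_harness`, the action tuple format, and the scripted model are all assumptions made up for illustration.

```python
from typing import Callable

# A "model" is any callable mapping (goal, history) to the next action.
# An action is ("tool", tool_name, argument) or ("finish", answer).
Action = tuple
Model = Callable[[str, list[str]], Action]
Tools = dict[str, Callable[[str], str]]


def run_harness(goal: str, model: Model, tools: Tools, max_steps: int = 50) -> str:
    """Loop: ask the model for an action, run the tool, record the result."""
    history: list[str] = []
    for _ in range(max_steps):
        action = model(goal, history)
        if action[0] == "finish":
            return action[1]
        _, tool_name, argument = action
        observation = tools[tool_name](argument)
        history.append(f"{tool_name}({argument!r}) -> {observation}")
    raise RuntimeError("step budget exhausted before the goal was reached")


def scripted_model(goal: str, history: list[str]) -> Action:
    # Stand-in for a real model (an assumption): one tool call, then finish.
    if not history:
        return ("tool", "upper", goal)
    return ("finish", history[-1])


if __name__ == "__main__":
    print(run_harness("hello", scripted_model, {"upper": str.upper}))
```

Everything interesting happens inside `model`: whether it keeps calling tools, stops to ask, or returns "finish" early is not something the loop itself controls.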
What will truly determine the gap is the model's personality within this loop. Whether it's willing to put its head down and work. Whether it keeps popping out to align with humans. Whether its state survives compaction. Whether it can jump out when stuck in a wrong direction. When it hits a wall, does it try again, or say it can't do the goal and bail?
None of these can be fixed with prompting. They're set during training.
OpenAI and Anthropic have already trained distinctly different model personalities for long-horizon tasks. Codex seems to have been trained into "never give up, hit the wall and try again." Claude Code seems trained to "report frequently, align frequently, reflect frequently." That's endearing in interactive scenarios, but fatal under /goal.
In the short term, this divergence is hard to bridge. Even after Anthropic rolled back that verbosity system prompt, Opus 4.7's laziness only eased. It didn't fully recover. RL internalized it. You can't fix that by changing outer prompts.
7. Choosing an Agent Is Increasingly Like Choosing a Partner
At this point, the way I use /goal has changed.
I no longer start by asking which tool is stronger. Instead, I ask: which model's personality fits this task?
For iterations lasting over six hours, with a clear goal and low trial-and-error cost, I just fire up Codex /goal. For architectural judgment, cross-module decisions, possible mid-course direction changes, I use Claude Code /goal, but I check back every 30 to 60 minutes, mentally prepared for it to pop out with questions. For truly unattended 24-hour runs, it has to be Codex, and the task direction needs to be clearly nailed down upfront. If it's just a single hard problem requiring global vision, I actually don't use /goal at all. I use Claude Code in normal mode and knock it out in 30 minutes.
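Written down as a decision rule, my current habits look roughly like this. It is a sketch of the heuristics above, not a general recommendation: the `Task` fields, the `pick_agent` name, and the order in which the rules are checked are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    hours: float               # expected run time
    clear_goal: bool           # the goal is well specified
    low_retry_cost: bool       # trial and error is cheap
    needs_judgment: bool       # architectural or cross-module decisions
    unattended: bool           # nobody will check in during the run
    single_hard_problem: bool  # one hard problem needing global vision


def pick_agent(task: Task) -> str:
    """Map a task profile to the tool and mode described in the text."""
    if task.single_hard_problem:
        return "Claude Code, normal mode (no /goal)"
    if task.unattended:
        return "Codex /goal, with the direction pinned down upfront"
    if task.needs_judgment:
        return "Claude Code /goal, checking back every 30-60 minutes"
    if task.hours > 6 and task.clear_goal and task.low_retry_cost:
        return "Codex /goal"
    return "no rule stated; decide case by case"


if __name__ == "__main__":
    overnight = Task(hours=24, clear_goal=True, low_retry_cost=True,
                     needs_judgment=False, unattended=True,
                     single_hard_problem=False)
    print(pick_agent(overnight))
```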
A few months ago, choosing an agent meant choosing UI, community, pricing. Now it's more about choosing a model personality.
Next-generation models, whether from Anthropic or OpenAI, will almost certainly train toward fixing the other side's weakness. Codex will try to add global vision; Claude Code will try to add endurance. In the short term, this personality divergence remains real, and it significantly affects how much value you can extract from /goal.
The biggest effect of /goal is that it amplifies a model's true personality into 24 hours of continuous output. The one with the steadier personality wins this round.
Right now, Codex leads by half a step. But only half a step.
References
- Claude Code 2.1.139 adds /goal command (explainx.ai): Claude Code /goal launch notes, May 12, 2026
- Claude Code Agent View, Goal Command, and Background Sessions Update (Geeky Gadgets): Detailed overview of Claude Code 2.1 features
- Inventing the Ralph Wiggum Loop (Dev Interrupted): Geoffrey Huntley on inventing Ralph
- Ralph Wiggum official Claude Code plugin (GitHub): Anthropic has absorbed Ralph as an official plugin
- Kilo Code Orchestrator Mode (Deprecated): Current status of Kilo Code Orchestrator
- Orchestrate teams of Claude Code sessions (Claude Code Docs): Agent Teams official documentation
- Claude Code experimental agent teams (DeepakNess): Agent Teams release notes alongside Opus 4.6
- Claude Opus 4.7 Regression Explained (buildfastwithai): Opus 4.7 regression and community feedback
- Opus 4.7 isn't dumb, it's just lazy (Shimin Zhang): Analysis of Opus 4.7's laziness issue
- An update on recent Claude Code quality reports (Anthropic Engineering): Anthropic official postmortem on rolling back the verbosity system prompt
- GPT-5.5 Codex 400K context window (GitHub Issue): Codex 400K context window limit explained
- Boris Cherny on Claude Code's future (Pragmatic Engineer): The "100 lines of code" prediction
