I spent two days automating with the Codex app, burned through a Pro account, and made virtually no progress. The moment I switched to Codex CLI's /goal feature, everything clicked.
It felt counterintuitive at first. Same model, same task—how could swapping the shell make such a huge difference? Two days later, I figured it out: it's not the app's fault. The agent form factor is still in transition, and the app is making a "mass-market" push that runs ahead of where the models actually are.
What surprised me more than the burned quota was something else entirely: mindset.
1. A Puzzle
There's something I haven't been able to figure out lately, so I'm putting it out here for discussion.
The same coding agent behaves so differently in the terminal versus the official app that they barely seem like the same entity. It runs beautifully in the terminal, but noticeably dumbs down when moved to the app.
Theoretically, there shouldn't be such a gap. An app is just a shell: wrap a framework, change the visuals, and behavior should stay consistent. But power users at the front lines have already voted with their feet, and the vast majority still live in the terminal. The app has been promoted for a long time now without ever truly taking off.
After testing this myself, I found it's probably more complicated than it looks.
2. Codex App "Automations": Running Through Every Pitfall in Two Days
It started with that post-GPT-5.5 feeling of "this model is already strong enough."
5.5 pushed Codex's end-to-end execution capability up a notch. In long-horizon tasks, it spontaneously chooses to "verify first, then advance." My sense is that one loop can sustain roughly 30 minutes before it naturally pauses at a milestone. After watching it for days, I realized that in the terminal I was basically just saying "continue" on repeat. Its next-step judgment was already reliable enough.
So if that's the case, why not make it highly automated? After all, it was just "continue."
I remembered that the Codex app previously had a Routines feature—perfect. I downloaded it, only to find Routines were gone, renamed to "Automations." I read the docs; the capabilities looked similar, so I got started.
The first pitfall was the trigger mode. It supports fixed-time execution or interval triggers. I set a custom interval to fire every 30 minutes, letting it run continuously around a single goal for two days.
I initially wanted to simulate the terminal experience: the same session repeatedly "continuing," checking the results every few iterations, pulling it back if it drifted off course. In the app, this maps to thread automation—hanging a heartbeat-style timed wake-up on the current session.
It sounded reasonable, but testing revealed a hidden limitation: within a single session, an automation can only trigger successfully once. After the first run, the system swallows subsequent heartbeats, presumably to prevent infinite loops. That made continuous iteration impossible.
Stepping back, I switched to standalone automation: spawning a new session each time, triggered on a schedule. This could run, but at two costs.
First, price. A new session carries no historical context, so previous state has to be handed off as documents, and with no cache hits the new session crawls and re-reads all the relevant documents from scratch. Under OpenAI's prompt caching mechanism, cached input tokens cost roughly one-tenth as much as uncached ones, so paying full price for every token of re-read context inflates the input bill by close to 10×. The math matches my experience: running three independent automations simultaneously on one Pro account burned through the quota in about two days.
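A rough back-of-the-envelope sketch of the effect (the per-token price, context size, and trigger frequency below are illustrative assumptions, not OpenAI's actual numbers):

    # Back-of-the-envelope estimate of how losing prompt-cache hits inflates cost.
    # All constants are illustrative assumptions; check the current OpenAI price list.
    PRICE_INPUT = 1.25 / 1_000_000        # $ per uncached input token (assumed)
    PRICE_CACHED = PRICE_INPUT / 10       # cached input billed at roughly one-tenth

    CONTEXT_TOKENS = 200_000              # context re-read at the start of each run (assumed)
    RUNS_PER_DAY = 48                     # one automation firing every 30 minutes

    def daily_input_cost(cache_hit_ratio: float) -> float:
        cached = CONTEXT_TOKENS * cache_hit_ratio
        uncached = CONTEXT_TOKENS - cached
        return (cached * PRICE_CACHED + uncached * PRICE_INPUT) * RUNS_PER_DAY

    print(f"warm session, ~90% cache hits:    ${daily_input_cost(0.9):.2f}/day")
    print(f"fresh session every run, 0% hits: ${daily_input_cost(0.0):.2f}/day")

The absolute numbers matter less than the shape: the gap scales directly with how much context each fresh session has to re-read at full price.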
Second, performance. I narrowed the automation scope so it would only attempt small, concrete problems at night. Checking in the next morning: eight hours of runtime, virtually no progress.
This fell far short of what I expect from this model in the terminal. Four hours of my repeatedly saying "continue" in the terminal consistently produced better output than these eight hours of unsupervised runtime. At first I suspected the model itself just wasn't that strong; I later rejected that hypothesis. The app path is the problem.
3. Codex CLI's Goal: The First Respectable Long-Distance Run in the Terminal
While researching, I noticed that Codex CLI had added an experimental feature called /goal in version 0.128.0.
Its design logic is straightforward: you give it a goal (not a single prompt), and it continuously loops through plan → act → test → review → iterate until the goal is achieved or the budget runs out. State persists across sessions; it can be paused, resumed, or cleared. This is essentially the exact thing I had been struggling to simulate with app automations.
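To make that shape concrete, here is my own minimal sketch of a goal loop with state persisted to disk. It is not Codex's actual implementation; call_model, execute_step, and run_tests are hypothetical stand-ins:

    # Minimal sketch of a goal-driven loop whose state survives across sessions.
    # call_model, execute_step, and run_tests are hypothetical stand-ins, not Codex internals.
    import json
    from pathlib import Path

    STATE_FILE = Path(".goal_state.json")

    def call_model(prompt: str, state: dict) -> dict:
        # Stand-in for a real model call; here it declares the goal met after three steps.
        return {"action": f"work on: {state['goal']}", "goal_met": len(state["history"]) >= 3}

    def execute_step(step: dict) -> str:
        return f"executed {step['action']}"        # stand-in for tool calls and file edits

    def run_tests() -> bool:
        return True                                # stand-in for the project's test suite

    def run_goal(goal: str, budget: int = 50) -> None:
        # Resume from disk if a previous session was paused, otherwise start fresh.
        state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() \
            else {"goal": goal, "history": []}
        for _ in range(budget):                    # stop when the step budget runs out
            step = call_model("plan the next step", state)       # plan
            result = execute_step(step)                          # act
            passed = run_tests()                                 # test
            state["history"].append({"step": step, "result": result, "tests": passed})
            STATE_FILE.write_text(json.dumps(state))             # persist: pause / resume / clear
            if step["goal_met"] and passed:                      # review: stop once the goal is met
                return

    run_goal("make the integration tests pass")

The point of the sketch is that the goal and its history live in a file rather than in a chat thread, which is what allows a run to be paused, resumed, or cleared.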
I was setting up a new machine anyway, so I enabled this feature to try it out. My conclusion: the feel is close to what I expected. Long-horizon execution holds up; most of the time it pushes forward around a goal on its own without deviating too wildly. This is the first time I've seen a truly respectable "long-distance" form factor in the terminal.
This is where it gets awkward: the app's polished interface, visual diffs, and chat bubbles end up being less effective than a plain terminal loop.
Why?
Honestly, I haven't fully figured it out. The app isn't open source, so I can't see how it's actually implemented behind the scenes. But one observation might carry some weight: the products actually pulling off long-horizon tasks (Codex CLI's /goal, Boris's own Claude Code setup, community projects like OpenCode) all converge on the same form. A loop running in the background: model, tool calls, context appended, over and over. State is persisted via files, git, or worktrees; the UI is just an observation window.
Meanwhile, the app experience, designed around a chat dialog, becomes a drag in long-distance scenarios. It treats every interaction as an independent "Q&A," breaking the shared context that should persist, altering the prefix that should stay cached, and artificially fragmenting the "next step" that the model should judge for itself into multiple separate events.
4. Letting Go for 24 Hours: A Shift in Mindset
Returning to that two-day experiment—setting aside cost and performance—I noticed something completely unexpected. Mindset.
Before, when using an agent, no matter how automated it was, your mind was still tethered to it. You'd constantly wait for the next turn, watch what it was doing, judge whether to intervene. This burden isn't in the hours worked; it's in your awareness. You haven't truly "handed it off"; you've only delegated execution, while decision-making remains with you.
During those 24 hours when it was running on three branches by itself, I felt "no need to watch" for the first time. I knew it was working, knew it had the capability, so I let it run. That day, I did other things. Ate dinner with family, read some philosophy, watched a movie.
The results fell short of expectations; on paper, it was a loss. But that shift in mindset felt like a real inflection point.
I hadn't realized before just how heavy the cognitive tax of "keeping an eye on the agent" actually is. You think you're not spending time; true, you only check in for a few minutes here and there. But your attention remains anchored to it. Once letting go becomes viable, that entire block of time truly belongs to you again.
This clarified for me the evolutionary stages of agents.
Stage One: Human-centric. Chatbots, code completion. AI is a productivity tool; humans must pay attention to every detail.
Stage Two: Human-directed agent. This is the mainstream form today. Let the agent run wild and it'll mess up, so you have to guide it through the process, babysitting it as it works. Here's the counterintuitive part: this stage is actually more exhausting than doing the work manually. Your output is much higher, but the physical and mental toll is heavier. My family sees the change in my state more clearly than anyone; she says I'm now effectively on call 24/7.
Stage Three: Agent autonomy. Humans only define goals and occasionally supervise; the agent handles the rest. I think 5.5 is the first model that makes this stage look genuinely possible.
Stage Two is a transitional state. Its defining feature is "expectations exceed capabilities." You want the agent to work independently like an employee, but the model is still one step short, so you have to fill the gap. By Stage Three, that filling-in is no longer needed.
5. Model + Goal + Loop
Boris Cherny (creator of Claude Code) has been repeating a prediction lately: within a year, Claude Code might be down to just 100 lines of code.
His logic goes like this: most of what's currently in the harness—permission gates, context management, tool routing, prompt-injection guards, human-review hooks—exists because "the model isn't smart enough yet." Once the model can make the right judgments on its own, all of that becomes baggage. What remains as the truly necessary core is just one loop:
    while (model returns tool calls):
        execute tool → capture result → append to context → call model again
This is the shared foundation underlying all frameworks: Claude Code, Codex, Cursor agent, Manus. One loop, plus a goal and tool calls.
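As a toy illustration of that claim (a sketch only, not any vendor's actual harness; call_model and the single shell tool are made-up stand-ins), the core really does fit in a couple dozen lines:

    # Toy version of the core agent loop: call the model, run whatever tools it asks for,
    # append the results to the context, and call the model again until it stops asking.
    import subprocess

    def shell(cmd: str) -> str:
        return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

    TOOLS = {"shell": shell}

    def call_model(context: list[dict]) -> dict:
        # Stand-in for a real model call; it asks for one shell command, then stops.
        done = any("result" in item for item in context)
        return {"tool_calls": [] if done else [{"tool": "shell", "args": "echo hello"}]}

    def agent_loop(goal: str) -> list[dict]:
        context = [{"goal": goal}]
        response = call_model(context)
        while response["tool_calls"]:                              # model still wants tools
            for call in response["tool_calls"]:
                result = TOOLS[call["tool"]](call["args"])         # execute tool
                context.append({"call": call, "result": result})   # capture result, append to context
            response = call_model(context)                         # call model again
        return context

    print(agent_loop("say hello"))

Everything else in today's harnesses, from permission gates to review hooks, is scaffolding layered around this loop.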
This judgment aligns with my own intuition. Features like /goal work because the model itself is already stable enough on "what should I do next." You no longer need external constraints forcing it to do A before B; it plans on its own, judges on its own whether to step back and verify.
Once the form factor solidifies, the next evolutionary direction becomes clear: one person overseeing multiple agents, each running in a loop around its own goal, potentially even collaborating with each other. It's an agent team, not an agent copilot.
6. The Token Logic Flips
There's an interesting side effect here: token consumption will decouple from headcount.
Most paying users currently suffer from "token anxiety." Quotas are settled weekly or monthly, so you feel compelled to squeeze every drop out of them, and guilty when you don't. The underlying logic is similar to hiring people: you're buying time, so you squeeze them 24/7.
But this mechanism is actually quite distorted relative to real value creation. Not every task needs to run 24 hours a day; in fact, most don't.
However, in Stage One the bottleneck was "capability"—the model wasn't strong enough to independently accomplish many things. In Stage Two the bottleneck was "human"—you only have so much attention, and juggling three agents is your ceiling. By Stage Three, the bottleneck shifts to "goals": how many valuable directions does one person actually have that are worth pushing an agent toward?
Once this holds true, token consumption has nothing to do with user count. Some people have never used an agent at all; others are already spending over a thousand dollars a month on agents (I'm approaching that range myself). This divergence will only intensify. One person with 100 agents running 24/7 for them is entirely plausible.
The near-term phenomenon: token demand will explode further, but the shape of that demand may be completely uncoupled from user growth. A small number of people in heavy-use scenarios will consume the vast majority of compute.
7. What to Do with the Liberated Time
This brings me back to that mindset shift I mentioned at the beginning.
Industrialization has put humans in a contradictory position: managers expect employees to work around the clock, act autonomously, and still be supervised in real time, demands that are inherently at odds with one another. And because wages are paid monthly, managers push that contradiction to its absolute limit.
Agents have no wage contradiction; they're billed by the token. In theory, you can let them truly run 24/7. The prerequisite is that they can produce without supervision, and I think that prerequisite is beginning to hold true post-5.5.
Once it does, what does it mean for humans?
My family understood this better than I did: humans shouldn't be defined as production units. Those two days of letting the agent run unsupervised prompted her to say it directly: that kind of tense state you live in is unsustainable from a health perspective. She was right. The fact that Stage Two is "more exhausting than manual work" isn't a complaint; it's actually happening. You're constantly searching for ways to collaborate with the agent, and that exploratory cost is real on a mental level.
If Stage Three truly materializes, what can people do with the freed-up time? Express, connect, pursue things they're interested in. A simple answer, but the more I think about it, the more it feels like the direction actually worth taking. Studying philosophy, watching films, returning to family—whatever adds value on the dimension of being human. These are things agents can't replace.
8. How We Evaluate People Is Changing
Finally, a more macro-level observation.
The entire value system is being reshaped. The old way of measuring someone's output was by their hours worked, lines of code, or PR count. These metrics will become increasingly hollow in the agent era. Boris alone filed 150 PRs in a single day—how do you evaluate him by PR count?
What will truly differentiate people in the future is leadership and goal orientation. Can you define high-quality goals? Can you identify genuinely valuable directions? Can you get a team of agents running toward the same North Star?
These qualities mattered in "human-human collaboration" too, but you could still fall back on organizational processes, KPIs, and incentives. In "human-agent collaboration," that fallback disappears. A poorly defined goal is direct waste—100 dollars burned overnight on a direction that drifts off course.
Letting go is hard. But some things only move forward once you let go.
References
- Codex CLI Features: /goal feature in version 0.128.0
- Codex App Automations docs (thread / standalone / project automations)
- Deep Dive into Codex /goal: OpenAI's Built-In Ralph Loop
- Boris Cherny on the Future of Claude Code: The Harness Gets Thinner
- Boris Cherny's Public Claude Code Workflow
- GPT-5.5 Scores 82.7% on Terminal-Bench 2.0 (Official OpenAI Announcement)
- GPT-5.5 Long-Horizon Performance: Codex Internal Session Ran 7 Hours
- OpenAI Prompt Caching Mechanism and Pricing
- OpenCode: Open-Source Terminal Coding Agent
- Claude Code App vs. CLI: Discussion on Experiential Differences
