The density of model releases this April was a bit insane. Opus 4.7 dropped on April 16, the same day Codex CLI launched Goal Mode. On April 23, GPT-5.5 (codename "Spud") went live, followed on May 5 by GPT-5.5 Instant replacing ChatGPT's default model. In less than a month, every model I regularly use got refreshed.
By all rights, the experience should feel great. But after a week, I had two realizations that didn't match my initial expectations.
1. Models Are Getting Faster, But Human Expectations Are Outpacing Them
Models have been shipping fast this year. According to Epoch AI data, Anthropic's median release interval dropped from 168 days in 2024 to 71.5 days in 2026. OpenAI is basically on a monthly cadence, and the floor for the entire frontier has been pushed down to four to six weeks. Even so, that "why is this so dumb?" feeling still creeps in.
Reflecting on it, this has less to do with the models themselves and more to do with the human expectation curve.
As a kid, for example, I took dance lessons for a while. The happiest phase was right at the start: I couldn't really dance at all, but posing in front of the mirror felt cool. The hardest part was the middle stretch, when I'd just learned some basics and started watching high-level videos; my taste improved far faster than my ability, and looking in the mirror became increasingly frustrating. Boxing was the same: before trying it, you think you punch like Tyson; after a few sessions, you realize you can't even dodge properly.
Using AI is an extreme version of this curve.
Everyone remembers the leap when ChatGPT first came out, but people don't stop at that leap. When it's free, you want world-class consulting. Pay a little, and you expect it to solve a world-class problem in 80 minutes. A capability you wouldn't have dared imagine, once placed in front of you, drives expectations to rise far faster than actual capability improves. Models get stronger every six weeks; expectations might jump every six days.
I think this is the fundamental reason many people can't stick with coding agents. It's not that the tool is bad—it's that users haven't learned to adjust their expectations.
2. Collaborating with It Is Harder Than I Thought
My journey with agents has gone through several mental phases.
Phase one: low trust, watching every step to see what it could and couldn't do. Phase two: the opposite—realizing it seemed better than me at many things, so I might as well let it run loose. Phase three: I set an unattended task running on the Codex desktop app for two days, burned through an entire week's worth of tokens on one account, and ended up with a pile of detours and wrong turns—output that was worse than what I could have done in two hours myself.
That knocked me back a bit. I started working more closely with it again, while also letting it run unattended at night so I could compare the two modes. The prompts weren't complicated: the same lightweight directive style I use during the day, with ample context.
Three problems became glaringly obvious in automated scenarios.
Problem One: Giving Up Too Early
This was the first to surface.
Claude's behavior was the most dramatic, with very anthropomorphic wording. Whenever it hit a slightly complex step, it would say something like, "This session has gone on long enough tonight; I suggest wrapping up here, getting some sleep, and starting a new session tomorrow morning." The reason given was usually, "The next part might burn a lot of tokens, so you should be careful."
I later added a direct line to the prompt: "We are not sleeping tonight. Do not mention sleep again."
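If you script the overnight run yourself, a standing rule like that can be baked into the prompt assembly rather than typed reactively. Here is a minimal sketch of what I mean; the helper function, the wording of the rules, and the file name are hypothetical illustrations, not a feature of any particular CLI:

```python
# Sketch: bake an anti-wrap-up rule into an unattended run's prompt.
# The function, rule wording, and file paths below are hypothetical.

PERSISTENCE_RULES = """\
We are not sleeping tonight. Do not suggest wrapping up or starting a new
session tomorrow. If a step looks expensive, state the estimated cost and
continue unless it would exceed the stated budget."""

def build_overnight_prompt(task_brief: str, context_notes: str) -> str:
    """Assemble the prompt for an unattended run: task, context, then the
    standing rules, so they are the last thing the model reads."""
    return "\n\n".join([task_brief, context_notes, PERSISTENCE_RULES])

if __name__ == "__main__":
    prompt = build_overnight_prompt(
        task_brief="Investigate the flaky integration tests and propose fixes.",
        context_notes="Prior findings are summarized in notes/overnight.md.",
    )
    print(prompt)
```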
That wrap-up behavior isn't the model being lazy; it reads more like an explicit tightening of training objectives. Anthropic's own 4.7 release docs put it bluntly: the model now adheres more strictly to effort levels and will "limit the scope of work to what was asked, rather than proactively overstepping boundaries." For most everyday requests, that makes sense. In automated tasks, it backfires, because much of the valuable exploration requires stepping slightly beyond the boundaries of what was asked.
GPT-5.5 isn't as dramatic as Claude; its premature abandonment is more subtle. From what I observed, its strategy is roughly: do a shallow probe first to confirm whether a direction has immediately visible opportunity, then decide whether to go deeper. The strategy itself is fine; the problem lies in the second half. Once the probe returns a slightly negative signal, it quickly shuts down the entire direction and jumps to the next one, almost never circling back to question whether the probing method itself was flawed.
Problem Two: High-Potential Directions Get Cut Off Indiscriminately
The second problem extends from the first, but it's more regrettable.
When humans make research judgments, one of the most valuable abilities is recognizing that a direction holds something real, even though it will take serious digging to reach it. These directions start out highly uncertain and require sustained investment, but the payoff when you're right is massive.
Current agents, trained for token efficiency, are especially unfriendly to this type of direction. They make a "no-go" call off a very shallow probe and then permanently close the direction. Every subsequent pass searches a space that already excludes it, so the agent ends up bypassing the hardest, most worthwhile problems and settling on easy but low-value local optima.
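To make the cost of that pruning concrete, here is a toy simulation. It is entirely my own construction, not a description of how any of these models actually decide: deep, high-value directions reveal only a sliver of their value to a shallow probe, so a policy that permanently drops a direction after one weak probe captures far less total value than one that allows even a single revisit.

```python
import random

# Toy model (my own construction): each direction has a true value and a
# "depth". A shallow probe only reveals value/depth plus noise, so deep,
# high-value directions tend to look unpromising on the first pass.
# Policy "once" drops a direction forever after one weak probe;
# policy "retry" allows a single second probe before closing it off.

random.seed(7)

def make_directions(n=100):
    dirs = []
    for i in range(n):
        if i % 10 == 0:
            dirs.append({"value": 10.0, "depth": 10.0, "noise": 1.0})  # deep, high value
        else:
            dirs.append({"value": 1.0, "depth": 1.0, "noise": 0.3})    # shallow, low value
    return dirs

def shallow_probe(d):
    return d["value"] / d["depth"] + random.gauss(0, d["noise"])

def run(policy, dirs, threshold=1.5):
    """Total true value of the directions that end up being pursued."""
    captured = 0.0
    for d in dirs:
        if shallow_probe(d) >= threshold:
            captured += d["value"]
        elif policy == "retry" and shallow_probe(d) >= threshold:
            captured += d["value"]
    return captured

def average(policy, trials=2000):
    return sum(run(policy, make_directions()) for _ in range(trials)) / trials

print("drop after one weak probe:", round(average("once"), 1))
print("allow one more probe:     ", round(average("retry"), 1))
```

The numbers are made up; the shape of the result is the point. The one-shot policy loses most of its value on exactly the directions that were worth the extra tokens.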
It made me appreciate, all over again, the value of human intuition about direction.
The value isn't that I know more than the AI; quite the opposite, in many domains I know far less than it does. But layered on top of its substantial domain knowledge, I can read the scattered signals and tell it, "This direction is worth digging deeper into, even if you don't see results yet." Judgments like that increase total token usage, but they produce real, tangible differences in the output.
Problem Three: Cross-Session Strategic Inconsistency
The third problem is the most headache-inducing.
When I'm using it manually, context is continuous—a session lasts several hours, and I can constantly fine-tune direction. But in automation, the agent repeatedly starts new sessions, reading context from external documents each time. In theory, as long as the documents are complete, a new session should align with the old one.
In practice, it doesn't work that way.
With the same context, different sessions can produce dramatically divergent directional judgments. One session enthusiastically concludes that direction A is worth deep exploration; the next session, looking at the same notes, might set A aside and go probe B; the session after that jumps to C. From the outside, the work looks haphazard—jumping randomly from one thing to another—with almost no chance of coalescing into real results.
This is what the industry has taken to calling context rot: context "rots" across sessions, the model finds it increasingly hard to stay consistent with earlier decisions, and each new session drifts toward local optimization. Over the past couple of years a wave of tool companies focused on "persistent context layers" (Augment, Hindsight, OneContext, and the like) has emerged to tackle this. But tools only solve the problem of carrying context over more completely; carrying it over doesn't resolve the core issue, which is that the model's judgment differs each time it reads the same material.
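The part a human can own is making earlier directional decisions explicit instead of leaving them implied in raw notes. Here is a minimal sketch of the kind of direction log I mean; the file name, format, and wording are my own invention, not any of those products' interfaces:

```python
import json
from pathlib import Path

# Sketch of a persistent "direction log" (file name and format are invented).
# Each session appends the directional decisions it commits to; every new
# session receives those decisions as explicit constraints, so it continues
# the plan instead of re-deriving a new one from raw notes.

LOG_PATH = Path("direction_log.jsonl")

def record_decision(direction: str, status: str, rationale: str) -> None:
    """Append one decision, e.g. status = 'pursue', 'paused', or 'closed'."""
    entry = {"direction": direction, "status": status, "rationale": rationale}
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

def session_preamble() -> str:
    """Render prior decisions as hard constraints for the next session's prompt."""
    if not LOG_PATH.exists():
        return "No prior directional decisions."
    lines = ["Prior directional decisions. Continue them; do not reopen closed",
             "directions or drop pursued ones without stating an explicit reason:"]
    for raw in LOG_PATH.read_text(encoding="utf-8").splitlines():
        e = json.loads(raw)
        lines.append(f"- [{e['status']}] {e['direction']}: {e['rationale']}")
    return "\n".join(lines)

if __name__ == "__main__":
    record_decision("direction A", "pursue", "weak early signal, but worth two more sessions")
    record_decision("direction B", "closed", "duplicates an existing library feature")
    print(session_preamble())
```

This doesn't stop a new session from forming its own opinion, but it makes the drift visible: a session has to say why it is reopening a closed direction or abandoning a pursued one.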
3. So Everyone Needs to Be a Leader
Looking at the three problems together, one thing becomes clear: the role humans need to play in this collaboration is shifting.
That role looks less like a traditional engineer's and more like a leader's.
The core of leadership isn't knowing more than your team; often, leaders aren't specialists in any particular technical direction. The core is making directional judgments under uncertainty, and being able to stick with or adjust those judgments amid scattered signals and repeated setbacks. These capabilities used to be needed mainly by people managing teams; now every agent user has to shoulder them.
Unlike a traditional team, this "team" of agents costs almost nothing. A real human team runs hundreds of thousands to over a million per month; running a few agents costs a few thousand. Costs that are orders of magnitude lower mean every ordinary person can have their own "team." But the flip side is that decision-making can no longer be delegated. Before, you could count on senior team members to judge direction for you; now agents won't judge for you, and they will very cooperatively follow your guidance even when the direction is wrong.
There are several gaps it can't cover that you have to fill. When it gives up too early, you need to know whether the direction is worth trying a bit longer. When it shuts down a high-potential direction, you need to recognize that and steer it back. When it drifts across sessions, you need to actively maintain the global direction so each new session doesn't judge from scratch.
These abilities don't sound like programming skills; they sound more like product-manager or tech-lead skills. But in 2026, they are becoming basic competencies for every ordinary user.
The era where everyone needs to be a leader has truly arrived.
References
- Introducing Claude Opus 4.7 — Anthropic
- What's new in Claude Opus 4.7 (on the tightening of effort behavior) — Anthropic Docs
- Introducing GPT-5.5 — OpenAI
- GPT-5.5 Instant Becomes ChatGPT's Default Model — TechCrunch
- Codex Goal Mode: Persistent Objectives and Token Budgets — Codex Blog
- Frontier Model Release Velocity Index 2026 Q2 — Digital Applied
- Context Rot in AI Coding Agents — MindStudio
- Why AI Agents Lose Context — Hindsight
- AI Model Release Tracker — Epoch AI
