When a Model Is 5× Faster, It’s No Longer the Same Model

The Gemini 3.5 Flash launch barely talked about intelligence; Zhipu’s GLM-5.1 High-Speed edition hit 400 token/s. Both tell the same story: once inference speed crosses the 5× threshold, the model unlocks an entirely different product category.

Jiawei Guan · 6 min read

Two releases caught my eye this week.

On May 19, Google released Gemini 3.5 Flash. I watched their launch event. Oddly, they barely emphasized the model’s raw intelligence. Benchmarks against the previous generation didn’t exactly stand out either. But they devoted serious time to speed, calling it “frontier intelligence built for speed,” claiming inference is 4× faster than other frontier models.

Today, May 22, Zhipu also launched GLM-5.1 High-Speed, claiming 400 token/s output—the current ceiling for industry APIs. This engine wasn’t built by Zhipu alone; it was a joint effort with a team called TileRT, doing low-level customization specifically for the GLM model family on a specific class of hardware.

Put these two together, then look back at Anthropic’s Opus Fast and OpenAI’s GPT-5.5 Fast over the past few months, and the direction is clear: differentiation at the model layer is changing lanes. Everyone used to compete on smarts; now they’re increasingly competing on speed.

And once speed crosses a certain line, it stops being a linear “X times faster” improvement. AI becomes a different kind of thing.

1. Pricing Already Tells the Story

The clearest evidence is fast-mode pricing.

Anthropic’s Opus Fast: 2.5× the speed, 6× the price.
OpenAI’s GPT-5.5 Fast: 1.5× the speed, 2.5× the cost.

Look at the numbers. If speed were valued linearly, 2.5× speed would cost 2.5× the price, and 1.5× speed would cost 1.5×. But in practice, the price jumps far more than the speed.
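
The arithmetic behind that comparison can be made explicit. A minimal sketch, using only the two speed and price multipliers quoted above; the helper name `price_per_unit_speed` is made up for illustration.

```python
# Speed and price multipliers quoted above, relative to each vendor's standard tier.
FAST_TIERS = {
    "Opus Fast": {"speed": 2.5, "price": 6.0},
    "GPT-5.5 Fast": {"speed": 1.5, "price": 2.5},
}


def price_per_unit_speed(speed: float, price: float) -> float:
    """Return the price multiplier divided by the speed multiplier.

    1.0 would mean speed is priced linearly; above 1.0 means a premium.
    """
    return price / speed


for name, tier in FAST_TIERS.items():
    premium = price_per_unit_speed(tier["speed"], tier["price"])
    print(f"{name}: {premium:.2f}x the linear price")
```

Both ratios come out above 1 (2.4 and roughly 1.67), so in both cases each unit of speed sells at a premium over linear pricing.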

This isn’t greed. It’s a real market signal: some people will happily pay disproportionately more for speed. Either their tasks need high-frequency feedback, or users are sitting there waiting, or downstream steps are blocked. In these scenarios, going from 30 seconds to 12 seconds feels completely different from going from 30 seconds to 20 seconds.

I toggle Opus Fast on and off constantly myself. I turned off GPT-5.5’s 1.5× tier immediately. I couldn’t feel the difference; it was just burning money. But at 2.5×, there are tasks I just leave it on for—mostly when I’m staring at the output and iterating fast.

Markets don’t lie. Something that sells for 6× has buyers who genuinely think it’s worth it.

2. Per-Request Speed and Scaling Out Are Not the Same Thing

Two things need to be kept apart here.

The first is “doing more concurrency at a fixed speed”: the same 30 token/s per request, but serving 1,000 users instead of 100. This is relatively easy. Just throw more machines at it. You can even buy slightly weaker cards and spread the load across them, tuning the cost-performance ratio as you go.

The second is “making a single request faster.” Going from 30 token/s to 400. This is an entirely different beast. You need higher-end hardware, higher memory bandwidth, and cutting-edge packaging. You can’t fix this by “spending a bit more to stack extra cards.” A hundred weak cards won’t get a single request to the speed of one top-tier card.
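
One way to see why extra cards do not help a single request: during decoding, each new token requires reading the active model weights from memory, so memory bandwidth caps per-request speed. The following is a rough back-of-the-envelope sketch. It assumes batch size 1 and ignores KV-cache reads, compute time, and multi-GPU communication, and the numbers in the example are hypothetical rather than measurements of any real model or card.

```python
def max_decode_tokens_per_s(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Upper bound on single-request decode speed when memory-bound.

    Assumes every generated token requires streaming all active weights
    from memory once (batch size 1, no KV-cache or compute cost).
    """
    if bandwidth_gb_s <= 0 or active_weight_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / active_weight_gb


# Hypothetical numbers, for illustration only.
weights_gb = 40.0          # active weights read per generated token
weak_card_gb_s = 1000.0    # memory bandwidth of one weaker card
strong_card_gb_s = 4000.0  # memory bandwidth of one top-tier card

print(max_decode_tokens_per_s(weak_card_gb_s, weights_gb))    # 25.0
print(max_decode_tokens_per_s(strong_card_gb_s, weights_gb))  # 100.0
```

Replicating the weaker card serves more requests, each at the same per-request ceiling. Splitting one request across several cards can raise that ceiling, but it adds communication overhead that this sketch leaves out.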

I’ve spent time experimenting with inference infra myself, optimizing a few open-source models. The cost curves are completely different. The first is roughly linear—double the money, get about double the concurrency. The second is non-linear—that first 20% speedup might cost you 50% more, and it only gets steeper.
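
To make the two curve shapes concrete, here is a toy model. The linear curve follows the “double the money, double the concurrency” description. The power-law curve is purely an assumption for illustration: its exponent is fitted so that a 20% speedup costs 50% more, matching the single data point above. It is not a measured cost model, and the function names are hypothetical.

```python
import math

# Exponent K chosen so that 1.2 ** K == 1.5 (the one data point from the text).
K = math.log(1.5) / math.log(1.2)


def concurrency_cost(scale: float) -> float:
    """Relative cost of serving `scale` times as many users (linear)."""
    return scale


def latency_cost(speedup: float) -> float:
    """Relative cost of making one request `speedup` times faster.

    Assumed power law; the real curve depends on the hardware.
    """
    return speedup ** K


for factor in (1.2, 2.0, 5.0):
    print(
        f"{factor}x: concurrency {concurrency_cost(factor):.2f}x cost, "
        f"single-request speed {latency_cost(factor):.2f}x cost"
    )
```

Whatever the true exponent is, any value above 1 gives the same qualitative picture: the gap between the two curves widens as the target speedup grows.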

So when Gemini 3.5 Flash emphasizes speed, or GLM High-Speed hits 400 token/s, they’re not saying “we made a cheaper version.” They’re saying “we pushed single-request speed to a new level.” That’s a problem of an entirely different magnitude.

3. 5× Is a Speciation Threshold

So why push so hard?

When I think about this, I go back to a simple comparison.

If you want something done faster, the traditional options are limited.

First, hire smarter people. But that hits a ceiling. There are only so many world-class experts, and today’s best models are already brushing up against that ceiling.

Second, make people work overtime. Agents already run 24/7, so that ceiling is gone too.

Third, divide the work and throw more people at it. But anyone who’s done engineering knows adding people doesn’t scale linearly. Adding one person doesn’t make it twice as fast; adding ten gets you nowhere near 10×. You have to break things down, hand off, coordinate, deal with uneven quality, manage waste. The ramp-up period for new hires is expensive. If you’re doing multi-agent orchestration, you know exactly what I mean.

At this point, the traditional paths to speed are tapped out.

So what’s left? Make the model itself—the same employee—faster.

And making that “same employee” faster is non-linear.

Imagine an employee who used to take an hour to finish a task. Now they do it in ten minutes. You think you just saved fifty minutes? It’s more than that.

You’ll start giving them tasks you’d never have bothered with because “it’s too slow.” Small ad-hoc requests that used to take an hour—so you never asked—now come back in ten minutes, and you make a dozen a day. Speed unlocks tasks that literally didn’t exist before.

I saw a demo the other day: someone wearing glasses pointed at a video on a screen and said “zoom in on this,” and the AI behind it wrote code to resize the element. If the whole chain takes thirty seconds, you glance and walk away—there’s no real interaction. But if it finishes in five seconds, the feel is completely different; it becomes a genuinely usable product.

That’s the gap between 50 token/s and 400 token/s. An 8× speedup unlocks products that were impossible to build before.
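
The arithmetic is simple. A minimal sketch, assuming a hypothetical response of 1,500 output tokens and ignoring time-to-first-token, network latency, and any tool execution:

```python
def generation_seconds(output_tokens: int, tokens_per_s: float) -> float:
    """Time to stream `output_tokens` at a steady decode speed."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return output_tokens / tokens_per_s


OUTPUT_TOKENS = 1500  # hypothetical size of the generated code

for speed in (50, 400):
    print(f"{speed} token/s -> {generation_seconds(OUTPUT_TOKENS, speed):.2f} s")
```

Under these assumptions, the same response takes 30 seconds at 50 token/s and under 4 seconds at 400 token/s, which lines up with the thirty-second versus five-second contrast in the demo.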

A speedup beyond 5× is a speciation line.

4. The Return of Specialization

Okay, speed is valuable. How do you actually achieve it?

That brings us to TileRT’s approach, which diverges from where the industry was a year ago.

Mainstream inference frameworks like vLLM, TensorRT-LLM, and SGLang are general-purpose. They aim to “support as many models as possible, running well enough on as much hardware as possible.” That has always been software engineering’s default bias: generality first, performance second.

TileRT does the opposite. It statically schedules the entire inference graph at compile time, running as a persistent kernel on the GPU with almost no runtime dynamic scheduling. Micro-tasks at tile-level granularity squeeze the hardware close to its physical limits. The cost? Change the model and it’s basically scrap; change the hardware and it needs major rework.
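
The difference between the two scheduling styles can be illustrated with a toy example. This is not TileRT’s code or API, just a Python sketch of the general idea: a dynamic runtime decides what to run next while the graph executes, whereas a static scheduler resolves the order once, ahead of time, and the run loop only replays it. The graph, op names, and functions are all hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical op graph: each op maps to the set of ops it depends on.
GRAPH = {
    "embed": set(),
    "attn": {"embed"},
    "mlp": {"attn"},
    "logits": {"mlp"},
}


def run_dynamic(graph: dict[str, set[str]]) -> list[str]:
    """Resolve dependencies at run time, on every call."""
    done: list[str] = []
    pending = dict(graph)
    while pending:
        # The scheduling decision happens inside the hot loop.
        ready = sorted(op for op, deps in pending.items() if deps <= set(done))
        if not ready:
            raise ValueError("dependency cycle in graph")
        for op in ready:
            done.append(op)
            del pending[op]
    return done


def compile_schedule(graph: dict[str, set[str]]) -> tuple[str, ...]:
    """Resolve the execution order once, ahead of time."""
    return tuple(TopologicalSorter(graph).static_order())


def run_static(schedule: tuple[str, ...]) -> list[str]:
    """Replay a fixed schedule; no dependency checks at run time."""
    return list(schedule)


SCHEDULE = compile_schedule(GRAPH)  # the "compile time" step
print(run_dynamic(GRAPH))
print(run_static(SCHEDULE))
```

Both calls produce the same order for this graph. The static version pays the scheduling cost once, and in return the schedule is tied to that exact graph: change the model and it has to be recompiled, which mirrors the trade-off described above.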

DeepSeek is on the same path. Their own inference engine started out based on vLLM, then underwent more than a year of deep customization—almost every path was rewritten for their own MoE architecture. When they open-sourced part of it recently, the industry’s reaction wasn’t “how general-purpose this is,” but “how deep you can go for a single model.”

Go one layer deeper, and the hardware side has been on this path for a while. Groq’s LPU runs Llama 4 Scout at 460 token/s, 3–4× what an H100 delivers. Cerebras’s WSE-3 hits 1,800 token/s on a 70B model and nearly 3,000 on gpt-oss-120B. These are specialized chips. They aren’t trying to run every kind of model; they’re built to take a specific workload to the extreme.

Chip designers have debated this for decades: general-purpose CPU or specialized ASIC? General chips have their place, but when a domain is big enough and the lifecycle is long enough, specialization pays off.

The software layer used to avoid this, mainly because software isn’t cheap to write. Building a dedicated inference stack for one model takes too long to pay off; the model changes and your software is dead.

That’s changing. AI agents can write software now. The cost of “building an optimal inference stack from scratch for a specific model and specific hardware” drops every year. Once it falls below a certain threshold, specialization becomes the default.

Every promising model will eventually have its own dedicated inference engine. Every generation of mainstream hardware will have its own specially optimized stack. What you used to think of as just “the last 5% of optimization” could now become a 5× or 10× gap.

5. Vertical Integration at the Model Layer Is Inevitable

Pulling these threads together.

Intelligence will keep improving in the short term, but the marginal utility of competing on raw smarts is declining. Given a choice between a model that’s 20% smarter and the same model running 10× faster, many users will find the latter far more valuable, especially for the new scenarios that speed itself unlocks.

So the next phase of competition shifts from “point intelligence” to “end-to-end capability.” Model, inference engine, and hardware—all three bundled together.

If you’re at 400 token/s and I’m at 30 token/s, even if my model is 20% smarter, I’m unusable in many scenarios. I’ll be watching my smartest model sit there slowly spitting out words while you’ve already delivered the whole product experience to the user.

DeepSeek and Zhipu are already doing this. Anthropic and OpenAI are too. Google probably went the earliest and deepest—the TPU + Gemini combo has been running internally for a long time. My guess is that over the next year or two, the whole industry moves this way: model companies must own their inference stack and go deep into the hardware layer; hardware companies must go deep into model architecture; and the generic middle layer gets squeezed from both ends.

For engineers, this is pretty exciting. We used to think “general, scalable, portable” was good taste. For the foreseeable future, the opposite may hold: writing the most extreme code for a specific model and specific hardware—code that breaks if you change anything—becomes worth doing again.

Software engineering aesthetics will have to change.

