Tags: Infra

5 posts

The Hidden Thread in the Token Business: Cost Is Set by KV Cache Hits, Not Throughput

When people estimate token costs, they usually watch TTFT (time to first token), TPOT (time per output token), and throughput. What actually makes bills differ by 10× is whether requests hit the KV cache, and for that the model, server, and user layers all have to line up.

May 28, 2026 · 5 min read

AI Token Infra Compute Thoughts

A Token Is Not a Thing

Demand for GPT-5.5 and Opus 4.7 is nearly infinite, the mid-tier has vanished, and low-to-mid-range compute sits idle. The token economy sounds like selling electricity, but it's more like a gas station: 98-octane is sold out, diesel tanks are full yet self-service only, and 95-octane sits empty.

May 26, 2026 · 7 min read

AI Inference Optimization Infra Reflections

When a Model Is 5× Faster, It’s No Longer the Same Model

The Gemini 3.5 Flash launch barely talked about intelligence; Zhipu’s GLM-5.1 High-Speed edition hit 400 tokens/s. Both tell the same story: once inference speed crosses the 5× threshold, the model unlocks an entirely different product category.

May 22, 2026 · 6 min read

AI AI Agent Infra CPU Thinking

The Most Expensive Waste in the Agent Era: GPUs Waiting on CPUs

I ran seven hundred rounds of AI infra experiments, and thirty-five hours of that time went entirely to environment startup. At first I thought GPT-5.5 fast mode wasn't fast enough, but I later realized the model wasn't thinking; it was waiting on the CPU. Intel has already tightened the server CPU:GPU ratio from 1:8 to 1:1.

May 7, 2026 · 6 min read

AI DeepSeek Infra Model Thoughts

DeepSeek V4 Day: It's About Infra, Not the Model

V4's capabilities sit around the Opus 4.6 tier, but pushing FP4 to production, making million-token context the default, and shipping day-0 support for domestic chips add up to a disaster for everyone in the inference infra business. Add GPT-5.5, Vision Banana, and LPM 1.0 to the mix, and this week has crammed in more new releases than the entire past quarter.

April 24, 2026 · 7 min read