You often see people online saying that in the AI era, reliability matters most.
The first time I saw it, it sounded like a tired cliché. Every era gets assigned its own buzzword; "intelligence" and "execution" have already had their turn. Does "reliability" actually fit the AI era any better? At the time, I didn't think so.
A few recent projects finally drove the point home.
Three Characters a Day, Thousands of Assets Over a Hundred Days
I made a PAW Patrol reading game for my son. Three characters a day, three hundred over a hundred days. The number looks small.
The thing is, each of those hundred days is an independent mini-level. Every day needs about six or seven images and dozens of audio clips. The voices are cloned from a PAW Patrol character, each line matching a specific script. Add it up across a hundred days, do the math, and you're looking at thousands of assets, easy.
The day-one demo worked. My son and I sat there playing for twenty minutes, having fun. That's exactly where the problem started.
I thought the rest was just running that demo ninety-nine more times. Turns out the real thing and the demo are two completely different beasts.
I randomly picked two clips from the first batch of twenty. Both were bad. One dropped two characters; the other's emotion completely mismatched the line. Would I dare use the other eighteen? I sampled again. Still broken. With over a thousand assets across a hundred days, how many would actually be usable? I had no idea.
That feeling of uncertainty is the critical part. It's not a minor issue; it stops you cold.
The Extra Layer Is Called Quality Assurance
Demos work because a human is watching. Generate one, listen, if it's no good try again, pick the best and keep it. The whole process is manual. The human is an invisible QA layer.
To make it fully automated, you have to swap "human in the loop" for "model in the loop." That's QA.
Sounds simple. Doing it opens a whole new world.
One Audio Clip, Scored by Three or Four Models at Once
I started digging into the industry's niche models. The usual suspects are ASR and TTS: ASR transcribes speech, TTS generates it. But scoring TTS output for quality? There's a whole category of models built just for that.
DNSMOS is from Microsoft. Originally built to score noise-suppression algorithms, it doesn't need the original clean audio as reference; from a single clip it judges how much noise is present and whether the overall result is listenable. Later people found it's also sensitive to TTS artifacts.
NISQA comes from Gabriel Mittag's team at TU Berlin. It includes a NISQA-TTS weight specifically for TTS naturalness. Instead of a single score, it breaks things down into dimensions: noise, coloration, discontinuity, loudness.
UTMOS is from the SaruLab team at the University of Tokyo, winner of the VoiceMOS Challenge 2022, and now the de-facto baseline for TTS scoring. I use it as the outermost backstop.
Finally there's a reverse ASR pass: feed the generated audio through Whisper, compare the transcript to the original script, and reject it if the gap is too big. It's the crudest check, but the most reliable.
Add up the four scores; pass the threshold and it's good, fail and it triggers regeneration. I spent a day wiring it up, and the output was clearly better than running TTS alone.
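One way to wire that gate, sketched below, is per-metric thresholds rather than a single summed score. The threshold values are assumptions to tune against your own clips, and the DNSMOS/NISQA/UTMOS scores are taken as inputs (each project ships its own inference code); only the reverse-ASR check is implemented here, as a plain transcript-similarity ratio.

```python
import difflib

# Assumed thresholds -- tune these against a few hundred of your own clips.
THRESHOLDS = {"dnsmos": 3.0, "nisqa": 3.5, "utmos": 3.2}
MIN_TRANSCRIPT_SIMILARITY = 0.9

def transcript_similarity(script: str, transcript: str) -> float:
    """Reverse-ASR check: how close is the Whisper transcript to the script?

    SequenceMatcher.ratio() returns 1.0 for identical strings and drops
    quickly when the TTS output skips or mangles characters.
    """
    return difflib.SequenceMatcher(None, script.lower(), transcript.lower()).ratio()

def passes_qa(scores: dict[str, float], script: str, transcript: str) -> bool:
    """A clip passes only if every scorer clears its threshold AND the
    ASR round-trip roughly matches the intended line."""
    if any(scores[name] < t for name, t in THRESHOLDS.items()):
        return False
    return transcript_similarity(script, transcript) >= MIN_TRANSCRIPT_SIMILARITY
```

A clip that drops a couple of words, like the ones in my first sample, fails the similarity check even when the acoustic scores look fine; that is exactly why the crude ASR pass earns its keep.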
From 1× to 3× Is Not an Exaggeration
But the cost went through the roof.
Before adding QA, I figured it would add maybe 50% more time. Pick out the bad ones and regenerate. Worst case, the new batch is all bad too, so you run it again. At most 1.5×.
In reality, one hour became three.
The reason is that the model simply can't clear a certain line. The same script, different random seeds, seven, eight, nine tries and it still can't pass the QA threshold. Sometimes you have to fall back to changing the prompt, the speaking rate, the emotion tag, just to squeeze it through. Every regeneration is a full model call, burning tokens each time.
Run this in the cloud, billed by the minute or by the call, and racking up a hefty bill in minutes is no exaggeration. I later did a rough calculation: a voice generation task I had planned to run entirely in the cloud would cost roughly 7 to 10× the demo bill once you factor in retry rates.
This was the math I hadn't done: from demo to production line, costs jump by orders of magnitude, not percentages.
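The back-of-envelope version of that math: if retries are independent and each attempt passes QA with probability p, the expected number of attempts per accepted clip is geometric, 1/p. The pass rates below are illustrative, not measured.

```python
def expected_attempts(pass_rate: float) -> float:
    # With independent retries, attempts until the first pass follow a
    # geometric distribution: E[attempts] = 1 / p.
    return 1.0 / pass_rate

# A demo budget implicitly assumes every clip passes first try (1x).
# If only about one generation in three clears the QA gate, the pipeline
# runs ~3x the calls -- the one-hour-becomes-three jump.
print(expected_attempts(1.0))     # demo assumption
print(expected_attempts(1 / 3))   # a pass rate closer to what I saw
```

Stack QA scoring overhead and the occasional script that needs seven or eight seeds on top of that 3×, and a 7 to 10× cloud bill stops looking surprising.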
The Same Thing Happened Again on Another Project
Recently I've been playing with another project called MiroFish.
It's a pretty interesting open-source project by Guo Hangjiang, an undergraduate in China. It hit #1 on global GitHub Trending in March 2026, with investment from Chen Tianqiao of Shanda Group. It generates a large population of agents with personalities, memories, and social relationships, then runs them across two simulation platforms where they discuss, debate, form alliances, and shift opinions. Finally, a ReportAgent summarizes the conclusions of the entire evolution to predict how an event will unfold.
My config wasn't large. About 54 agents per event, across a 20-round timeline, with every agent running once per round. 54 × 20: roughly 1,080 full calls.
I used Kimi K2.6 Thinking. The problem is you can't turn off thinking mode; it thinks before every output. Thousands of thinking tokens per call is normal. Multiply by 1,000, and the token burn hurts.
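The token arithmetic, with per-call counts that are assumptions for illustration rather than measured Kimi figures:

```python
# Rough token math for the simulation run described above.
agents, rounds = 54, 20
calls = agents * rounds              # 1,080 full model calls per event

thinking_tokens_per_call = 2_000     # "thousands of thinking tokens" per call
output_tokens_per_call = 300         # the actual line or vote is short

total_tokens = calls * (thinking_tokens_per_call + output_tokens_per_call)
print(calls, total_tokens)           # 1080 calls, 2,484,000 output-side tokens
```

Note this counts only the output side; every call also re-reads its context window as input tokens, so the real bill is higher still.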
After a few runs, I started wondering: does this scenario really need a top-tier model?
Each agent, on its turn, just scans the context, says a line or casts a vote based on its persona, then gets aggregated. The intelligence threshold for each call is actually low. Swap in a last-year model, something around GPT-4o level, and the results are probably similar, only faster.
The Mid-Tier Slot That No One Has Clearly Defined
For the past year, one question has gone unanswered: what scenarios actually need a second-tier model? Everyone races for the best and most expensive, leaving the mid-tier in an awkward spot.
I now see two very specific slots.
First is quality assurance. Judging whether an audio clip sounds natural, whether an image matches the style, whether a conversation stays on topic, these tasks require mid-tier intelligence. Using a top model here is like using Claude Opus to review GPT-4o's code. It works, but it's not cost-effective. A lightweight vision model plus a specialized scorer like NISQA costs far less than one top-tier call.
Second is large-scale agent simulation. A setup like MiroFish strings together 1,000 inferences to reach a collective evolutionary result. It's not sensitive to the quality of any single call, but extremely sensitive to total cost. The "best model" for this scenario isn't the smartest; it's whatever gives you the best mix of per-token price and inference speed.
These two scenarios hadn't been clearly spelled out because almost no one was actually batch-producing content at industrial scale. Once you need to generate thousands of audio clips or run tens of thousands of agent inferences, both slots jump right out.
The Second Reason for Going Local
This is also when I finally understood why local compute matters so much.
The two reasons usually cited for local deployment are speed and privacy. Both are valid, but neither is the decisive one.
The real killer is cost per call.
An industrial pipeline is bound to retry heavily. Cloud TTS is billed by duration, token models by the call. Every retry is another invoice. Local is different. A DGX Spark running open-source models like F5-TTS or VoxCPM incurs zero marginal cost beyond electricity. Leave it running for a day and you get enough material for a week. Failed? Run it again, no big deal.
This is the fundamental difference between cloud and local models in industrial scenarios. The former charges by usage; the latter only charges once for hardware. In a high-retry-rate pipeline, that gap gets magnified by orders of magnitude.
The reason local deployment never made sense in past discussions is that everyone compared it to demo costs. A demo TTS run costs pennies. Set that against a local machine costing thousands, and the math never works. But compare it to industrial-scale costs, factoring in retry rates, QA, and agent simulation, and the math flips immediately.
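A break-even sketch makes the flip concrete. Every figure here is an assumption for illustration: a local box in the DGX Spark price class, a cloud voice-cloning API priced per generated clip, and the roughly 3× retry multiplier from the QA discussion earlier.

```python
# How many accepted clips before the local box pays for itself?
hardware_cost = 4_000.0        # one-time, assumed price of a local GPU box
cloud_cost_per_attempt = 0.15  # assumed per-clip price of a cloud cloning API
retry_multiplier = 3.0         # attempts per accepted clip, per the 1x -> 3x jump

cloud_cost_per_accepted = cloud_cost_per_attempt * retry_multiplier
break_even_clips = hardware_cost / cloud_cost_per_accepted
print(round(break_even_clips))  # -> 8889, i.e. roughly nine thousand clips
```

Against a demo that needs twenty clips, the hardware never pays off. Against a pipeline producing thousands of accepted assets, each dragging its failed siblings behind it, the crossover is within sight of a single project.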
Three Tiers, Three Positions
Writing this, I suddenly realized that industrialized AI content production actually needs three tiers of models running simultaneously.
- Top-tier models at the front, handling the hardest generation tasks. Expensive per call, but you don't run them often.
- Mid-tier models in the middle, for QA and agent simulation: high-volume, medium-intelligence tasks. Called repeatedly, so each call must stay cheap.
- Local models at the bottom doing the heavy lifting: asset generation, vectorization, transcription, alignment, the grunt work. If it can run locally, don't send it to the cloud.
You won't find these three tiers in any official tutorial; the setup is still evolving. But once you actually get your hands dirty with industrial content production, you'll end up piecing together this structure yourself.
Looking back at that opening line that sounded like empty talk, I actually think it understated things. In the AI era, what matters most isn't "reliability" itself; it's the cost curve of reliability. From demo to production line, that curve starts at 3×.
Understand that curve, and you know how to spend money. Otherwise you'll budget for 1× and get a bill for 7×.
References
- DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors (Microsoft Research)
- DNSMOS P.835 ICASSP 2022 Paper (arXiv)
- NISQA: Non-Intrusive Speech Quality and TTS Naturalness Assessment (GitHub)
- NISQA Speech Quality Corpus (TU Berlin DepositOnce)
- UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022 (GitHub)
- F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching (GitHub)
- VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation (OpenBMB)
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision (OpenAI)
- Kimi K2 Thinking API Pricing (OpenRouter)
- MiroFish: A Simple and Universal Swarm Intelligence Engine (GitHub)
- MiroFish: Swarm Intelligence with 1M Agents That Can Predict Everything (Medium)
