I previously wrote three posts about edge AI, discussing cost, power dynamics, and legal boundaries. All of them addressed "why edge AI is worth doing." This post takes the next step and talks about what this device should actually look like.
Large Models Don't Run Well on the Edge
We've tested quite a few boxes—AMD's, our own—and the conclusion is the same: running large language models doesn't work. It's not that a specific product doesn't work; this entire class of devices hits the same wall. Prefill performance can't scale up, and decode is barely passable.
Some devices include an NPU rated at 50 TOPS, which sounds impressive, but it can't actually pipeline with the GPU and CPU. After fiddling with it, you realize you can't use it for large-model inference at all. Chip vendors understandably prioritize optimizing their latest generation; after all, they need to compete with NVIDIA. What lands in edge devices is last-gen hardware, a tier behind in optimization, and the software ecosystem can't keep up.
The supply chain is even more troublesome. Decent memory is a moving target: a batch that costs 10,000 in January doubles to 20,000 by February, and even at that price you can't get stock. Trying to solve this by throwing hardware at the problem simply doesn't work.
What's Actually Blocking Users Isn't Compute
You install an Agent platform, and it immediately starts asking you to connect external services. Want memory? Connect to OpenAI's embedding API. Want voice? Connect to ElevenLabs. Images go to Google, PPT generation connects via MCP. Each requires separate registration and separate payment. For users in China, it's worse—Kimi might not even offer an embedding API, so you have to scramble to find other accounts to fill the gaps.
Miss one service, and the Agent gets noticeably weaker.
These auxiliary capabilities—embedding, TTS, ASR, OCR, visual understanding—none of them are large, and none are compute-hungry. But combined, they're what determines whether an Agent is actually usable. Edge devices can't beat the cloud on large models, but these small models? They can run them just fine.
Don't Waste the NPU
The NPU has 50 TOPS of compute but can't join the large-model pipeline? Then put an OCR model on it, or run a VLM for vision tasks. Which model fits best depends on the specific chip architecture, but NPUs were originally designed for CV workloads, so vision tasks should be able to leverage them effectively.
Let the GPU handle TTS and embedding. Neither model is large—embedding is a few hundred MB, audio models are one or two GB.
I tested this on an AMD box: pure CPU running Alibaba's TTS model, a tiny thing with just over 100 million parameters, takes twenty-something seconds to generate a short sentence. No acceleration engine supports it, vLLM doesn't support it, so you're stuck with plain Transformers inference. Switching to GPU is a completely different experience. ASR, on the other hand, runs fine on CPU. So the configuration becomes: GPU handles synthesis, CPU handles recognition, NPU handles vision.
Configured this way, one device runs embedding, TTS, ASR, OCR, and VLM all locally. Users don't need to connect any external accounts—the Agent's auxiliary capabilities are simply there.
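That split can be sketched as a simple placement table that routes each auxiliary task to the unit best suited for it. The exact assignments below follow the reasoning above, but they are assumptions to be tuned per chip, not a spec:

```python
# Sketch: route each auxiliary task to the compute unit best suited for it.
# The assignments mirror the split described above; tune per device.
PLACEMENT = {
    "embedding": "gpu",  # a few hundred MB, latency-sensitive
    "tts":       "gpu",  # one or two GB; CPU takes 20+ seconds per sentence
    "asr":       "cpu",  # recognition runs fine on CPU
    "ocr":       "npu",  # NPUs were designed for CV workloads
    "vlm":       "npu",  # vision understanding is also a CV-style workload
}

def device_for(task: str) -> str:
    """Return the compute unit a task should run on, defaulting to CPU."""
    return PLACEMENT.get(task, "cpu")

print(device_for("tts"))      # gpu
print(device_for("summary"))  # cpu (unknown tasks fall back to CPU)
```

The point of making this an explicit table is that nothing needs to leave the box: the scheduler just looks up where a task lives.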
Make It Like a Router
A user came to me and said he accidentally turned off his computer, and discovered the AI wasn't responding. Pretty funny—an e-commerce guy specifically came to ask about this.
But think about it—this isn't the user's fault. Make it look like a computer, and people will turn it off. Laptops are meant to be carried around; you close them when you're done. But Agents need to run 24/7—they break when you shut them down.
I later decided this thing shouldn't look like a computer at all—it should look like a router. When was the last time you turned off your home router? It runs quietly in the corner; you never even think about it. Good heat dissipation, low noise, controlled power consumption—just leave it there and forget it.
A Mac mini on steroids. 16GB of memory is enough. We previously planned a 64GB configuration, then realized it was completely unnecessary. 16GB of unified memory comfortably holds the auxiliary models, and with the GPU and NPU each contributing compute, feeding these small models is no problem.
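As a rough sanity check on the 16GB figure, here is back-of-the-envelope arithmetic. The per-model footprints are assumptions consistent with the sizes mentioned earlier (embedding a few hundred MB, audio models one or two GB), not measured numbers:

```python
# Assumed resident footprints for the auxiliary stack, in GB.
footprints_gb = {
    "embedding": 0.5,   # "a few hundred MB"
    "tts":       2.0,   # "one or two GB"
    "asr":       1.0,
    "ocr":       0.5,
    "vlm":       4.0,   # a small vision-language model
}

total = sum(footprints_gb.values())
headroom = 16.0 - total  # unified memory minus model weights
print(f"models: {total:.1f} GB, headroom: {headroom:.1f} GB")
# models: 8.0 GB, headroom: 8.0 GB
```

Even with generous estimates, the whole stack fits in half the memory, leaving room for the OS, runtimes, and request buffers. That is why 64GB was overkill.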
Why Enterprises Would Buy a Fleet of Them
Enterprises wanting to use digital employees face the biggest headache: data isolation.
Cram all digital employees into one server, separated by Docker containers? Agents need broad permissions, and once they have them they can break out. You can invent all sorts of ways to block them, but if they're smart enough, containers won't stop them. I've been thinking about this problem too: doing isolation on a single machine makes the architecture extremely complex.
The dumbest and most effective method is physical isolation. One box per digital employee. The product manager's box only has product-related data; it can't touch finance. Finance's box can't touch the code repository. Physically separated, you don't need to worry about permission escalation at all.
A friend is already doing this: 4 Mac minis plus 4 cloud models, 8 digital employees each managing its own domain. So an enterprise might not buy just one box but a fleet. Each unit only needs enough power to run the auxiliary models; what matters is quantity, not how powerful any single machine is.
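The one-box-per-employee idea reduces access control to a trivially checkable rule: a box can only touch the data that physically lives on it. A minimal sketch, with made-up role names and datasets:

```python
# Sketch: physical isolation as a trivially checkable rule.
# One box = one role = one data scope; there is nothing to escalate into.
BOXES = {
    "box-pm":      {"role": "product",  "data": {"specs", "roadmap"}},
    "box-finance": {"role": "finance",  "data": {"ledger", "payroll"}},
    "box-dev":     {"role": "engineer", "data": {"repo", "ci"}},
}

def can_access(box: str, dataset: str) -> bool:
    """A box can only touch data that physically lives on it."""
    return dataset in BOXES[box]["data"]

assert can_access("box-pm", "roadmap")
assert not can_access("box-pm", "ledger")  # product box can't touch finance
```

There is no permission model to misconfigure here; the check is just "is this dataset on this machine?", which is exactly what physical separation enforces.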
That Camera Data Problem
Many homes have surveillance cameras installed, but have you actually looked through those recordings? Probably not. Uploading to cloud analysis is expensive and unsettling; not uploading means they sit there gathering dust.
I've always felt this is the most classic scenario for local vision models. The device is idle at night anyway—let the box run through a day's camera footage, flag anomalies, and give a summary the next morning. No rush, take your time calculating. No cloud API calls, no per-token charges, data never leaves the house from start to finish.
Enterprise campuses face the same problem—massive amounts of private video data. Give it time to process; no need for real-time, but it needs to be cheap, and the data must not leave the premises. Handing this scenario to the cloud, however you calculate it, doesn't make sense.
Build in This Direction, and It Can Work
Bottom line: don't use edge devices to compete with the cloud on large models. Run auxiliary models—embedding, TTS, ASR, OCR, VLM—and put GPU, NPU, and CPU all to work. Make it look like a router, keep it online 24/7, with 16GB memory. One for individuals, a fleet for enterprises, physically isolated.
The previous three posts explained why it's worth doing; this one explains what it should look like. I now feel the direction is basically right. What's left is who builds this product first and makes it cheap enough.