Constantly Installing Models for Others
Starting in 2025, a growing number of hardware vendors and customers reached out, expressing interest in our model management platform.
The reason was straightforward: the interface was easy to use, model-centric, and you could get things running with a single click. The barrier to entry was low. Some customers were even willing to pay us to deploy the same platform for them.
But that platform was originally designed for specific hardware. NVIDIA GPUs, Huawei Ascend, Hygon DCU, AMD ROCm—each had completely different drivers, device mounts, environment variables, and security contexts. Adapting to every new hardware type required writing massive amounts of code. It was painful.
What hurt even more was the aftermath. Customers kept coming back: help us adapt the latest model, help us upgrade the inference engine. No matter how much the platform sold for, it ultimately required continuous manual investment to maintain.
I started wondering: why do people actually need a platform like this? After going around in circles, the answer always came back to the same thing—TCO, Total Cost of Ownership. I wrote a dedicated post about the TCO trap of edge AI devices earlier.
Halving in Three Months
AI moves too fast.
More than ten new models can emerge within three months. You spend hundreds of thousands on an AI server, get it tuned, and run the best model available at the time. Leave it untouched. Three months later, a new model is roughly twice as capable. The hardware's value gets cut in half.
Add to that the progress in inference engines—vLLM, SGLang, llama.cpp—every release squeezes out more performance, widening the gap further. If you don't upgrade for six months, compared to the industry's best, your hardware might only retain about 30% of its original value.
The hardware isn't broken, but its value is depreciating.
Our previous platform worked great out of the box, but once you needed to import the latest models or adapt to engine changes, the software couldn't keep up. You can't know what new things will look like before they arrive, whether engine APIs will change, or if hardware drivers will suddenly update. There's no way to be future-compatible.
The Cost of Compromise
Some might say: aren't Ollama and LM Studio good enough? One command or one app, and the model downloads and runs.
My view is that this is a reluctant choice from a bygone era.
To lower TCO, you often have to simultaneously lower the actual performance of the device. To be more compatible and easier to use, these tools make a lot of compromise-driven presets. Take Ollama: one-click startup works fine. But if you want to tune advanced inference engine parameters or push higher concurrency? You'll hit walls.
A vLLM inference engine has dozens of advanced parameters. The difference between turning some on or off can be 50% in performance. Behavior varies across versions, and optimal values differ by model. Unless you've actually run it on real hardware, you have no way of knowing.
Lowering TCO and maintaining SOTA performance are in conflict with these solutions.
How to Minimize TCO for AI Inference Devices?
What AIMA wants to do is simple in theory: let devices deliver SOTA performance in as many scenarios as possible, while driving TCO down to little more than the hardware plus electricity.
Of course, doing it is not simple. It's a four-dimensional optimization problem—hardware, inference engine, model, and application—all four dimensions are changing rapidly, creating an enormous combinatorial space.
AIMA's approach is to make itself a thin infrastructure layer. A single Go binary, cross-platform compilation, zero CGO dependency—install it and use it.
# Detect hardware
aima hal detect
# Initialize infrastructure
sudo aima init
# Deploy model (auto-matches engine and config)
aima deploy apply --model qwen3.5-35b-a3b

Three commands, from bare metal to running model inference. What's happening behind the scenes? AIMA detects your GPU model and VRAM, matches the optimal engine and parameter configuration from the YAML knowledge base, generates a K3S Pod manifest, and brings up the inference service. The entire process doesn't require you to know whether to use vLLM or llama.cpp, or to manually configure CUDA paths or ROCm device mounts.
Currently supported hardware includes NVIDIA RTX 4060/4090/GB10, AMD Radeon 8060S and Ryzen AI MAX+ 395, Huawei Ascend 910B, Hygon BW150 DCU, and Apple M4. On the engine side, it supports vLLM, llama.cpp, SGLang, and Ollama.
Knowledge, Not Code
AIMA's approach is: knowledge beats code.
Traditionally, supporting a new piece of hardware or a new engine means writing piles of if-else branches. AIMA doesn't do that. Hardware characteristics, engine parameters, and model configurations are all defined in YAML files. The Go code only does numerical comparison and generic rendering, containing no vendor-specific branches.
Supporting a new engine? Write a YAML. Supporting a new model? Also write a YAML. 80% of capability extensions don't require recompilation.
The knowledge base looks like this: every model has multiple variants, and each variant is annotated with applicable GPU architectures, minimum VRAM requirements, inference engine types, and specific launch parameters. AIMA's ConfigResolver automatically matches the most suitable variant based on your current hardware state. If your RTX 4060 only has 8GB of VRAM, it will skip the vLLM option that requires 16GB and automatically fall back to a llama.cpp GGUF option.
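The matching logic described above can be sketched in a few lines of Go. This is an illustrative reconstruction, not AIMA's actual schema: the `Variant` fields, `resolve` function, and architecture names are all assumptions made for the example.

```go
package main

import "fmt"

// Variant mirrors one hypothetical knowledge-base entry: which engine
// to launch and what it needs from the hardware. Field names are
// illustrative, not AIMA's actual schema.
type Variant struct {
	Engine  string   // e.g. "vllm" or "llama.cpp"
	Format  string   // e.g. "safetensors" or "gguf"
	MinVRAM int      // minimum VRAM in GB
	Archs   []string // applicable GPU architectures
}

// resolve returns the first variant whose requirements the detected
// hardware satisfies; variants are assumed to be ordered best-first.
func resolve(variants []Variant, arch string, vramGB int) (Variant, bool) {
	for _, v := range variants {
		if vramGB < v.MinVRAM {
			continue
		}
		for _, a := range v.Archs {
			if a == arch {
				return v, true
			}
		}
	}
	return Variant{}, false
}

func main() {
	variants := []Variant{
		{Engine: "vllm", Format: "safetensors", MinVRAM: 16, Archs: []string{"ada"}},
		{Engine: "llama.cpp", Format: "gguf", MinVRAM: 6, Archs: []string{"ada"}},
	}
	// An 8 GB RTX 4060 ("ada") fails the 16 GB vLLM requirement and
	// falls back to the llama.cpp GGUF variant.
	if v, ok := resolve(variants, "ada", 8); ok {
		fmt.Println(v.Engine, v.Format) // llama.cpp gguf
	}
}
```

The point is that this is the only kind of logic the binary needs: numeric comparison and list membership, with every vendor-specific fact living in data.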
All those fragmented pieces of knowledge that used to be scattered across forum posts, GitHub issues, and personal notes now have a unified, structured expression. Anyone can contribute a YAML, and anyone can reuse a configuration that someone else has already validated.
57 MCP Tools
AIMA exposes 57 MCP tools, covering hardware detection, model management, engine management, deployment, knowledge base, benchmarking, cluster management, and more.
Why does this matter? Because MCP is the standard protocol for AI Agents to operate external tools. By making everything available as MCP tools, any MCP Client—Claude Code, GPT, or an Agent you wrote yourself—can directly control everything on that device.
The CLI is a thin wrapper around the MCP tools, containing no additional logic. Humans use the CLI, Agents use MCP, and both follow the exact same code path.
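The "one code path" claim can be sketched as a shared tool registry. The names and stub below are mine, not AIMA's real tool implementations; the shape is what matters: CLI flags and MCP requests both collapse into the same `call`.

```go
package main

import (
	"errors"
	"fmt"
)

// Tool is a named function; both the CLI and the MCP server dispatch
// into the same registry, so there is exactly one code path. The
// registry contents here are stubs, not AIMA's real tools.
type Tool func(args map[string]string) (string, error)

var registry = map[string]Tool{
	"hardware.detect": func(args map[string]string) (string, error) {
		return "gpu=rtx4060 vram=8GB", nil // stubbed detection result
	},
}

// call is the single entry point: the CLI turns flags into args, the
// MCP server turns a JSON-RPC request into args, and both land here.
func call(name string, args map[string]string) (string, error) {
	t, ok := registry[name]
	if !ok {
		return "", errors.New("unknown tool: " + name)
	}
	return t(args)
}

func main() {
	out, err := call("hardware.detect", nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Because the CLI adds no logic of its own, anything a human can do from the terminal, an Agent can do over MCP with identical behavior.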
The "Managed by AI" in the name refers to exactly this. AIMA doesn't embed a large model running internally; instead, it turns itself into infrastructure that AI Agents can directly manipulate. The Agent detects hardware, queries the knowledge base, deploys models, runs benchmarks, checks results, and adjusts configurations—the entire loop can run without human intervention.
Progressive Intelligence
AIMA has a design I find quite interesting: not every scenario has AI available, so it implements five levels of progressive intelligence.
At the bottom is L0. The default values from the YAML knowledge base are embedded directly into the Go binary at compile time. No internet, no AI, nothing—L0 still gives you a working inference service. It won't be optimal, but it provides a safety net.
Above that is L1, where humans can manually override parameters via the CLI. Then L2, golden configurations based on historical benchmarks—the best parameter combinations ever run on this hardware, distilled from benchmark data and reused directly next time.
At L3a, if the device itself has enough compute, a built-in Go Agent can use a local model for simple tool-calling loops and make some decisions on its own. At the highest level, L3b, it connects to an external powerful Agent (like Claude) that can perform complex tuning, troubleshooting, and exploration.
Each level works independently. From L0 upward, they stack incrementally, with each layer overriding the one below. When you get a new device, even if you start at L0, it can climb higher on its own as knowledge accumulates and network access becomes available.
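The stacking behavior can be sketched as an ordered merge of partial configs. This is a minimal illustration of the override semantics described above, assuming each level produces a flat key-value set; AIMA's real configuration model is richer than this.

```go
package main

import "fmt"

// effectiveConfig merges partial parameter sets in order; later
// (higher) levels override earlier ones. An absent level is just an
// empty map, so L0 alone still yields a complete, working config.
func effectiveConfig(layers ...map[string]string) map[string]string {
	merged := map[string]string{}
	for _, layer := range layers { // ordered L0 -> L3
		for k, v := range layer {
			merged[k] = v
		}
	}
	return merged
}

func main() {
	l0 := map[string]string{"engine": "llama.cpp", "ctx": "4096"} // compiled-in defaults
	l1 := map[string]string{"ctx": "8192"}                        // human CLI override
	l2 := map[string]string{"engine": "vllm"}                     // golden config from benchmarks
	cfg := effectiveConfig(l0, l1, l2)
	fmt.Println(cfg["engine"], cfg["ctx"]) // vllm 8192
}
```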
One design principle here is "exploration as knowledge": every exploration the Agent makes—tuning parameters, troubleshooting, deployment attempts—produces structured Knowledge Notes that are written back to the knowledge base. Other devices' Agents can directly reuse this knowledge, skipping known failure paths and starting from the optimal point.
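One way to picture "skipping known failure paths": notes record what was tried and how it went, and candidate configurations that already failed on the same hardware are pruned before the next exploration. The `KnowledgeNote` shape and function below are hypothetical, not AIMA's actual note format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// KnowledgeNote is a hypothetical shape for the structured record an
// exploration leaves behind; the real note format is AIMA's, not this.
type KnowledgeNote struct {
	Hardware string            `json:"hardware"`
	Params   map[string]string `json:"params"`
	Outcome  string            `json:"outcome"` // "ok" or a failure reason
}

// pruneKnownFailures drops candidate parameter sets that some device
// already tried and saw fail on the same hardware.
func pruneKnownFailures(candidates []map[string]string, notes []KnowledgeNote, hw string) []map[string]string {
	failed := map[string]bool{}
	for _, n := range notes {
		if n.Hardware == hw && n.Outcome != "ok" {
			key, _ := json.Marshal(n.Params) // map keys marshal sorted, so keys are stable
			failed[string(key)] = true
		}
	}
	var kept []map[string]string
	for _, c := range candidates {
		key, _ := json.Marshal(c)
		if !failed[string(key)] {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	notes := []KnowledgeNote{
		{Hardware: "rtx4060", Params: map[string]string{"ctx": "32768"}, Outcome: "failed: OOM"},
	}
	candidates := []map[string]string{{"ctx": "32768"}, {"ctx": "8192"}}
	fmt.Println(len(pruneKnownFailures(candidates, notes, "rtx4060"))) // 1
}
```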
The more devices you use, the more knowledge accumulates, and the easier each subsequent device becomes. The input to this loop is tokens and idle compute; the output is an ever-thicker knowledge base.
LAN Is the Cluster
Another thing worth mentioning: Fleet management.
AIMA uses mDNS for automatic LAN discovery. You put five AI devices with different hardware in your office—no need to configure IPs, no need for a registry center. They discover each other automatically.
# Discover AIMA devices on the LAN
aima discover
# Execute tools remotely
aima fleet exec <device-id> hardware.detect
aima fleet exec <device-id> deploy.list

Every device exposes the same set of MCP tools, with remote and local access following the same path. For an Agent, managing one device or a fleet is essentially the same.
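The routing idea can be sketched as follows: every device answers the same tool names, and `fleet exec` simply picks which device to invoke. The `Device` shape and stub calls below are illustrative; real discovery uses mDNS service records and real remote calls go over the network.

```go
package main

import "fmt"

// Device is an illustrative handle for one discovered AIMA node.
// Call invokes a named MCP tool on the device; for a remote node it
// would go over the network, for the local node straight into the
// in-process tool registry. Both are stubbed here.
type Device struct {
	ID   string
	Call func(tool string) (string, error)
}

// fleetExec routes a tool invocation to a device by ID. The call
// itself is identical whether the device is local or remote.
func fleetExec(devices map[string]Device, id, tool string) (string, error) {
	d, ok := devices[id]
	if !ok {
		return "", fmt.Errorf("unknown device %q", id)
	}
	return d.Call(tool)
}

func main() {
	fleet := map[string]Device{
		"dev-01": {ID: "dev-01", Call: func(tool string) (string, error) {
			return "dev-01 ran " + tool, nil // local stub
		}},
		"dev-02": {ID: "dev-02", Call: func(tool string) (string, error) {
			return "dev-02 ran " + tool, nil // would be a remote MCP call
		}},
	}
	out, _ := fleetExec(fleet, "dev-02", "hardware.detect")
	fmt.Println(out) // dev-02 ran hardware.detect
}
```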
Less Code
Finally, a word on design philosophy.
AIMA's Go code has a few hard constraints: no code branches for engine types, no code branches for hardware vendors. Engine behavior, model metadata, and container access configurations for hardware all live in YAML.
Container lifecycles are handled by K3S, GPU memory slicing by HAMi. AIMA's scope is narrow: generate Pod YAML from knowledge, kubectl apply, query status.
Less code means the project is easier for AI to understand, and changes don't burn as many tokens. This aligns with my lesson from writing 300,000 lines of code in 10 days and then deleting it all—every line of code is a liability. As AI becomes more capable, having as little code as possible becomes an advantage: AI can participate in the project more fluently.
Open Source
AIMA is open-sourced on GitHub under the Apache 2.0 license. A single Go binary, cross-platform compilation, ready to use out of the box.
We will also offer a companion AIMA Service, so that when devices run into issues, they can be resolved remotely. Together, the goal is to drive the TCO of AI inference devices down to little more than the hardware, electricity, and a bit of token money.
High TCO means devices won't be fully utilized, and the market won't grow. Compute is scarce right now; even an old device that can still run inference is wasted if it sits idle.
