I recently refactored Aima Service. It took a little over a week, and a lot of thoughts came up along the way, so I'm writing them down.
Emergency Room
Aima Service isn't a tool you open every day. It's more like an emergency room for devices—you don't think about it normally, only when something goes wrong.
This creates an awkward situation: users feel nothing when it succeeds. The problem gets solved, they go "Hmm, okay," and leave. Since they never experienced the pain, they naturally don't realize how big a problem was just fixed.
Failures, on the other hand, are interesting.
There are several kinds of failure. One: the agent tries hard but can't fix the problem, and tells you the task failed. Another: it claims success, but when you check, nothing was fixed. That's a false success. But the worst is when the pipeline itself breaks: crashes, freezes, disconnects mid-run.
The first two are tolerable. Like going to the ER—the doctor looks at you for a while but can't cure you, so you go somewhere else. The last one is different. You dial 120 (China's emergency number), the ambulance comes, but when you arrive at the ER the doors are closed. Or you get inside, the registration system crashes, the doctor disappears halfway through, and someone comes out to tell you "Sorry, we're closed for today."
That's the one that actually breaks people.
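The three kinds of failure above can be written down as a small classification. This is only a sketch: the names `Outcome`, `classify`, and `is_tolerable`, and the three boolean inputs, are made up for illustration and are not taken from Aima Service's code.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    HONEST_FAILURE = "honest_failure"      # agent reports it could not fix the problem
    FALSE_SUCCESS = "false_success"        # agent claims success, but a check shows nothing was fixed
    PIPELINE_FAILURE = "pipeline_failure"  # crash, freeze, or disconnect mid-run


def classify(pipeline_completed: bool, agent_claimed_success: bool, verified_fixed: bool) -> Outcome:
    """Map one task run onto the failure kinds described in the text."""
    if not pipeline_completed:
        # The run never finished, so whatever the agent said is irrelevant.
        return Outcome.PIPELINE_FAILURE
    if not agent_claimed_success:
        return Outcome.HONEST_FAILURE
    if not verified_fixed:
        return Outcome.FALSE_SUCCESS
    return Outcome.SUCCESS


def is_tolerable(outcome: Outcome) -> bool:
    """The first two failure kinds are tolerable; a broken pipeline is not."""
    return outcome in {Outcome.HONEST_FAILURE, Outcome.FALSE_SUCCESS}
```

The pipeline check comes first on purpose: if the run itself died, there is no result to be honest or dishonest about.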
Decor Doesn't Matter, the Doctor Must Be There
Once you understand this, priorities become clear.
In an emergency room, no amount of luxury decor matters, no sofa is comfortable enough. Only two things matter: does the door open, and can the doctor treat patients.
The previous version was functionally adequate but unstable. Task runs were bug-ridden: constant crashes, all kinds of inexplicable freezes. The ER door was open, but the doctor wasn't there.
So we did a complete refactor. No new features—just rebuilding the foundation.
Two Models Taking Turns
The refactor used two models, Claude Code and Codex, taking turns.
First we designed the documentation system, had both models read the docs and code, and list out what needed to be done. Then Claude Code ran the first round of refactoring, after which Codex took over for the next round. Back and forth.
Why not just use one? We tried—it tends to drift. Claude Code has strong architectural sense, can see structural-level issues, but sometimes it's too conservative. Codex moves fast, dares to get its hands dirty, and will revisit details, but occasionally it's too coarse. Letting just one run amplifies its weaknesses. Alternating actually creates the best rhythm—problems found by one are often fixed by the other.
Later I looked it up—there's a NeurIPS 2025 paper called "Lessons Learned" that specifically analyzes how different LLMs can complement each other, concluding that around 3 agents works best, with diminishing returns beyond that. That matches the real-world feel.
Each round, the model spins up its own agent team to run in parallel. A single task taking three or four hours is normal. It's asynchronous anyway—humans do their own thing, occasionally checking in and asking a few questions.
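To make the back-and-forth concrete, here is a toy model of it. Nothing below calls a real model or tool: `model_a` and `model_b` are hypothetical stand-ins that each resolve only one kind of issue, and the to-do items are invented. The point is just that alternating rounds clear a list that a single model gets stuck on.

```python
from itertools import cycle, islice
from typing import Callable

# A "round" takes the list of open issues and returns the ones still open.
Round = Callable[[list[str]], list[str]]


def model_a(issues: list[str]) -> list[str]:
    """Hypothetical stand-in that only resolves structural issues."""
    return [i for i in issues if not i.startswith("structure:")]


def model_b(issues: list[str]) -> list[str]:
    """Hypothetical stand-in that only resolves detail-level issues."""
    return [i for i in issues if not i.startswith("detail:")]


def run_rounds(models: dict[str, Round], issues: list[str], max_rounds: int) -> list[tuple[str, int]]:
    """Cycle through the models, one round each, until nothing is open or rounds run out."""
    history = []
    for name in islice(cycle(models), max_rounds):
        if not issues:
            break
        issues = models[name](issues)
        history.append((name, len(issues)))  # who ran, and how many issues remain
    return history


# Invented to-do list for illustration only.
TODO = [
    "structure: split the task runner module",
    "detail: fix retry off-by-one",
    "structure: untangle worker dependencies",
    "detail: handle empty response",
]

alternating = run_rounds({"model_a": model_a, "model_b": model_b}, TODO, max_rounds=4)
solo = run_rounds({"model_a": model_a}, TODO, max_rounds=4)

print(alternating)  # [('model_a', 2), ('model_b', 0)]
print(solo)         # [('model_a', 2), ('model_a', 2), ('model_a', 2), ('model_a', 2)]
```

Real models overlap far more than these stubs do, so treat this as a picture of the rhythm rather than a measurement.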
CI/CD Ate More Time Than Expected
The functional refactor was mostly done in about a week. The real time sink came after.
The code changed massively. What catches you before going live? Automated testing, build checks, integration validation—none can be skipped. One pipeline run takes dozens of minutes; fix a problem, run again, another few dozen minutes. Then repeatedly tuning test cases and optimizing the pipeline—we wrote tens of thousands of lines of test code alone.
When people talk about AI writing code, they focus on "how fast." But if you're actually shipping to production, CI/CD and testing can eat up more than half the engineering effort. AI can do this part too, but it takes many iterations and lots of debugging. No rushing it.
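A back-of-the-envelope sketch shows why the fix-and-rerun loop adds up. The 30-minute run time and the count of 40 reruns are illustrative assumptions, not measurements from this project; only "dozens of minutes per run" comes from the experience above.

```python
def pipeline_hours(minutes_per_run: float, reruns: int) -> float:
    """Total pipeline wall-clock time, in hours, for a given number of full reruns."""
    return minutes_per_run * reruns / 60


# Assumed numbers: a 30-minute pipeline, rerun 40 times while fixing failures.
print(pipeline_hours(30, 40))  # 20.0
```

That is 20 hours of waiting on the pipeline alone, before counting the time spent diagnosing each failure.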
1.3 Million Lines
After the refactor, a report was generated. I looked at the numbers.
Functionally, nothing new was added, so it looks unremarkable. But the architecture went from "run a task, get a pile of bugs" to a design that can handle a six-figure user count. Modules are much cleaner, and distributed workers and overseas federation are both in place.
The code volume surprised me: roughly 1.3 million lines excluding docs, 1.7 million including docs. Eight or nine days, one non-specialist, two AI models.
Then I remembered something.
The Moat Has Dried Up
A popular idea in recent years: code volume is the moat of software companies. Millions of lines piled up—no one can copy it.
I remember reading about this back then and thinking it made sense. Cisco IOS XE has 190 million lines of code, maintained by over 3,000 people, shipping 700+ new features per year. SAP's ABAP codebase exceeds 250 million lines, and fewer and fewer people can read it. These companies were indeed propped up by "this thing is so complex, no one can replace it."
But thinking it through, code volume was never the whole moat. Cisco's moat is 190 million lines plus ecosystem, switching costs, and brand. SAP's moat isn't that ABAP is hard to write—it's that only 5% of its 425,000 customers migrated to S/4HANA within seven years. Lidl tried, burned €500 million, and gave up. Revlon lost $64 million in sales. The lock-in effect created by complexity is far more stubborn than the code itself.
But now it gets interesting. Earlier this year Marek Kowalkiewicz wrote a piece called "Drying the Moat". In it, he mentions that after Anthropic demonstrated AI could read and modernize COBOL systems, IBM lost $40 billion in market cap in a single day. Code complexity actually creates an "understanding asymmetry": you can't read my code, so you can't leave me. AI wipes out that asymmetry.
Looking back at my own experience: eight or nine days, one person with two models alternating, and a 1.3 million line system is running in production. With distributed architecture, with CI/CD pipelines.
1.3 million lines is obviously not Cisco's 190 million; we're two orders of magnitude apart. Ecosystem and customer lock-in aren't replaceable by code either. But the "code complexity" leg is already being pulled out. How long the remaining legs can hold is hard to say.
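The arithmetic behind "two orders of magnitude", plus the daily output implied by the figures quoted earlier, can be checked in a few lines. All inputs are the numbers already cited in this post; nothing here is new data.

```python
import math

OURS = 1_300_000          # lines of code, excluding docs
OURS_WITH_DOCS = 1_700_000
CISCO = 190_000_000       # the IOS XE figure cited above

ratio = CISCO / OURS
orders_of_magnitude = math.log10(ratio)

print(round(ratio, 1))                # 146.2
print(round(orders_of_magnitude, 2))  # 2.16

# Implied documentation volume and daily output over eight or nine days.
print(OURS_WITH_DOCS - OURS)          # 400000
print(OURS / 8)                       # 162500.0
print(round(OURS / 9))                # 144444
```

So the gap is a factor of about 146, and the refactor averaged somewhere between 144,000 and 163,000 lines per day.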
Act on It
Looking back at the whole process, a few takeaways.
AI can build product-grade software. What came out of this is running in production with real users—not a demo. The hard part is CI/CD and testing. These take time, but AI can do them too; it just needs more iterations.
Refactoring isn't so scary anymore. In the past, inheriting a mountain of legacy code meant weeks just to figure out what it was doing. Now a model can ingest it and draw an architecture diagram in minutes. We went from legacy mess to new architecture in a week, with technical debt cleaned up more thoroughly than it would have been by hand.
The biggest shift might be mental. Refactoring used to be a major decision—you'd calculate headcount, timeline, risk. Now, if the codebase can't keep up, just rebuild it. Same when new tech comes out: just rebuild with the new stack. Code is increasingly becoming a consumable.
Of course code is still the skeleton of the product. That hasn't changed. But the cost of producing that skeleton is no longer on the same scale as before.
