I've run into a few interesting things during development recently, and the wider world has had no shortage of news either. This era hasn't slowed down; it's still moving forward at an astonishing pace. Here are a few scattered thoughts.
Coding Agents Are Better at Debugging Than Writing Code
Many posts and reports are discussing one question: what's the easiest problem to run into when using a coding agent to write code? The answer is bugs.
But from the actual experience of using them, I think this is precisely backwards.
Current coding agents perform better at debugging and troubleshooting than at writing code. The reason isn't complicated: the goal of debugging is clear, usually reproducible, and can be broken down and verified step by step. This is the kind of work AI handles smoothly, and much faster than humans. What's truly challenging, on the contrary, is getting it to implement something complete from scratch—especially when what you want to build is a product.
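What makes debugging so agent-friendly is that a reproduction is an executable contract: it fails before the fix and passes after, so the agent can iterate against it mechanically. A minimal sketch (the off-by-one bug and function names are hypothetical, just to illustrate the loop):

```python
# A reproducible failure gives an agent a concrete, checkable target.
# Hypothetical example: an off-by-one slicing bug and the check that pins it down.

def take_last(items, n):
    """Return the last n items. Buggy: the slice starts one element too late."""
    return items[-n + 1:] if n else []

def take_last_fixed(items, n):
    """Corrected version: slice from -n, handling n == 0 explicitly."""
    return items[-n:] if n else []

# The reproduction is the contract: it fails before the fix, passes after.
items, n, expected = [1, 2, 3, 4], 3, [2, 3, 4]
assert take_last(items, n) != expected        # bug reproduced
assert take_last_fixed(items, n) == expected  # fix verified
```

Writing a product has no such contract: there is no single assertion that tells you the design is right.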
The Deep Waters of Productization
Traditional software development has so many processes—unit testing, integration testing, stress testing, canary releases, Alpha, Beta—not because people love process, but because once software is put in front of real users, it exposes all kinds of problems you could never have anticipated in the code. Best practices can reduce the probability of issues, but they can't eliminate them. Only time and the pressure of real usage can force problems to the surface, to be solved one by one.
This challenge is the same for coding agents.
Building a small component from scratch is fast. Prototype validation feels great. But as the product solidifies and the codebase grows, problems emerge: the larger the project, the easier it is for AI to break things, and the cost of understanding context rises visibly. This is the same principle as when humans maintain large projects—after a product is built, the difficulty of iteration naturally increases, and it takes team division of labor to bear the load. AI agents can't escape this law either.
So I think the current state is this: validating concepts is fast and satisfying. But turning a concept into a product requires deep thinking and validation at every step—nothing can be skipped.
Creativity-Driven Open Source
But there's one category of direction where AI is genuinely especially capable.
Recently, Milla Jovovich—the star of The Fifth Element and the Resident Evil series—spent several months working with engineer Ben Sigman to build an open-source AI memory system called MemPalace using Claude Code. It was pushed to GitHub on April 5th, gained 7000+ stars within 48 hours, and has now surpassed 22,000.
In evaluations on LongMemEval, MemPalace achieved an R@5 of 96.6%, far surpassing paid solutions like Mem0 and Zep, which sit at around 85%. It runs entirely locally, uses ChromaDB + SQLite, is under the MIT license, and is completely free.
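For readers unfamiliar with the metric: R@5 is recall at top 5, the fraction of relevant memories that show up among the first five retrieved results, averaged over queries. A minimal sketch of the computation (this illustrates the metric only; it is not MemPalace's actual evaluation code, and the query data is made up):

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return sum(1 for r in relevant if r in top_k) / len(relevant)

# Averaged over queries to get a benchmark-style score.
queries = [
    (["m1", "m2", "m3", "m4", "m5", "m6"], {"m2", "m6"}),  # m6 falls outside the top 5
    (["m7", "m8", "m9"], {"m7"}),
]
score = sum(recall_at_k(r, rel) for r, rel in queries) / len(queries)
print(round(score, 2))  # 0.75
```

A 96.6% R@5 means that, for almost every query, the relevant memory is sitting in the top five candidates the system hands back.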
AI memory is indeed a direction that has drawn a lot of attention this year. But MemPalace's strength isn't in its complexity—on the contrary, it doesn't win through complexity but through creativity. It's about focusing tightly on a single goal, such as hitting a benchmark, and then figuring out how to do it well.
This model is especially well-suited for AI assistance. The more focused the problem and the more it relies on ideas rather than engineering volume, the faster AI can help you validate it. There are more and more projects like this in the open-source world, and I think it's a very interesting direction.
No Software Is Secure
The biggest news these past couple of days is Anthropic's official unveiling of Project Glasswing.
Their next-generation model, codenamed Mythos internally, was prematurely exposed at the end of March due to an internal data leak (a CMS configuration issue that accidentally exposed roughly 3000 internal documents) and officially announced on April 7th. The model's capabilities in software security have reached a level that even Anthropic itself is hesitant to release.
Previous models could find vulnerabilities—that's nothing new in the industry. But converting a vulnerability into a usable attack vector is a completely different matter. Mythos combines these two steps.
Data disclosed by Anthropic is alarming: Mythos discovered thousands of zero-day vulnerabilities across all mainstream operating systems and browsers, including one that had been hiding in OpenBSD for 27 years. The kind of vulnerability that might have gone undiscovered for decades is now being unearthed by a model—and can be directly turned into an attack tool.
Basically, in the face of this model, no software is secure.
Anthropic's assessment is that this model cannot be publicly released. They reached out to roughly 45 companies—including Apple, Google, Microsoft, Nvidia, AWS, plus CrowdStrike, Palo Alto Networks, Cisco, the Linux Foundation, and others—to let them use Mythos early to harden their own systems. The logic is straightforward: before it becomes a spear, let it serve as a shield.
OpenAI isn't having an easy time either. GPT-5.4 became the first general-purpose model rated as a "high cybersecurity risk" by OpenAI's own Preparedness Framework. From GPT-5 to GPT-5.4, the model's score in CTF (Capture The Flag) competitions jumped from 27% to 76%. OpenAI chose to add a layer of safety protections and release it as usual—a different approach from Anthropic's, but the problem they face is the same: the offensive capabilities of models are growing exponentially.
I had a sense this was happening before. With Mythos's emergence, it's basically confirmed. And this isn't just a software-domain issue—when something in a new dimension develops at a completely unexpected speed, many of the surrounding supporting structures fall out of alignment. Regulations can't keep up, organizations can't keep up, and security systems can't keep up.
Build the Framework First
These developments have also influenced my thinking about building products.
We've been discussing internally whether some of our product positioning is too aggressive—for example, designs that let AI fully autonomously manage certain processes. If the model isn't smart enough yet and always requires human intervention, then that design doesn't really hold up at the moment.
But from another angle, perhaps product design should run slightly ahead of the models.
This is exactly Anthropic's approach to building products internally. For their Chrome extensions and Excel add-ins, they start with an idea, build out a framework, and then test each new generation of models against it to see how far it can go. They wait, and wait, and one day, when the model looks almost ready, they invest heavily in productization and release.
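One way to read this workflow: the framework and its task suite stay fixed, and each model generation is plugged in and scored against the same bar. A hedged sketch of that shape (the harness, threshold, suite, and toy "models" below are all hypothetical, standing in for real API clients and real product tasks):

```python
from typing import Callable, List, Tuple

# Hypothetical readiness harness: the product framework is fixed; each
# model generation is swapped in and scored against the same task suite.
Model = Callable[[str], str]  # prompt -> answer; stand-in for a real API client

def readiness(model: Model, suite: List[Tuple[str, str]]) -> float:
    """Fraction of framework tasks the model completes correctly."""
    passed = sum(1 for prompt, expected in suite if model(prompt) == expected)
    return passed / len(suite)

SUITE = [("2+2", "4"), ("capital of France", "Paris")]
SHIP_THRESHOLD = 0.9  # invented cutoff for "invest in productization"

gen_1: Model = lambda p: "4" if p == "2+2" else "?"                   # early model
gen_2: Model = lambda p: {"2+2": "4", "capital of France": "Paris"}[p]  # later model

for name, model in [("gen_1", gen_1), ("gen_2", gen_2)]:
    score = readiness(model, SUITE)
    print(name, score, "ship" if score >= SHIP_THRESHOLD else "wait")
```

The point of the pattern is that the only thing that changes between evaluations is the model behind the interface, so "is it ready yet" becomes a number you can re-check with every release.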
If you design products based on current model capabilities, they'll likely be outdated by the time they launch. It's better to be slightly more aggressive: get the architecture right first, and once the engine is ready, the whole thing will naturally come together. Build it when you think of it, then wait a while.
The Game Continues
One final piece of good news.
Zhipu's GLM-5.1 was officially open-sourced in early April under the MIT license, with weights fully public. It scored 58.4 on SWE-Bench Pro, surpassing GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro. And at the same time, they raised prices by 10%—while the rest of the industry is in a price war, raising prices against the trend is an interesting move in itself.
When it comes to the rules of the open-source game, no one has retreated yet. A welcome sight.
