The Software Bug AI Can't Find

Coding is not the hard part.

Any reasonably skilled developer can implement a described requirement. Given a clear specification, the implementation follows. That has always been true, and AI has made it more true — the mechanical work of turning a description into working code is now faster and cheaper than it has ever been.

The hard part is something different. It is thinking of everything that can go wrong. The data combination nobody anticipated. The sequence of operations that seemed impossible until a user found it. The edge case that only appears when two legitimate business scenarios collide in a way nobody modeled. These are not coding failures. They are imagination failures — and no developer, no team, no AI has ever been immune to them.

The industry learned this lesson once before. Waterfall's central assumption was that requirements could be fully specified before building began — that if you thought hard enough upfront, you could anticipate everything. It couldn't be done. The act of building revealed what nobody knew before building started. Scenarios emerged from real usage that no specification session had surfaced. The industry eventually accepted this and moved on.

The same assumption lives inside distributed architectures, one level down. You cannot anticipate every failure mode before the system meets real data. The question is not how to eliminate that gap — you cannot. The question is: when reality finds the gap, how fast does the system tell you?

There are two possible answers. The system fails loudly — the operation stops, nothing partial is committed, the error is visible, the developer finds it, it gets fixed. Or the system fails silently — the operation appears to succeed, something partial is committed somewhere, the inconsistency enters the data, and nobody knows.

Loud failure is not a side effect of good architecture. It is a feature — the mechanism by which a system corrects its own gaps as reality reveals them. It needs to be deliberately designed in. And a surprising number of the technology choices the industry has normalised over the last decade quietly design it out.

Everything that follows is a consequence of that distinction.

The Transaction Is Not a Technical Detail

A single atomic transaction is the simplest possible implementation of loud failure.

Something unexpected happens. The transaction fails. Everything inside the consistency boundary rolls back — the order wasn't created, the inventory wasn't reduced, the invoice wasn't generated. The state before the operation is restored exactly. The user sees an error. A developer looks at the error. They find the unconsidered scenario. They fix it. The feedback loop is hours, not months. The system's integrity was never compromised — just its availability, temporarily, for one specific operation.

That is not a bug. That is the system working correctly under unexpected conditions — surfacing a gap in understanding at the cheapest possible moment, before anything was lost and before the inconsistency had a chance to compound.

This is why technology choices for enterprise applications are not preferences. They are engineering decisions with structural consequences. A relational database brings three decades of battle-tested infrastructure for loud failure: non-nullable constraints, unique constraints, foreign key constraints, check constraints. These are not convenience features. They are a validation layer that lives closer to the data than any application code ever will, enforced regardless of which service forgot to set a field, regardless of which event handler failed to fire. The database simply refuses. Loudly. Immediately.
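A minimal sketch of that behaviour, using Python's built-in sqlite3 module. The schema, the table names and the `place_order` helper are hypothetical and exist only for illustration: an order is inserted and stock is reduced inside one transaction, and a CHECK constraint refuses a combination the application code never validated.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical orders/inventory schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
    conn.executescript(
        """
        CREATE TABLE inventory (
            sku   TEXT PRIMARY KEY,
            stock INTEGER NOT NULL CHECK (stock >= 0)
        );
        CREATE TABLE orders (
            id  INTEGER PRIMARY KEY,
            sku TEXT NOT NULL REFERENCES inventory (sku),
            qty INTEGER NOT NULL CHECK (qty > 0)
        );
        INSERT INTO inventory (sku, stock) VALUES ('widget', 5);
        """
    )
    return conn


def place_order(conn: sqlite3.Connection, sku: str, qty: int) -> None:
    """Insert the order and reduce stock as one atomic unit."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute("INSERT INTO orders (sku, qty) VALUES (?, ?)", (sku, qty))
        conn.execute(
            "UPDATE inventory SET stock = stock - ? WHERE sku = ?", (qty, sku)
        )


if __name__ == "__main__":
    conn = open_db()
    place_order(conn, "widget", 2)  # anticipated case: succeeds
    try:
        place_order(conn, "widget", 8)  # nobody validated qty against stock
    except sqlite3.IntegrityError as exc:
        print("rejected loudly:", exc)
    print("orders:", conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])
    print("stock:", conn.execute("SELECT stock FROM inventory").fetchone()[0])
```

Nothing in `place_order` compares the quantity with the available stock. The refusal comes from the database, and the rollback removes the order row that had already been inserted before the constraint fired.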

Choosing to move away from a relational database is a legitimate engineering decision in specific circumstances. But it is not a neutral one. Every constraint the database was enforcing either moves into the application — where it is less reliable, harder to find, and maintained by people who may not know why it exists — or it disappears entirely, replaced by the hope that nobody will generate the data combination it was preventing. The validation does not vanish. It relocates, or it becomes invisible. Both outcomes are a step toward silent failure.

Choosing a technology because it is popular, because a large company published a paper about it, because it appeared at a conference — without asking what properties it provides and what properties it removes — is not engineering. It is fashion. And in enterprise software, fashion has structural consequences that surface years later in production data nobody can explain.

What Happens When You Distribute

Now take the same unexpected scenario and run it through a distributed system.

Service A processes its part and commits. An event fires. Service B receives it and fails — not because of bad code, but because this specific combination of data was never anticipated. Compensation logic can recover consistency, but only for the scenarios it was written to handle. Nobody wrote compensation logic for this combination, because nobody anticipated it. Service A has committed. Service B has not. The state is now inconsistent, and recovery now depends on logic whose correctness must itself be proven — for a scenario that, by definition, nobody saw coming.
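The sequence above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: two separate in-memory SQLite databases stand in for Service A and Service B, a plain list stands in for the event queue, and a second list stands in for a dead-letter queue. It is not a model of any particular messaging product.

```python
import sqlite3


def make_order_service() -> sqlite3.Connection:
    """Service A: owns the orders table in its own database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, sku TEXT NOT NULL, qty INTEGER NOT NULL)"
    )
    return db


def make_inventory_service() -> sqlite3.Connection:
    """Service B: owns the inventory table in a different database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE inventory (sku TEXT PRIMARY KEY, stock INTEGER NOT NULL CHECK (stock >= 0))"
    )
    db.execute("INSERT INTO inventory (sku, stock) VALUES ('widget', 5)")
    db.commit()
    return db


def service_a_place_order(orders_db, queue, sku, qty) -> str:
    with orders_db:  # Service A's local transaction commits here
        orders_db.execute("INSERT INTO orders (sku, qty) VALUES (?, ?)", (sku, qty))
    queue.append({"sku": sku, "qty": qty})  # event published after the commit
    return "accepted"  # this is all the user ever sees


def service_b_handle(inventory_db, event, dead_letters) -> None:
    try:
        with inventory_db:  # Service B's own, separate transaction
            inventory_db.execute(
                "UPDATE inventory SET stock = stock - ? WHERE sku = ?",
                (event["qty"], event["sku"]),
            )
    except sqlite3.IntegrityError as exc:
        # B rolls back its own work, but it cannot roll back A's commit.
        dead_letters.append((event, str(exc)))


def run_scenario(qty: int) -> dict:
    orders_db, inventory_db = make_order_service(), make_inventory_service()
    queue, dead_letters = [], []
    reply = service_a_place_order(orders_db, queue, "widget", qty)
    while queue:
        service_b_handle(inventory_db, queue.pop(0), dead_letters)
    return {
        "reply": reply,
        "orders": orders_db.execute("SELECT COUNT(*) FROM orders").fetchone()[0],
        "stock": inventory_db.execute("SELECT stock FROM inventory").fetchone()[0],
        "dead_letters": len(dead_letters),
    }


if __name__ == "__main__":
    print(run_scenario(8))  # an order for 8 against a stock of 5
```

Service B's constraint still fires, so B protects its own data. What is lost is the shared boundary: the order row in Service A survives, the caller was told "accepted", and the only trace of the problem is an entry in a queue that somebody has to be watching.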

The user may not even see an error. The system appears to have worked.

The inconsistency is now in production. Downstream services are making decisions based on it. Reports are being generated from it. Other operations are building on top of it. And nobody knows, because the system did not fail — it partially succeeded, which is the failure mode that distributed architectures are structurally unable to surface cleanly.

Eighteen months later, someone notices the numbers don't add up. Or a customer calls about an order that shows as delivered but was never shipped. Or an audit finds financial records that contradict each other. The forensic work to trace that back to its origin — through eighteen months of events, across service boundaries, through compensation logic written by someone who left a year ago — is enormous. The fix is not a code change. It is a data integrity project, with permanent uncertainty about what the correct state actually was.

The distributed system did not prevent the bug. It prevented the bug from being visible. Which is the worst possible trade — because loud failure is the mechanism the system uses to learn. Remove it and the system stops teaching. It just accumulates.

The Part the "It Works" Argument Misses

Here is where the reasonable objection comes in. Many distributed systems do work. Microservices applications run in production for years without the failure mode described above ever materialising. If yours is one of them, the argument so far probably seems theoretical.

It is not theoretical. It is probabilistic — and the probability scales directly with the thing you most want to scale.

Small application. Bounded domain. Limited entities, limited relationships, limited users, limited lifespan. The space of possible data combinations is small. The unconsidered scenario may simply never arrive before the system is retired. "It works" is genuinely true, start to finish. There is nothing to argue with.

Now scale the application. More entities. More relationships. More users generating more combinations over more years. The space of possible data combinations grows faster than the team grows. The probability of hitting an unconsidered scenario does not stay constant; it compounds with every additional operation. At sufficient scale, over sufficient time, it stops being a remote risk and approaches certainty.
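One rough way to see the compounding, under simplifying assumptions that are not part of the argument above: suppose each operation independently has some small, fixed probability p of hitting an unconsidered scenario, and the system performs n operations over its lifetime.

```latex
P(\text{at least one unconsidered scenario in } n \text{ operations})
  = 1 - (1 - p)^{n} \approx 1 - e^{-pn},
\qquad
\lim_{n \to \infty} \bigl[\, 1 - (1 - p)^{n} \,\bigr] = 1
  \quad \text{for any fixed } p > 0 .
```

With p = 10^-6, ten thousand operations give roughly a 1% chance of at least one hit, and ten million give about 99.995%. Real systems do not satisfy the independence assumption, so the figures are illustrative only. The shape of the curve is the point: a per-operation risk that looks negligible does not stay negligible as n grows.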

Which means the architectural choice that feels safe for a small system becomes a liability that scales directly with the size and longevity of the application. The system the organisation most wants to protect — the large, long-running, business-critical application — is exactly the system where silent failure becomes a certainty rather than a possibility.

The developer who says "so what, it works" is describing a small system. They are right. They just don't realise that is what they are describing.

AI Accelerates the Accumulation

This is where the current moment makes the stakes undeniable.

AI has the same imagination failure every human developer has. It implements what it was asked to implement. It does not anticipate the data combination that wasn't in the prompt. It does not model the collision between two legitimate business scenarios nobody thought to describe. And it generates code at a velocity that outpaces the domain understanding feeding it — accumulating unconsidered scenarios faster than any human team ever could.

There is a subtler problem underneath that one. Writing code for a complex domain is not just implementation. It is how understanding develops. When a requirement does not fit cleanly, when the same logic appears in three places, when a method grows in ways that resist being read — that resistance is signal. The domain is surfacing a gap. The friction is the feedback loop by which an engineer's understanding deepens and the model gets corrected. AI used as an implementer absorbs that resistance. The code gets written. The discomfort never arrives. The lesson was in the discomfort.

This is not a new failure mode. It is the continuation of a trend the industry has been on since framework-dictated development became the norm — where pre-packaged architectural recipes replaced structural reasoning, and engineers learned to fill in templates rather than interrogate structure. AI-as-implementer is the same dynamic, one abstraction level higher, running faster. The gap between working software and understood software was already widening before AI arrived. AI inherited that gap and accelerated it.

In a system with a coherent consistency boundary, this matters less at the architectural level. The unconsidered scenario still fails loudly — AI-generated or not. The transaction fails, the error surfaces, the gap is found and fixed. The system remains self-correcting even when the engineer's understanding was incomplete.

In a distributed system built at AI velocity, the unconsidered scenario fails silently — at a rate no previous generation of development ever achieved. The events queue. The inconsistencies compound. The data drifts. And the diagnosis, when it finally comes, will be what it always was: the domain was complex, the requirements changed, the previous developers were careless.

Not: we built at a speed that outran our understanding, into an architecture that was designed to hide what we didn't know.

Why the Industry Got Here

Nobody chose this deliberately. That is worth saying plainly before any diagnosis.

Public software discourse is necessarily shaped by practices that can be taught, repeated, and verified at scale. The patterns that dominate conference talks, blog posts, job descriptions, and interview questions are the ones legible enough to transfer reliably between practitioners — not necessarily the ones that produce systems which remain coherent over a decade. That is not a criticism of the people involved. It is how knowledge disseminates in any field where the most consequential outcomes take years to become visible.

And because every system is built once — the alternative approach is never built alongside it, so the cost of the wrong choice is never directly observable — the field cannot easily learn from its own experience. When a system develops data integrity problems, the cause gets attributed to domain complexity or changing requirements. Almost never does anyone conclude that the architecture was the variable, because there is no control group to compare it to. The unfalsifiability problem keeps the signal from reaching the people who most need it.

Microservices, event-driven architecture, NoSQL databases: each originated as a genuine response to a genuine problem at genuine scale. Each got adopted as a default by teams who never encountered the scale problem the pattern was designed to solve, chosen not for its engineering properties but for its cultural visibility. And each, in its own way, does the same thing: it relocates the signal.

The code stops complaining. The architecture absorbs the contradiction without surfacing it. The problem does not go away — it moves to the production data, two years out, in a form that is harder to find, harder to trace, and harder to fix than the loud failure it replaced.

Adopted this way, these patterns are not solutions to the underlying problem. They are ways of making the underlying problem less observable. Which, at sufficient scale, is worse.

Engineering for Properties, Not Popularity

The correction is not a methodology. It cannot be certified. It is a discipline of asking a question that the industry has largely stopped asking: what properties does this technology choice provide, and what properties does it remove?

A relational database provides transactional consistency, referential integrity, and constraint enforcement as structural guarantees — not as features to be implemented, but as properties of the system that exist regardless of what any individual piece of code does. Removing it in favour of a document store or a distributed data layer removes those guarantees. They do not vanish. They become engineering problems to be solved in the application, maintained indefinitely, by teams who may not fully understand why they are there.

A single deployable unit with a coherent consistency boundary provides loud failure for free. Splitting that unit across services and event queues removes it. Sagas and compensation logic can partially recover it — for the scenarios that were anticipated. For the ones that weren't, which is the only class of scenario this article has been about, recovery depends on logic that by definition could not have been written yet.
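A small sketch of that limit, in Python. The orchestrator below is a generic saga runner: each step has a compensating action, and when a step fails the completed steps are compensated in reverse order. The step names, the `state` dictionary and the failure scenario are all hypothetical. The anticipated failure (a declined card) is handled cleanly. The unanticipated one (a customer record with no email, which fails after the money has already moved) is not.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


class PaymentDeclined(Exception):
    """The failure the team anticipated when writing the saga."""


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]
    compensate: Callable[[dict], None]


def run_saga(steps: list[Step], state: dict) -> str:
    completed: list[Step] = []
    for step in steps:
        try:
            step.action(state)
        except Exception as exc:
            # Undo every step that completed. The failing step itself is
            # assumed to have had no effect, which is the hidden assumption.
            for prior in reversed(completed):
                prior.compensate(state)
            return f"rolled back at {step.name}: {type(exc).__name__}"
        completed.append(step)
    return "completed"


def reserve_stock(state: dict) -> None:
    state["stock"] -= state["qty"]


def release_stock(state: dict) -> None:
    state["stock"] += state["qty"]


def charge_card(state: dict) -> None:
    if state["card_limit"] < state["amount"]:
        raise PaymentDeclined("insufficient funds")  # raised before any money moves
    state["charged"] += state["amount"]  # money has moved
    state["receipts"].append(state["email"].lower())  # fails if email is None


def refund_card(state: dict) -> None:
    state["charged"] -= state["amount"]


ORDER_SAGA = [
    Step("reserve_stock", reserve_stock, release_stock),
    Step("charge_card", charge_card, refund_card),
]


def new_state(**overrides) -> dict:
    state = {"stock": 5, "qty": 2, "amount": 40, "card_limit": 100,
             "charged": 0, "email": "a@example.com", "receipts": []}
    state.update(overrides)
    return state


if __name__ == "__main__":
    for case in (new_state(card_limit=10), new_state(email=None)):
        print(run_saga(ORDER_SAGA, case), case["stock"], case["charged"])
```

In both cases the saga reports a rollback and the stock is restored. In the second case the customer has also been charged, and nothing in the system says so. The compensation logic is correct for every failure its authors imagined.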

These are not preferences. They are structural choices with structural consequences, most of which arrive too late to be traced back to the decision that produced them.

The organisation that chooses its database because a large tech company uses it, its architecture because it was the subject of last year's conference circuit, and its framework because it is what the available developers already know is not making engineering decisions. It is making fashion decisions and calling them engineering. The difference between the two is not visible on the day the choice is made. It is visible in the production data, two years later, when the unconsidered scenario finally arrives and the system has no mechanism to surface it.

A coherent contextual center — a domain model that keeps what belongs together in one place, behind a single consistency boundary, enforced by the database that was built to enforce such things — is the structural embodiment of the principle this article has been arguing for. Not because it is elegant. Because it keeps failure loud, keeps the feedback loop intact, and keeps the system capable of correcting itself as reality reveals what nobody knew upfront.

Which it will. It always does.

What to Do With This

If you are building a small application with a bounded scope and a limited lifespan, the considerations above matter less. The unconsidered scenario may never arrive. The combination space may never get large enough. Disposable software benefits from disposable development, and the current generation of tools makes disposable development faster than it has ever been.

If you are building something large, long-running, and business-critical — the kind of system that needs to remain correct through changing requirements and changing teams over years — the question worth asking before framework selection, before architecture diagrams, before any technology choice is made, is this: what is the failure mode of this decision, and when will it announce itself?

Choose technologies for what they provide, not for what they are associated with. Prefer the ones that keep failure loud and immediate over the ones that keep it quiet and deferred. Accept that no amount of upfront specification eliminates the unconsidered scenario — and build systems that surface it fast, correct it cheaply, and carry the correction forward in a form that survives the next team.

The loudest signal that this discipline is absent will not come from the code. It will come from the data, later than you expect, in a form that is harder to explain than a bug and more expensive to fix than a refactor.

The best time to build for loud failure was at the start of the project. The second best time is before the data starts lying to you.

This article is part of a series on software engineering craft. Other pieces examine the rich domain model as a discipline, the properties of enterprise software that lasts, and why the practices that prevent structural decay are the same ones that always prevented it.
