The Hidden Failure Modes of Event-Driven Systems
Event-driven systems look great in architecture diagrams.
Boxes emit events. Other boxes subscribe. Everything is decoupled and scalable.
The first time you try to debug one in production, the diagram stops being helpful.
A workflow fails halfway through. A service retries an event five times. Another service processes the same event twice. And the question everyone is asking is very simple:
What actually happened?
This article comes from debugging several production incidents in event-driven systems over the past few years. When events coordinate multi-step workflows, three failure modes appear again and again: hidden coupling, event ordering, and retry amplification.
Hidden Coupling
The first surprise teams run into is that decoupling is mostly an illusion. Services that communicate through events are coupled through event schemas. The payment service emits payment.completed with a specific structure. The notification service, the analytics pipeline, the fulfillment service, and the reconciliation job all depend on that structure.
When the payment service adds a field, nothing breaks. When it renames a field, everything breaks. When it changes the semantics of a field, everything breaks silently.
I've seen this happen with payment events. A service emitted payment.completed with an amount field. Originally it represented the net amount. Months later the service changed it to include tax.
Nothing crashed.
But the analytics pipeline suddenly reported a 13% revenue increase. The fulfillment service started calculating costs wrong. The reconciliation job caught the discrepancy three days later.
The consumers still parsed the event.
They just produced the wrong results.
Incidents like this rarely show up immediately. They surface days later as accounting discrepancies, reconciliation mismatches, or strange analytics spikes.
This is worse than synchronous coupling. With a direct API call, a breaking change produces an immediate error. With events, a breaking change produces wrong data that looks correct: the notification service shows the right number because it only displays it, while the analytics pipeline quietly computes the wrong revenue.
The coupling didn't disappear. It moved from compile-time to runtime, from immediate to delayed, from visible to invisible.
Event-driven systems don't remove complexity.
They hide it.
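One partial defense against semantic drift is to make the consumer's assumptions about a payload executable. A minimal sketch, assuming a hypothetical `payment.completed` shape where the field names and the invariant (net + tax = gross) are illustrative, not a real schema:

```python
from dataclasses import dataclass

@dataclass
class PaymentCompleted:
    order_id: str
    net_amount_cents: int    # amount excluding tax; semantics encoded in the name
    tax_cents: int
    gross_amount_cents: int  # net + tax, carried redundantly as a cross-check

def parse_payment_completed(payload: dict) -> PaymentCompleted:
    event = PaymentCompleted(
        order_id=payload["order_id"],
        net_amount_cents=payload["net_amount_cents"],
        tax_cents=payload["tax_cents"],
        gross_amount_cents=payload["gross_amount_cents"],
    )
    # A semantic change upstream (e.g. net silently becoming gross)
    # now fails loudly instead of producing plausible wrong numbers.
    if event.net_amount_cents + event.tax_cents != event.gross_amount_cents:
        raise ValueError(f"inconsistent amounts for order {event.order_id}")
    return event
```

Redundant fields plus a cross-check turn a silent semantic change into an immediate parse error, which is exactly the failure mode synchronous calls give you for free.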
This problem gets worse when services maintain their own projections of shared state.
The order service emits order.cancelled.
The fulfillment service hasn't processed it yet because it's working through a backlog.
The order ships anyway.
The window between event emission and consumption is a consistency gap that grows during incidents, backlogs, or deployments.
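One way to keep the gap from causing irreversible actions is to version the projection and refuse the action when the local view is known to be behind. A minimal sketch; the class and field names are illustrative assumptions, and the per-order version number is assumed to be assigned by the producer:

```python
class Projection:
    """A consumer's local view of an order, built from events."""
    def __init__(self):
        self.version = 0          # last event sequence number applied
        self.status = "pending"

    def apply(self, event: dict):
        # Skip duplicates and out-of-order deliveries that are already applied.
        if event["version"] <= self.version:
            return
        self.version = event["version"]
        self.status = event["status"]

def safe_to_ship(projection: Projection, known_latest_version: int) -> bool:
    # Refuse the irreversible step if the local view lags the source of truth:
    # a cancellation may still be sitting in the backlog.
    if projection.version < known_latest_version:
        return False
    return projection.status == "paid"
```

The check doesn't close the consistency gap; it makes the gap block the shipment instead of letting the order ship anyway.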
Event Ordering
Events arrive in the order the broker delivers them, not the order the business process requires.
In a simple publish-subscribe pattern, this is manageable. But when events coordinate a multi-step workflow, ordering becomes critical. Step B depends on step A completing first. If the completion event for step A is delayed or redelivered, step B either runs with stale state or fails in ways that are difficult to trace.
Partition-based ordering (Kafka's approach) helps within a single partition. But workflows that span multiple services typically span multiple topics and partitions. The ordering guarantee evaporates at the boundary where you need it most.
At-least-once delivery guarantees that every event arrives. It does not guarantee when, in what order, or how many times. Consumers must handle all three.
The practical result is that every consumer must be defensive. It must handle events that arrive out of order, events that arrive twice, and events that arrive minutes or hours late. This isn't a library you install. It's an architectural constraint that shapes every consumer in the system.
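What "defensive" means in practice can be sketched in a few lines: deduplicate by event id, and park out-of-order events until their predecessors arrive. The field names (`id`, `seq`) are illustrative assumptions; in production the dedupe set and buffer would live in durable storage, not memory:

```python
class DefensiveConsumer:
    def __init__(self):
        self.processed_ids = set()  # dedupe store (a table in production)
        self.buffer = {}            # out-of-order events parked by sequence
        self.next_seq = 1

    def handle(self, event: dict) -> list:
        """Returns the events that became processable, in order."""
        if event["id"] in self.processed_ids:
            return []  # duplicate delivery: at-least-once means this WILL happen
        self.buffer[event["seq"]] = event
        released = []
        # Release the longest contiguous run starting at next_seq.
        while self.next_seq in self.buffer:
            e = self.buffer.pop(self.next_seq)
            self.processed_ids.add(e["id"])
            released.append(e)
            self.next_seq += 1
        return released
```

Note what this costs: every consumer now carries state, and that state must survive restarts, which is part of why "defensive" is an architectural constraint rather than a library.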
Retries Amplify Failures
At-least-once delivery means consumers retry on failure. This is correct behavior. But when a downstream service is struggling, retries amplify the problem.
A consumer processes events from a queue. It calls a database to store results. The database is slow due to lock contention. Consumer processing times out. The event is redelivered. The consumer tries again. The database is still slow, now with additional load from the retry. More timeouts. More redeliveries.
The core problem is simple: the queue retries faster than the system can recover.
The queue doesn't know the database is overloaded. It just knows the consumer didn't acknowledge the event.
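The standard mitigation is to space retries out so they back off as failures accumulate. A minimal sketch of "full jitter" exponential backoff (the variant described on the AWS Architecture Blog); the parameter values are illustrative:

```python
import random

def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> list:
    """Delay (seconds) before each retry attempt, with full jitter.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so concurrent retries spread out instead of hammering the database
    in synchronized waves.
    """
    return [random.uniform(0.0, min(cap, base * (2 ** a))) for a in range(attempts)]
```

Backoff alone doesn't tell the queue that the database is down; pairing it with a retry budget or circuit breaker is what actually stops the amplification loop.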
When events coordinate multi-step workflows, debugging these failures becomes archaeology. To answer "why didn't this task run?" you have to reconstruct the chain of events across multiple services. Which event triggered the step? Did it arrive? Was it processed? Did the retry succeed or create a duplicate? Each question requires correlating logs across service boundaries.
Observability Is the Real Problem
In synchronous systems, failures are obvious. A request fails. You get a stack trace.
In event systems, failures scatter across five services and three queues.
Correlation IDs help, but only if every service propagates them correctly through every event. One service that drops the correlation ID breaks the entire trace. One service that logs to a different system creates a gap in the timeline.
The worst part is discovering the correlation ID was dropped by one service three months ago and nobody noticed.
The hardest observability problem isn't failures. It's the absence of events. If an event was never emitted (a bug in the producer), the consumer never processes it, and nothing fails. The order just never gets fulfilled. No error. No alert. Nothing to debug until someone asks "where is my order?"
Monitoring consumer lag, event counts, and expected-vs-actual processing rates catches some of these issues. But designing observability for what didn't happen is fundamentally harder than observability for what did. In a workflow with ten steps coordinated by events, the failure surface isn't ten services. It's the space between them.
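Observing what didn't happen usually means arming a deadline when a step completes and alerting if the expected follow-up event never arrives. A minimal in-memory sketch; class and method names are illustrative assumptions, and a production version would persist the deadlines:

```python
import time

class MissingEventWatchdog:
    def __init__(self):
        self.deadlines = {}  # workflow_id -> (expected_event_type, deadline)

    def expect(self, workflow_id, event_type, within_seconds, now=None):
        # Arm when the upstream step completes: "payment.completed must
        # be followed by fulfillment.started within N seconds."
        now = time.monotonic() if now is None else now
        self.deadlines[workflow_id] = (event_type, now + within_seconds)

    def observe(self, workflow_id, event_type):
        expected, _ = self.deadlines.get(workflow_id, (None, None))
        if expected == event_type:
            del self.deadlines[workflow_id]

    def overdue(self, now=None):
        # The alertable set: workflows whose expected event never arrived.
        now = time.monotonic() if now is None else now
        return [(wid, ev) for wid, (ev, dl) in self.deadlines.items() if now > dl]
```

This is the "where is my order?" question asked by a machine on a timer instead of a customer three days later.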
From Events to Orchestration
Event-driven systems don't remove complexity. They trade visible complexity for invisible complexity. The architecture diagram gets simpler. The failure modes get harder.
For simple publish-subscribe patterns, this tradeoff works. But as workflows grow in steps and dependencies, raw event coordination breaks down. When five services must execute in sequence with retries, compensations, and timeout handling, "emit events and hope consumers handle it" isn't an architecture. It's a liability.
This is where workflow engines change the equation. Instead of each service coordinating its own piece of the process, a single system tracks the entire workflow: which step is running, which steps succeeded, which steps failed, and what should happen next. The event chain becomes a queryable state machine. "Why didn't this task run?" becomes a database query instead of a forensic investigation.
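What "queryable state machine" means can be shown in miniature. This is a toy sketch of the bookkeeping a workflow engine does per run, not any particular engine's API; step names and statuses are illustrative:

```python
class Workflow:
    """Tracks per-step status so 'why didn't step X run?' is a lookup."""
    def __init__(self, steps: list):
        self.order = list(steps)
        self.status = {s: "pending" for s in steps}

    def complete(self, step):
        self.status[step] = "succeeded"

    def fail(self, step, reason):
        self.status[step] = f"failed: {reason}"

    def why_not_run(self, step) -> str:
        # Walk the sequence until we hit the step itself or a blocker.
        for s in self.order:
            if s == step:
                return f"{step} is {self.status[step]}"
            if self.status[s] != "succeeded":
                return f"blocked: upstream step {s!r} is {self.status[s]}"
```

With raw events, answering that question means grepping five services' logs; with explicit workflow state, it's one read against one store.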
These same problems are starting to appear in AI agent systems. An agent coordinating tool calls across multiple services is a multi-step workflow with unreliable steps: model calls time out, tool invocations fail, context gets lost between steps. The failure modes are identical: ordering, retries, hidden coupling, and invisible failures. Agents need the same infrastructure that event-driven workflows need: durable state, retry policies, dead letter queues, and observability into what actually happened.
Event-driven architecture is powerful.
But once workflows span multiple services, events alone stop being enough.
At that point you're not building an event system anymore.
You're building a distributed workflow engine.
Whether you realize it or not.
References
- Should You Put Several Event Types in the Same Kafka Topic? · Confluent Blog
- Event-Driven Architecture · Martin Fowler
- Apache Kafka Documentation · Apache Kafka
- Scaling Event Sourcing for Netflix Downloads · Netflix Technology Blog