Why Event-Driven Systems Are Harder Than They Look

9 minute read | Published January 2026

Event-driven architecture looks elegant on a whiteboard. Services emit events. Other services consume them. Everything is decoupled. Everything scales independently. The architecture diagram is clean and the team is excited.

Six months later, the system is a mess. Events arrive out of order. Consumers process duplicates. A schema change in one service breaks three others silently. A retry storm takes down a downstream database. Nobody can figure out why order 48291 was charged twice.

Event-driven systems are harder than they look because the problems they create are invisible until they explode.

Hidden Coupling

The first thing teams discover is that decoupling is an illusion. Services that communicate through events are coupled through event schemas. The payment service emits payment.completed with a specific structure. The notification service, the analytics pipeline, the fulfillment service, and the reconciliation job all depend on that structure.

When the payment service adds a field, nothing breaks. When it renames a field, everything breaks. When it changes the semantics of a field - say, amount now includes tax where it previously excluded it - everything breaks silently. The consumers still parse the event. They just produce wrong results.

Event-driven coupling is invisible until a schema change breaks multiple downstream consumers simultaneously.

This is worse than synchronous coupling. With a direct API call, a breaking change produces an immediate error. With events, a breaking change produces wrong data that looks correct. The notification service shows the right number because it just displays it. The analytics pipeline computes wrong revenue. The reconciliation job catches the discrepancy days later.

The coupling did not disappear. It moved from compile-time to runtime, from immediate to delayed, from visible to invisible.
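
One way to make the tax-semantics failure loud instead of silent is to version the event schema and refuse events the consumer does not understand. A minimal sketch, assuming a hypothetical schema_version field on payment.completed (the field name and version numbers are illustrative, not part of any real payload):

```python
# Sketch: fail loudly on an unrecognized schema major version instead of
# silently computing wrong revenue. "schema_version" is an assumed field.

SUPPORTED_MAJOR = 2  # this consumer understands payment.completed v2.x,
                     # where "amount" includes tax

def handle_payment_completed(event: dict) -> float:
    major = int(str(event.get("schema_version", "1.0")).split(".")[0])
    if major != SUPPORTED_MAJOR:
        # Route to a dead-letter queue or alert; do not guess at semantics
        raise ValueError(f"unsupported payment.completed schema v{major}")
    return event["amount"]
```

The point is not the version check itself but the failure mode: an unrecognized version raises immediately, which is the behavior a renamed field gives you for free and a semantic change does not.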

Retry Storms

At-least-once delivery means failed events are redelivered until the consumer acknowledges them. This is correct behavior. But when a downstream service is struggling, redelivery amplifies the problem.

A consumer processes events from a queue. It calls a database to store results. The database is slow due to a lock contention issue. Consumer processing times out. The event is redelivered. The consumer tries again. The database is still slow, now with additional load from the retry. More timeouts. More redeliveries.

  Time →

t=0    Consumer processes event → DB slow → timeout
t=1    Event redelivered + new events arriving
       Consumer processes 2 events → DB slower → 2 timeouts
t=2    2 events redelivered + new events arriving
       Consumer processes 4 events → DB overloaded → 4 timeouts
t=3    4 events redelivered + new events arriving
       Consumer processes 8 events → DB down

┌──────────────────────────────────────────────┐
│  Consumer lag: 0 → 100 → 1,000 → 50,000     │
│  DB connections: 10 → 40 → 160 → saturated  │
│  Error rate: 5% → 40% → 90% → 100%          │
└──────────────────────────────────────────────┘

The event system faithfully redelivers every failed event.
Each redelivery makes the underlying problem worse.
Retry storm cascade. A slow downstream dependency triggers exponential load amplification through the event system.

Circuit breakers help. Consumer-side rate limiting helps. But the fundamental dynamic is that event systems amplify downstream failures because the retry mechanism is built into the delivery infrastructure, not the application. The queue does not know the database is overloaded. It just knows the consumer did not acknowledge the event.
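
A consumer-side circuit breaker along these lines can break the cascade: after repeated failures, stop calling the database for a cooldown period and pause consumption instead of feeding the retry loop. A minimal sketch (the threshold and cooldown values are illustrative):

```python
import time

class CircuitBreaker:
    """Minimal consumer-side circuit breaker: after `threshold` consecutive
    failures, stop calling the dependency for `cooldown` seconds so queue
    redeliveries stop piling load onto a struggling database."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: let one attempt through to probe recovery
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

In the consumer loop, when allow() returns False, pause fetching from the queue or negatively acknowledge with a delay rather than attempting the database call. The queue still redelivers, but the consumer stops converting every redelivery into fresh load on the dependency.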

Consumer Idempotency Is Harder Than It Sounds

Every guide says "make your consumers idempotent." In practice, this is difficult.

Simple deduplication - tracking processed event IDs - works for straightforward cases. But real consumers have side effects: they write to databases, call APIs, send notifications, update caches. Making all of those operations idempotent requires transactional coordination.

def process_order_event(event):
    # This must be atomic with the dedup check
    if already_processed(event.id):
        return

    order = create_order(event.data)        # DB write
    send_confirmation(order)                 # Email API call
    update_inventory(order.items)            # Another DB write
    emit_event("order.created", order)       # Another event

    mark_processed(event.id)                 # Dedup record

If the process crashes after send_confirmation but before mark_processed, the event is redelivered. The consumer creates a duplicate order and sends a duplicate email. Making this fully idempotent requires wrapping everything in a single transaction - but the email API is not transactional.

True idempotency in event consumers requires designing every side effect to be safely repeatable. This is not a library you install. It is an architectural constraint that affects every line of consumer code.

The practical solution is to separate the idempotent state change (database write + dedup record in one transaction) from the non-idempotent side effects (email, external API calls) and accept that side effects may occasionally execute twice. This is a design tradeoff, not a clean solution.
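
That separation can be sketched with the dedup record and the order write committed in one database transaction, and the email kept outside it. SQLite stands in for the real database here; the field names and the send_email callback are assumptions for illustration:

```python
import sqlite3

# In-memory database for illustration; a real consumer would use its
# service database so the dedup record and state change share a transaction.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processed (event_id TEXT PRIMARY KEY);
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, total REAL);
""")

def process_order_event(event: dict, send_email) -> bool:
    """The dedup record and the order write commit atomically. The email
    sits outside the transaction and may occasionally repeat on redelivery."""
    try:
        with conn:  # one transaction: dedup insert + state change
            conn.execute("INSERT INTO processed VALUES (?)", (event["id"],))
            conn.execute("INSERT INTO orders VALUES (?, ?)",
                         (event["order_id"], event["total"]))
    except sqlite3.IntegrityError:
        # Duplicate delivery: state was already applied, skip side effects
        return False
    send_email(event["order_id"])  # non-transactional, at-least-once
    return True
```

If the process crashes after the commit but before the email, redelivery is a no-op for state and the email may be sent late or twice. That matches the tradeoff above: the state change is exactly-once, the side effect is at-least-once.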

Data Drift Between Services

In a synchronous system, services share state at the moment of the call. In an event-driven system, services maintain their own projections of shared state, updated asynchronously. These projections drift.

The order service processes a cancellation. It emits order.cancelled. The fulfillment service has not consumed this event yet because it is processing a backlog. It ships the order. The customer receives a package for an order they cancelled.

The window between event emission and consumption is a consistency gap. In high-throughput systems, this gap is usually milliseconds. During incidents, backlogs, or deployments, it can be minutes or hours.

Timeline:
  t=0   Order service: cancels order #1234
  t=0   Order service: emits order.cancelled
  t=0   Fulfillment service: processing backlog (lag: 45 min)
  t=30  Fulfillment service: processes order.created for #1234
  t=30  Fulfillment service: ships order #1234
  t=45  Fulfillment service: processes order.cancelled for #1234
  t=45  Fulfillment service: too late, already shipped

This is not a bug. This is the fundamental tradeoff of eventual consistency. The system is correct - eventually. But "eventually" can be too late for business operations that have real-world side effects.
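
A reconciliation job for this kind of drift can start as simply as comparing the two services' projections. A sketch, assuming each service can export a mapping of order IDs to status (the status values are illustrative):

```python
def find_drift(orders_view: dict, fulfillment_view: dict) -> list:
    """Flag orders cancelled in the order service's projection but already
    shipped according to fulfillment's projection. Both arguments map
    order_id -> status and are assumed exports from each service."""
    drift = []
    for order_id, status in orders_view.items():
        if status == "cancelled" and fulfillment_view.get(order_id) == "shipped":
            drift.append(order_id)
    return drift
```

The job cannot prevent the shipment, but it turns a silent inconsistency into a detected one, which is what makes the refund or intercept possible.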

Observability Gaps

In a synchronous system, a request failure produces a stack trace. In an event-driven system, a failure produces fragments across multiple service logs with no inherent connection.

Correlation IDs help, but only if every service propagates them correctly through every event. One service that drops the correlation ID breaks the entire trace. One service that logs to a different system creates a gap in the timeline.

The hardest observability problem is not failures - it is the absence of events. If an event was never emitted (a bug in the producer), the consumer never processes it, and nothing fails. The order just never gets fulfilled. No error. No alert. Nothing to debug until someone asks "where is my order?"

Monitoring consumer lag, event counts, and expected-vs-actual processing rates catches some of these issues. But designing observability for what did not happen is fundamentally harder than observability for what did happen.
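
Detecting the absence of an event amounts to checking expected follow-ups against a deadline. A sketch, assuming you can query the timestamps of order.created events and the set of order IDs that produced order.fulfilled (both inputs are illustrative assumptions about your event store):

```python
from datetime import datetime, timedelta

def missing_followups(created: dict, fulfilled: set,
                      sla: timedelta = timedelta(hours=2),
                      now: datetime = None) -> list:
    """Return order IDs whose order.created is older than `sla` but which
    never produced order.fulfilled. `created` maps order_id -> timestamp."""
    now = now or datetime.utcnow()
    return [oid for oid, ts in created.items()
            if oid not in fulfilled and now - ts > sla]
```

Run on a schedule, this converts "nothing happened" into an alertable signal, which is the closest observability gets to watching for events that were never emitted.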

How to Design Event-Driven Systems That Survive

After operating event-driven systems across healthcare and financial transaction workflows:

  • Version your event schemas from day one. Treat events as a public API contract with the same discipline.
  • Implement consumer-side circuit breakers. The queue's retry mechanism is not enough to protect downstream dependencies.
  • Build reconciliation jobs that detect drift between service projections. Do not trust eventual consistency for business-critical operations.
  • Monitor consumer lag as a primary operational metric. Lag is the early warning for every problem listed above.
  • Accept that some operations need synchronous confirmation. Not everything belongs in an event. Time-sensitive operations with real-world side effects may need direct calls.
  • Design consumers to handle duplicate, out-of-order, and delayed events. Assume the worst delivery characteristics and build accordingly.
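
The last point - tolerating duplicate, out-of-order, and delayed events - can be sketched with a per-entity version number, so stale or repeated events become no-ops (the event fields here are illustrative assumptions):

```python
class OrderProjection:
    """Sketch: apply events idempotently and out of order by keeping the
    highest version seen per order. Duplicates and older events are no-ops."""

    def __init__(self):
        self.state = {}  # order_id -> (version, status)

    def apply(self, event: dict) -> bool:
        current = self.state.get(event["order_id"], (0, None))
        if event["version"] <= current[0]:
            return False  # duplicate or out-of-order: already superseded
        self.state[event["order_id"]] = (event["version"], event["status"])
        return True
```

This requires the producer to stamp a monotonically increasing version per entity, which is itself a design commitment; without some ordering key, a consumer cannot distinguish "stale" from "new".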

Event-driven architecture is powerful. It is also unforgiving. The systems that survive are the ones designed by engineers who understand what will go wrong, not just what should go right.