Event Driven Architecture for Transaction Systems

8 minute read | Published October 2025

Overview

When transaction workflows span multiple services, synchronous request-response architectures create tight coupling that amplifies failures. A slow fraud detection service blocks checkout. A payment service outage takes down order creation. A notification failure prevents fulfillment from starting.

Event driven architecture decouples these services. Instead of calling each other directly, services emit events that other services consume independently. The payment service does not know or care whether the notification service is running. It publishes a payment.completed event and moves on.

This decoupling enables independent scaling, independent failure, and independent deployment of services that participate in the same transaction workflow. But it introduces a different set of problems: hidden coupling through event schemas, duplicate processing from retries, and observability challenges that make debugging significantly harder.

Architecture

[Diagram] Event driven transaction architecture. Services communicate exclusively through the event stream.

Events are partitioned by transaction ID, which guarantees that all events for a given transaction are processed in order by a single consumer instance. This eliminates cross-partition ordering issues for the most common case while allowing parallel processing across different transactions.
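
As a sketch of how that routing works, assuming a Kafka-style keyed partitioner (`partition_for` is an illustrative helper, not a real client API): a stable hash of the transaction ID picks the partition, so every event for one transaction lands on the same partition and the same consumer instance.

```python
import hashlib

def partition_for(transaction_id: str, num_partitions: int) -> int:
    """Deterministically map a transaction ID to a partition.

    A stable hash (not Python's builtin hash(), which is salted per
    process) guarantees every event for the same transaction always
    routes to the same partition, preserving per-transaction order.
    """
    digest = hashlib.sha256(transaction_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for txn-42 map to a single partition; different
# transactions spread across the others and process in parallel.
p = partition_for("txn-42", 8)
assert all(partition_for("txn-42", 8) == p for _ in range(100))
```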

The event stream retains events for a configurable period, enabling consumers to replay events for recovery or backfilling. This makes the event log an audit trail and a recovery mechanism, not just a transport layer.
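
A minimal in-memory sketch of that replay capability (a real system would use Kafka or similar; `EventLog` here is illustrative only): appending assigns an offset, and a recovering consumer re-reads from any retained offset.

```python
class EventLog:
    """Minimal retained event log: consumers replay from any offset."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the appended event

    def replay(self, from_offset=0):
        # Re-deliver every retained event at or after the offset, in order.
        return iter(self._events[from_offset:])

log = EventLog()
log.append({"type": "payment.completed", "txn": "t1"})
log.append({"type": "order.shipped", "txn": "t1"})
# A recovering or backfilling consumer replays from offset 0.
replayed = list(log.replay(0))
```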

Key Engineering Challenges

Consumer Idempotency

Every consumer must handle duplicate events safely. The standard approach is to track processed event IDs and skip duplicates:

class IdempotentConsumer:
    def __init__(self, handler, processed_store, metrics):
        self.handler = handler
        self.processed_store = processed_store
        self.metrics = metrics

    def consume(self, event):
        # At-least-once delivery makes redelivery expected, not
        # exceptional: skip anything we have already processed.
        if self.processed_store.exists(event.id):
            self.metrics.increment("events.duplicate")
            return

        try:
            self.handler.process(event)
            self.processed_store.mark_processed(event.id)
        except Exception:
            self.metrics.increment("events.failed")
            raise

The critical detail is that mark_processed must happen in the same transaction as the business logic side effects. If the handler writes to a database, the processed event ID should be written in the same database transaction. Otherwise, a crash between processing and marking creates the same duplication problem.
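
One way to get that atomicity, sketched with SQLite standing in for the consumer's database (the table names are hypothetical): the processed-event ID and the business write share a single transaction, and the primary key on the event ID turns a redelivery into a no-op.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE payments (transaction_id TEXT, amount REAL)")

def handle_payment(conn, event):
    """Apply the side effect and mark the event processed atomically."""
    try:
        with conn:  # one transaction: both writes commit, or neither does
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["id"],),
            )
            conn.execute(
                "INSERT INTO payments (transaction_id, amount) VALUES (?, ?)",
                (event["transaction_id"], event["amount"]),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate event: the primary-key insert failed, the whole
        # transaction rolled back, and the side effect never happened.
        return False

event = {"id": "evt-1", "transaction_id": "txn-9", "amount": 25.0}
assert handle_payment(conn, event) is True
assert handle_payment(conn, event) is False  # redelivery is a no-op
```

Because a crash before the commit rolls back both writes, the event is redelivered and reprocessed cleanly; a crash after the commit is caught by the primary-key check.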

Schema Evolution

Event schemas must evolve without breaking existing consumers. This requires treating events as a contract between services with explicit versioning and compatibility rules.

// Event schema with version field
interface PaymentCompletedV2 {
  version: 2;
  event_id: string;
  transaction_id: string;
  amount: number;
  currency: string;       // Added in v2
  payment_method: string; // Added in v2
  timestamp: string;
}

// Consumer handles multiple versions
function handlePaymentCompleted(event: PaymentCompletedV1 | PaymentCompletedV2) {
  const currency = "version" in event && event.version >= 2
    ? event.currency
    : "USD"; // Default for v1 events

  processPayment(event.transaction_id, event.amount, currency);
}

The rule is: new fields can be added, existing fields cannot be removed or renamed, and consumers must handle older event versions gracefully. This is the same backward compatibility discipline as API versioning, but applied to every event in the system.
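
That additive rule can be checked mechanically, for example in CI before a schema change ships. A minimal sketch (the field sets here are illustrative, not the real schemas):

```python
def is_backward_compatible(old_fields: set, new_fields: set) -> bool:
    """A new schema version may add fields but must keep every existing one."""
    return old_fields <= new_fields

v1 = {"event_id", "transaction_id", "amount", "timestamp"}
v2 = v1 | {"version", "currency", "payment_method"}  # additive: compatible
v3 = v2 - {"amount"}                                 # removed a field: breaking

assert is_backward_compatible(v1, v2)
assert not is_backward_compatible(v2, v3)
```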

Distributed Tracing

Every event must carry a correlation ID that links it to the originating transaction. This enables reconstructing the complete transaction flow across services:

import json
import uuid
from datetime import datetime, timezone

class EventEmitter:
    def emit(self, event_type, data, correlation_id):
        event = {
            "id": str(uuid.uuid4()),
            "type": event_type,
            "correlation_id": correlation_id,
            "data": data,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "source_service": self.service_name,
        }
        # Keying by transaction ID keeps all events for one
        # transaction on the same partition, preserving order.
        self.stream.produce(
            topic=event_type,
            key=data.get("transaction_id"),
            value=json.dumps(event),
        )

The correlation ID is generated at the system boundary (the first API call) and propagated through every event in the chain. Tracing infrastructure uses this ID to reconstruct the full transaction timeline across services, even when events were processed hours apart.
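
A sketch of that propagation, assuming each consumer copies `correlation_id` verbatim into every event it emits downstream (`build_event` is a hypothetical helper, not part of the emitter above):

```python
import uuid

def new_correlation_id() -> str:
    # Minted exactly once, at the system boundary (the first API call).
    return str(uuid.uuid4())

def build_event(event_type, data, correlation_id, source_service):
    """Every event gets a fresh ID but shares the chain's correlation ID."""
    return {
        "id": str(uuid.uuid4()),           # unique per event
        "type": event_type,
        "correlation_id": correlation_id,  # constant across the chain
        "data": data,
        "source_service": source_service,
    }

cid = new_correlation_id()
first = build_event("payment.completed", {"transaction_id": "t1"}, cid, "payments")
# The consumer of payment.completed propagates the same ID downstream,
# so tracing can stitch both events into one transaction timeline.
second = build_event("notification.sent", {"transaction_id": "t1"},
                     first["correlation_id"], "notifications")
assert first["correlation_id"] == second["correlation_id"]
assert first["id"] != second["id"]
```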

Design Tradeoffs

Event driven architecture shifts complexity from runtime failures to design-time decisions. The system handles failures more gracefully, but requires more upfront investment in schema design, consumer idempotency, and observability.

Decoupling over simplicity. Services can fail, scale, and deploy independently. The cost is that system behavior emerges from the interaction of independent components, making it harder to reason about locally. A developer working on the notification service must understand the events it consumes, not just its own code.

At-least-once delivery over exactly-once. At-least-once delivery is simpler, more robust, and available in every event system. Exactly-once adds significant infrastructure complexity for a guarantee that still requires idempotent consumers in practice. The duplication is handled at the application layer, not the infrastructure layer.

Event retention for replay over point-in-time state. Retaining events enables consumer recovery, backfilling, and auditing. The storage cost is significant but predictable. The alternative - reconstructing state from multiple service databases - is fragile, slow, and often incomplete.

Lessons Learned

Events do not eliminate coupling. They change the shape of coupling from runtime dependencies to schema dependencies. Managing event schemas with the same discipline as API contracts prevents the most common source of production incidents in event driven systems.

Consumer lag is the most important operational metric. Lag measures how far behind the head of the stream a consumer has fallen. Growing lag means delayed notifications, stale analytics, or incomplete order processing. Alert on consumer lag before customers notice.
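
A minimal sketch of the lag computation, assuming a Kafka-style model where per-partition lag is the head offset minus the consumer's committed offset (the threshold value is illustrative and should be tuned to the consumer's SLO):

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Lag per partition: events produced but not yet processed."""
    return {
        p: latest_offsets[p] - committed_offsets.get(p, 0)
        for p in latest_offsets
    }

latest = {0: 1500, 1: 1420, 2: 980}     # head of each partition
committed = {0: 1500, 1: 1100, 2: 975}  # consumer's committed position
lag = consumer_lag(latest, committed)

LAG_ALERT_THRESHOLD = 100  # illustrative; tune per consumer
alerting = [p for p, n in lag.items() if n > LAG_ALERT_THRESHOLD]
assert lag == {0: 0, 1: 320, 2: 5}
assert alerting == [1]  # partition 1 has fallen behind
```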

Dead letter queues need tooling, not just monitoring. A DLQ alert tells you something failed. Tooling that lets you inspect failed events, fix the issue, and replay them back through the consumer turns incidents into recoverable situations.
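
A sketch of what such tooling might look like: inspect the failed events, optionally patch them, and replay them through the consumer. The `DeadLetterQueue` class and the failing event here are illustrative, not a real queue API.

```python
class DeadLetterQueue:
    """Minimal DLQ with the three operations replay tooling needs:
    inspect failed events, patch them, replay them through a consumer."""

    def __init__(self):
        self._failed = []

    def add(self, event, error):
        self._failed.append({"event": event, "error": error})

    def inspect(self):
        return list(self._failed)

    def replay(self, consumer, patch=None):
        """Re-run every failed event; keep only the ones that fail again."""
        still_failing = []
        for entry in self._failed:
            event = patch(entry["event"]) if patch else entry["event"]
            try:
                consumer(event)
            except Exception as e:
                still_failing.append({"event": event, "error": str(e)})
        self._failed = still_failing
        return len(still_failing)

dlq = DeadLetterQueue()
dlq.add({"id": "evt-7", "amount": None}, "amount missing")

def consumer(event):
    if event["amount"] is None:
        raise ValueError("amount missing")

# Fix the bad field, then replay: the incident becomes recoverable.
remaining = dlq.replay(consumer, patch=lambda e: {**e, "amount": 10.0})
assert remaining == 0
```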

Event driven architecture is not a default choice. For services that always need synchronous responses, direct calls are simpler. Use events when services benefit from temporal decoupling - when the producer does not need to wait for the consumer to finish.