Distributed Workflow Engines

Published December 2025

Overview

Most backend systems eventually need to coordinate multi-step processes that span multiple services and take minutes, hours, or days to complete. Clinician approval workflows. Pharmacy fulfillment pipelines. Document generation chains. Payment settlement sequences.

These are not simple request-response operations. They have dependencies between steps, require retries with specific backoff strategies, must survive service restarts, and need visibility into where a workflow is stuck when something goes wrong.

The naive approach - chaining service calls with ad-hoc retry logic scattered across codebases - works until it does not. When it fails, you get invisible failures, inconsistent retry behavior, and no way to answer the question: "What happened to order 47291?"

A workflow engine centralizes this coordination. It owns the state machine, manages retries, and provides observability into every running workflow.

Architecture

Workflow engine architecture. The coordinator manages workflow state while workers execute individual tasks.

The workflow engine receives a request to start a workflow and creates a state machine representing the steps. It dispatches tasks to a queue, and worker processes pick up and execute them. On completion or failure, workers report back to the engine, which advances the state machine.

State is persisted durably. If the engine restarts, it resumes workflows from their last persisted state. If a worker crashes mid-task, the task times out and is requeued for another worker.

Failed tasks that exceed their retry budget are moved to a dead letter queue for manual inspection. The engine tracks every state transition, providing a complete audit trail of workflow execution.
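The lease-and-requeue lifecycle described above can be sketched as a small in-memory queue. The names here (`lease_seconds`, `ack`, `reap_expired`) are illustrative assumptions, not the engine's actual API, and a real implementation would back this with durable storage.

```python
import time
from collections import deque

class TaskQueue:
    """Sketch of lease-based dispatch with a dead letter queue."""

    def __init__(self, lease_seconds=30, max_attempts=3):
        self.lease_seconds = lease_seconds
        self.max_attempts = max_attempts
        self.pending = deque()   # tasks waiting for a worker
        self.leased = {}         # task_id -> (task, lease expiry timestamp)
        self.dead_letter = []    # tasks that exhausted their retry budget

    def enqueue(self, task):
        self.pending.append(task)

    def lease(self, now=None):
        """A worker claims the next task; it must ack before the lease expires."""
        if now is None:
            now = time.time()
        if not self.pending:
            return None
        task = self.pending.popleft()
        task["attempts"] = task.get("attempts", 0) + 1
        self.leased[task["id"]] = (task, now + self.lease_seconds)
        return task

    def ack(self, task_id):
        """Worker reports completion; the task leaves the system."""
        self.leased.pop(task_id, None)

    def reap_expired(self, now=None):
        """Requeue tasks whose worker crashed; dead-letter them past the budget."""
        if now is None:
            now = time.time()
        for task_id, (task, expiry) in list(self.leased.items()):
            if expiry <= now:
                del self.leased[task_id]
                if task["attempts"] >= self.max_attempts:
                    self.dead_letter.append(task)
                else:
                    self.pending.append(task)
```

A background reaper would call `reap_expired` periodically; anything landing in `dead_letter` is what operators inspect and replay.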

Key Engineering Challenges

State Machine Design

The state machine must handle not just the happy path but every failure mode: task timeouts, worker crashes, partial completions, and out-of-order callbacks.

class WorkflowState:
    """Lifecycle states shared by workflows and individual steps."""
    PENDING = "pending"            # defined but not yet dispatched
    RUNNING = "running"            # a worker is executing the step
    WAITING = "waiting"            # blocked on prerequisites or an external event
    COMPLETED = "completed"
    FAILED = "failed"
    COMPENSATING = "compensating"  # unwinding previously completed steps

class WorkflowStep:
    def __init__(self, name, handler, compensation=None):
        self.name = name
        self.handler = handler            # callable that performs the step
        self.compensation = compensation  # optional undo handler for rollback
        self.state = WorkflowState.PENDING
        self.attempts = 0
        self.max_attempts = 3
        self.last_error = None

Each step tracks its own state, attempt count, and last error. The engine evaluates step dependencies before dispatching - a step only becomes RUNNING when all its prerequisites are COMPLETED. If a step exhausts its retries, the engine enters COMPENSATING state and runs compensation handlers for all previously completed steps in reverse order.
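That scheduling rule and the reverse-order unwind can be sketched as two pure functions. The `states`/`deps` dictionaries and function names are illustrative assumptions, not the engine's real data model.

```python
def runnable(states, deps):
    """Names of steps that may be dispatched: PENDING with all deps COMPLETED."""
    done = {name for name, state in states.items() if state == "completed"}
    return sorted(name for name, state in states.items()
                  if state == "pending"
                  and all(d in done for d in deps.get(name, ())))

def compensation_order(completed_in_order):
    """Completed steps are compensated in reverse completion order."""
    return list(reversed(completed_in_order))
```

In this model, every engine tick computes `runnable` from persisted state, so a restarted engine naturally resumes dispatching from wherever the workflow left off.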

Retry Strategy Configuration

Different tasks need different retry behavior. A payment authorization retry should be aggressive (short intervals, few attempts) because authorization holds expire. A document generation retry can be patient (longer intervals, more attempts) because the operation is expensive but not time-sensitive.

const workflowDefinition = {
  steps: [
    {
      name: "authorize_payment",
      retry: { maxAttempts: 3, backoff: "linear", interval: "1s" },
      timeout: "10s",
      compensation: "void_authorization",
    },
    {
      name: "create_order",
      retry: { maxAttempts: 5, backoff: "exponential", interval: "2s" },
      timeout: "30s",
      compensation: "cancel_order",
      dependsOn: ["authorize_payment"],
    },
    {
      name: "generate_documents",
      retry: { maxAttempts: 10, backoff: "exponential", interval: "30s" },
      timeout: "5m",
      dependsOn: ["create_order"],
    },
  ],
};

The workflow definition is declarative. Engineers define steps, dependencies, retry policies, and compensations. The engine handles execution, state transitions, and failure recovery. This separates business logic (what to do) from infrastructure logic (how to coordinate it).
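One way the engine might interpret those retry policies is a small delay calculator. This is a sketch: the interval grammar (`1s`, `5m`) and field names mirror the definition above, but `parse_interval` and `retry_delay` are assumed helpers, not part of any real engine.

```python
import re

def parse_interval(text):
    """Parse intervals like '1s', '30s', '5m' into seconds (sketch, not a full parser)."""
    match = re.fullmatch(r"(\d+)([sm])", text)
    value, unit = int(match.group(1)), match.group(2)
    return value * (60 if unit == "m" else 1)

def retry_delay(policy, attempt):
    """Seconds to wait before retry `attempt` (1-based) under a step's policy.

    'linear' grows as interval * attempt; 'exponential' doubles each attempt.
    Returns None once the attempt budget is spent.
    """
    if attempt > policy["maxAttempts"]:
        return None
    base = parse_interval(policy["interval"])
    if policy["backoff"] == "linear":
        return base * attempt
    return base * (2 ** (attempt - 1))  # exponential backoff
```

Production systems usually add jitter to these delays so that simultaneous failures do not retry in lockstep.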

Workflow Versioning

Long-running workflows span deployments. When the workflow definition changes between versions, in-flight workflows must continue executing under their original definition.

The solution is to version workflow definitions and pin each workflow instance to the version that started it. New workflows use the latest definition. Existing workflows complete under their original version.

This requires the engine to maintain multiple workflow definitions simultaneously and route execution based on the version stored with each workflow instance.
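A minimal sketch of that routing, assuming a registry keyed by version string (the class and method names here are illustrative):

```python
class DefinitionRegistry:
    """Pin each workflow instance to the definition version it started with."""

    def __init__(self):
        self.versions = {}   # version -> workflow definition
        self.latest = None
        self.instances = {}  # instance_id -> pinned version

    def register(self, version, definition):
        self.versions[version] = definition
        self.latest = version  # only new workflows pick this up

    def start(self, instance_id):
        """New workflows are pinned to the latest version at start time."""
        self.instances[instance_id] = self.latest
        return self.versions[self.latest]

    def definition_for(self, instance_id):
        """In-flight workflows keep executing under their original definition."""
        return self.versions[self.instances[instance_id]]
```

The pinned version must be persisted alongside the rest of the workflow state, so recovery after a restart resolves the same definition the instance started with.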

Design Tradeoffs

Centralized orchestration trades flexibility for observability. Every workflow is visible, debuggable, and auditable from a single system. The cost is that all workflow changes must go through the engine.

Orchestration over choreography. In a choreography model, services react to events and coordinate implicitly. This is flexible but makes it nearly impossible to answer "where is this workflow stuck?" Orchestration centralizes visibility at the cost of introducing a coordination dependency.

At-least-once delivery with idempotent workers over exactly-once semantics. Exactly-once delivery is expensive and fragile. At-least-once is simple and reliable, but requires every worker to handle duplicate task deliveries gracefully. This pushes idempotency responsibility to the edges, which is manageable with idempotency keys.
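A sketch of the worker-side pattern, assuming each task carries an idempotency key; the in-memory `results` dict stands in for a durable store keyed by that value.

```python
class IdempotentWorker:
    """Duplicate-safe task handling under at-least-once delivery."""

    def __init__(self, handler):
        self.handler = handler
        self.results = {}  # idempotency_key -> cached result

    def handle(self, task):
        key = task["idempotency_key"]
        if key in self.results:       # duplicate delivery: return the cached result
            return self.results[key]
        result = self.handler(task)   # side effects run at most once per key
        self.results[key] = result
        return result
```

With this in place, the queue can redeliver freely after a timeout; a task that actually completed just returns its prior result instead of charging a card twice.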

Persistent state over in-memory execution. Persisting every state transition adds latency (a database write per step transition) but enables crash recovery, audit trails, and workflow inspection. For workflows that take hours or days, this tradeoff is unambiguous.

Lessons Learned

Centralized orchestration trades flexibility for observability, and at scale, observability wins. Being able to query "show me all stuck workflows" across the entire system is worth the coordination overhead.

Retry budgets prevent more outages than circuit breakers alone. A global limit on retry volume protects downstream services from retry storms that per-call circuit breakers miss.
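A global retry budget can be as simple as a ratio of retries to total request volume; this sketch uses an assumed 10% default, with counters that a real system would track over a sliding window.

```python
class RetryBudget:
    """Cap retries as a fraction of total request volume (illustrative ratio)."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def allow_retry(self):
        """Permit a retry only while retries stay under ratio * requests."""
        if self.retries < self.ratio * self.requests:
            self.retries += 1
            return True
        return False
```

When the downstream service is broadly unhealthy, the budget exhausts quickly and sheds the retry storm, even though each individual call site believes its own retry is reasonable.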

Workflow versioning is not optional for any system with long-running processes. Plan for it from the beginning. Retrofitting versioning into an existing engine requires migrating all in-flight workflow state, which is painful.

The dead letter queue is the most operationally important component. When production issues arise, the DLQ is where you find the evidence. Invest in tooling that makes DLQ inspection and replay easy.