Distributed Workflow Engines
Overview
Most backend systems eventually need to coordinate multi-step processes that span multiple services and take minutes, hours, or days to complete. Clinician approval workflows. Pharmacy fulfillment pipelines. Document generation chains. Payment settlement sequences.
These are not simple request-response operations. They have dependencies between steps, require retries with specific backoff strategies, must survive service restarts, and need visibility into where a workflow is stuck when something goes wrong.
The naive approach - chaining service calls with ad-hoc retry logic scattered across codebases - works until it does not. When it fails, you get invisible failures, inconsistent retry behavior, and no way to answer the question: "What happened to order 47291?"
A workflow engine centralizes this coordination. It owns the state machine, manages retries, and provides observability into every running workflow.
Problems
- Scattered retry logic. Without centralized orchestration, each service implements its own retry behavior. One service retries 3 times with no backoff. Another retries 10 times with exponential backoff. A third does not retry at all. When failures cascade, system behavior is unpredictable because the retry policies are inconsistent.
- Invisible workflow state. When a multi-step process stalls, diagnosing the problem requires reading logs across multiple services, correlating timestamps, and reconstructing the sequence of events manually. There is no single place to see: "Step 3 of 5 failed, here is the error, here are the previous attempts."
- Long-running workflows outlive deployments. A fulfillment workflow that takes 48 hours will span multiple service deployments. If the workflow state lives in memory, a deployment kills it. If it lives in a database, schema changes during deployment can corrupt it.
- Task dependencies with conditional branching. Real workflows are not linear. Approval might branch into "approved" and "rejected" paths, and each path triggers different downstream tasks. Modeling this in ad-hoc code produces brittle, hard-to-test state machines.
Architecture
The workflow engine receives a request to start a workflow and creates a state machine representing the steps. It dispatches tasks to a queue, and worker processes pick up and execute them. On completion or failure, workers report back to the engine, which advances the state machine.
State is persisted durably. If the engine restarts, it resumes workflows from their last persisted state. If a worker crashes mid-task, the task times out and is requeued for another worker.
Failed tasks that exceed their retry budget are moved to a dead letter queue for manual inspection. The engine tracks every state transition, providing a complete audit trail of workflow execution.
Key Engineering Challenges
State Machine Design
The state machine must handle not just the happy path but every failure mode: task timeouts, worker crashes, partial completions, and out-of-order callbacks.
class WorkflowState:
    PENDING = "pending"
    RUNNING = "running"
    WAITING = "waiting"
    COMPLETED = "completed"
    FAILED = "failed"
    COMPENSATING = "compensating"


class WorkflowStep:
    def __init__(self, name, handler, compensation=None):
        self.name = name
        self.handler = handler
        self.compensation = compensation  # optional undo handler, run on rollback
        self.state = WorkflowState.PENDING
        self.attempts = 0
        self.max_attempts = 3
        self.last_error = None
Each step tracks its own state, attempt count, and last error. The engine evaluates step dependencies before dispatching - a step only becomes RUNNING when all its prerequisites are COMPLETED. If a step exhausts its retries, the engine enters COMPENSATING state and runs compensation handlers for all previously completed steps in reverse order.
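The reverse-order compensation pass might look like the following sketch. The minimal `Step` class stands in for `WorkflowStep` above, and the handler names are illustrative; a real engine would also persist each transition.

```python
class Step:
    """Minimal stand-in for the WorkflowStep class above (illustrative)."""

    def __init__(self, name, compensation=None):
        self.name = name
        self.compensation = compensation
        self.compensated = False


def run_compensation(completed_steps):
    """Undo completed steps in reverse order; skip steps with no handler."""
    visited = []
    for step in reversed(completed_steps):
        if step.compensation is not None:
            step.compensation()
            step.compensated = True
        visited.append(step.name)
    return visited
```

Running compensations in reverse order matters because later steps typically depend on the effects of earlier ones: an order must be cancelled before the payment authorization that funded it is voided.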
Retry Strategy Configuration
Different tasks need different retry behavior. A payment authorization retry should be aggressive (short intervals, few attempts) because authorization holds expire. A document generation retry can be patient (longer intervals, more attempts) because the operation is expensive but not time-sensitive.
const workflowDefinition = {
  steps: [
    {
      name: "authorize_payment",
      retry: { maxAttempts: 3, backoff: "linear", interval: "1s" },
      timeout: "10s",
      compensation: "void_authorization",
    },
    {
      name: "create_order",
      retry: { maxAttempts: 5, backoff: "exponential", interval: "2s" },
      timeout: "30s",
      compensation: "cancel_order",
      dependsOn: ["authorize_payment"],
    },
    {
      name: "generate_documents",
      retry: { maxAttempts: 10, backoff: "exponential", interval: "30s" },
      timeout: "5m",
      dependsOn: ["create_order"],
    },
  ],
};
The workflow definition is declarative. Engineers define steps, dependencies, retry policies, and compensations. The engine handles execution, state transitions, and failure recovery. This separates business logic (what to do) from infrastructure logic (how to coordinate it).
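One way the engine might turn the declarative retry config into concrete delays is sketched below. The linear and exponential formulas, and the use of full jitter (as described in the AWS backoff article referenced at the end), are assumptions for illustration rather than a specific engine's behavior.

```python
import random


def next_delay(backoff: str, base: float, attempt: int, jitter: bool = True) -> float:
    """Compute the delay before retry `attempt` (1-indexed).

    "linear": base, 2*base, 3*base, ...
    "exponential": base, 2*base, 4*base, ... (doubling per attempt)
    Full jitter samples uniformly in [0, delay] to spread out retry storms.
    """
    if backoff == "linear":
        delay = base * attempt
    elif backoff == "exponential":
        delay = base * (2 ** (attempt - 1))
    else:
        raise ValueError(f"unknown backoff strategy: {backoff}")
    return random.uniform(0, delay) if jitter else delay
```

Under these assumptions, `generate_documents` (exponential, 30s base) would wait up to 30s, 60s, 120s, and so on between attempts, while `authorize_payment` (linear, 1s base) stays within a few seconds, matching the aggressive-versus-patient split described above.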
Workflow Versioning
Long-running workflows span deployments. When the workflow definition changes between versions, in-flight workflows must continue executing under their original definition.
The solution is to version workflow definitions and pin each workflow instance to the version that started it. New workflows use the latest definition. Existing workflows complete under their original version.
This requires the engine to maintain multiple workflow definitions simultaneously and route execution based on the version stored with each workflow instance.
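The version-pinning scheme can be sketched as a small registry. The class and method names here are hypothetical; the point is only that an instance records the definition version active when it started and is routed by that version ever after.

```python
class DefinitionRegistry:
    """Illustrative registry pinning each workflow instance to a definition version."""

    def __init__(self):
        self._definitions: dict[int, dict] = {}
        self._latest = 0
        self._instance_versions: dict[str, int] = {}

    def register(self, version: int, definition: dict) -> None:
        self._definitions[version] = definition
        self._latest = max(self._latest, version)

    def start(self, instance_id: str) -> dict:
        # New workflows are pinned to the latest definition at start time.
        self._instance_versions[instance_id] = self._latest
        return self._definitions[self._latest]

    def definition_for(self, instance_id: str) -> dict:
        # In-flight workflows keep executing under their original version.
        return self._definitions[self._instance_versions[instance_id]]
```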
Design Tradeoffs
Centralized orchestration trades flexibility for observability. Every workflow is visible, debuggable, and auditable from a single system. The cost is that all workflow changes must go through the engine.
Orchestration over choreography. In a choreography model, services react to events and coordinate implicitly. This is flexible but makes it nearly impossible to answer "where is this workflow stuck?" Orchestration centralizes visibility at the cost of introducing a coordination dependency.
At-least-once delivery with idempotent workers over exactly-once semantics. Exactly-once delivery is expensive and fragile. At-least-once is simple and reliable, but requires every worker to handle duplicate task deliveries gracefully. This pushes idempotency responsibility to the edges, which is manageable with idempotency keys.
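A worker-side idempotency-key sketch, under the assumption that processed keys are recorded somewhere the worker can check; a real worker would store them durably (for example, a unique-keyed database row) rather than in memory.

```python
class IdempotentWorker:
    """Deduplicates at-least-once task deliveries by idempotency key."""

    def __init__(self, handler):
        self.handler = handler
        self._results: dict[str, object] = {}  # durable store in a real system

    def handle(self, idempotency_key: str, payload):
        # A duplicate delivery returns the recorded result without re-executing.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self.handler(payload)
        self._results[idempotency_key] = result
        return result
```

With this in place, the queue is free to deliver a task twice after a timeout: the side effect (a payment charge, an order creation) still happens exactly once.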
Persistent state over in-memory execution. Persisting every state transition adds latency (a database write per step transition) but enables crash recovery, audit trails, and workflow inspection. For workflows that take hours or days, this tradeoff is unambiguous.
Lessons Learned
Centralized orchestration trades flexibility for observability, and at scale, observability wins. Being able to query "show me all stuck workflows" across the entire system is worth the coordination overhead.
Retry budgets prevent more outages than circuit breakers alone. A global limit on retry volume protects downstream services from retry storms that per-call circuit breakers miss.
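The retry-budget idea can be sketched as a simple ratio check: retries are allowed only while they stay under some fraction of recent request volume. The 10% ratio and the counter-based accounting are assumptions; production implementations typically use sliding windows.

```python
class RetryBudget:
    """Global cap on retry volume as a fraction of total request volume."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def can_retry(self) -> bool:
        # Deny the retry when the budget is exhausted, breaking retry storms
        # that per-call circuit breakers never see in aggregate.
        if self.retries + 1 > self.requests * self.ratio:
            return False
        self.retries += 1
        return True
```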
Workflow versioning is not optional for any system with long-running processes. Plan for it from the beginning. Retrofitting versioning into an existing engine requires migrating all in-flight workflow state, which is painful.
The dead letter queue is the most operationally important component. When production issues arise, the DLQ is where you find the evidence. Invest in tooling that makes DLQ inspection and replay easy.
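Replay tooling can be as simple as the sketch below: select matching dead-lettered tasks, reset their attempt counters so they get a fresh retry budget, and push them back onto the work queue. The task shape and `matches` predicate are illustrative.

```python
def replay_dead_letters(dlq, requeue, matches=lambda task: True):
    """Move matching dead-lettered tasks back onto the work queue.

    Tasks that do not match the predicate stay in the DLQ for further inspection.
    """
    remaining, replayed = [], []
    for task in dlq:
        if matches(task):
            task["attempts"] = 0  # fresh retry budget for the replayed task
            requeue(task)
            replayed.append(task)
        else:
            remaining.append(task)
    dlq[:] = remaining  # keep only unreplayed tasks in the DLQ
    return replayed
```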
References
- How Temporal Works (Temporal)
- Inngest: Durable Workflow Engine (Inngest Documentation)
- Exponential Backoff And Jitter (AWS Architecture Blog)
- Conductor: Netflix's Workflow Orchestration Engine (Netflix Technology Blog)
- Google SRE Book: Handling Overload (Google SRE)