Idempotency Is the Most Important Concept in Distributed Systems
A background job retried after a timeout.
The retry charged the customer again.
The payment system had no idempotency protection.
We refunded the duplicate charge within an hour.
But the real problem was not the refund.
It was discovering how many other places in the system could duplicate work.
Retries are inevitable in distributed systems.
Workers crash. Networks fail. Tasks time out. Queues redeliver messages.
If the system is not idempotent, the result is duplicated work: duplicate payments, duplicate orders, duplicate emails, duplicate messages.
Idempotency (the guarantee that an operation produces the same result whether executed once or many times) is not a feature. It is a correctness requirement.
Idempotency is the most broadly applicable concept in distributed systems engineering.
And most systems implement it in exactly one place.
Then forget it everywhere else.
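The definition is easiest to see in miniature: setting a value is idempotent, incrementing one is not. A small illustrative in-memory example (names here are invented for illustration):

```python
# Hypothetical in-memory illustration, not code from any real system
state = {"balance": 0}

def set_balance(amount):
    # Idempotent: applying it twice leaves the same state as applying it once
    state["balance"] = amount

def add_to_balance(amount):
    # Not idempotent: every retry moves the state again
    state["balance"] += amount

set_balance(100)
set_balance(100)    # a retry is harmless
assert state["balance"] == 100

add_to_balance(25)
add_to_balance(25)  # a retry double-counts
assert state["balance"] == 150
```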
Idempotency Beyond Payments
The payment use case is well understood: use idempotency keys to prevent duplicate charges. But idempotency applies everywhere that retries occur.
Duplicate execution shows up in subtle ways:
- Customers charged twice
- Emails sent twice
- Inventory decremented twice
- Metrics double-counted
- Workflows triggering the same action repeatedly
None of these failures look like distributed systems bugs. They look like product bugs. But they all have the same root cause: non-idempotent operations.
Distributed systems do not fail once.
They fail, retry, and fail again.
Correct systems must produce the same result every time.
Event Consumers
Event queues deliver messages at least once. Consumer crashes, network timeouts, and partition rebalancing all cause redelivery. An event consumer that inserts a row on each delivery creates duplicates. An event consumer that increments a counter on each delivery inflates the count.
# Non-idempotent consumer - breaks on redelivery
def handle_order_event(event):
    db.execute("INSERT INTO orders VALUES (%s, %s)",
               (event.order_id, event.amount))
    # Duplicate delivery → duplicate row

# Idempotent consumer - safe on redelivery
def handle_order_event(event):
    db.execute("""
        INSERT INTO orders (id, amount)
        VALUES (%s, %s)
        ON CONFLICT (id) DO NOTHING
    """, (event.order_id, event.amount))
    # Duplicate delivery → no-op
Workflow Tasks
Workflow engines retry failed tasks. A task that sends an email, calls an API, or updates a database must produce the same outcome on retry. If a task creates a resource, the retry must detect the existing resource instead of creating a duplicate.
class CreateUserTask:
    def execute(self, workflow_context):
        existing = self.user_repo.find_by_email(
            workflow_context["email"]
        )
        if existing:
            return existing  # Idempotent: return the existing user
        user = self.user_repo.create(
            email=workflow_context["email"],
            name=workflow_context["name"],
        )
        return user
Data Pipeline Stages
Pipeline stages that are retried must produce the same output regardless of how many times they run. The standard approach is to make each stage fully replace its output rather than append to it.
# Non-idempotent: appends on retry
def transform_daily_orders(date):
    orders = extract_orders(date)
    transformed = transform(orders)
    db.execute("INSERT INTO warehouse.orders ...", transformed)
    # Retry → duplicate rows

# Idempotent: replaces partition on retry
def transform_daily_orders(date):
    orders = extract_orders(date)
    transformed = transform(orders)
    db.execute("DELETE FROM warehouse.orders WHERE date = %s", (date,))
    db.execute("INSERT INTO warehouse.orders ...", transformed)
    # Retry → same result
AI Agent Actions
AI agents reintroduce the same failure mode in a new form. The only difference is that the retries are now hidden behind model-driven workflows.
Agents call tools. Tools execute side effects. If the agent crashes before recording the result, the workflow engine retries the step.
The tool executes again.
A tool that sends a message sends it twice. A tool that creates a calendar event creates two events. A tool that writes to a database writes duplicate records.
// Idempotent tool execution for AI agents
async function executeToolIdempotent(
  workflowId: string,
  stepId: string,
  tool: Tool,
  args: unknown
) {
  const key = `${workflowId}:${stepId}:${tool.name}`;
  const cached = await idempotencyStore.get(key);
  if (cached) return cached.result;

  const result = await tool.execute(args);
  await idempotencyStore.set(key, { result, timestamp: Date.now() });
  return result;
}
Every system that executes actions automatically -- event consumers, workflow tasks, pipeline stages, AI tool calls -- needs idempotency guarantees. The mechanism varies but the principle is universal: an operation executed twice must produce the same result as an operation executed once. Reliable automation requires idempotent actions.
The Three Patterns
Most idempotency implementations use one of three patterns:
Natural idempotency. Some operations are inherently idempotent. Setting a value (PUT) is idempotent. Reading data is idempotent. Deleting a specific record is idempotent. These require no additional infrastructure.
Idempotency keys. The caller generates a unique key for each logical operation. The server checks the key before processing. If the key exists, the server returns the cached result. This is the standard approach for mutating API endpoints.
Conditional writes. The operation includes a precondition that ensures it only executes once. INSERT ... ON CONFLICT DO NOTHING. UPDATE ... WHERE version = N. DELETE ... WHERE status = 'pending'. The database enforces the idempotency constraint.
-- Pattern 1: Natural idempotency (no extra work)
UPDATE users SET email = 'new@example.com' WHERE id = 123;
-- Pattern 2: Idempotency key (explicit dedup)
INSERT INTO idempotency_keys (key, result)
VALUES ('req-abc-123', '{"status": "ok"}')
ON CONFLICT (key) DO NOTHING;
-- Pattern 3: Conditional write (precondition)
UPDATE orders SET status = 'shipped'
WHERE id = 456 AND status = 'processing';
-- Only executes once: second attempt finds
-- status = 'shipped', not 'processing'
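The request flow for pattern 2 can be sketched end to end. This is a minimal in-memory sketch with illustrative names: the `_idempotency_keys` dict stands in for the `idempotency_keys` table above.

```python
import json

# In-memory stand-in for the idempotency_keys table (illustrative only)
_idempotency_keys = {}

def handle_request(idempotency_key, operation, *args):
    """Return the cached result for a seen key; otherwise run and record."""
    if idempotency_key in _idempotency_keys:
        return json.loads(_idempotency_keys[idempotency_key])  # replay
    result = operation(*args)
    _idempotency_keys[idempotency_key] = json.dumps(result)
    return result

calls = []

def create_order(amount):
    calls.append(amount)  # the side effect we must not repeat
    return {"order_id": "ord-1", "amount": amount}

first = handle_request("req-abc-123", create_order, 50)
retry = handle_request("req-abc-123", create_order, 50)  # replayed, not re-run
assert first == retry
assert len(calls) == 1  # side effect happened exactly once
```

Note that the check and the write here are not atomic; under concurrency they must share one transaction, which is exactly the requirement the next section covers.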
The Atomic Transaction Requirement
The most common idempotency bug is a gap between the operation and the dedup record. If the operation completes but the dedup record is not written (due to a crash, network error, or separate transaction), the retry will not find the dedup record and will execute the operation again.
The fix is simple in principle: the operation and the dedup record must be in the same atomic transaction.
# WRONG: separate operations
def process_payment(idempotency_key, amount):
    charge = stripe.charge(amount)       # Step 1
    db.insert_idem_key(idempotency_key)  # Step 2
    # Crash between Step 1 and Step 2
    # → retry charges again

# RIGHT: atomic transaction
def process_payment(idempotency_key, amount):
    with db.transaction() as tx:
        if tx.idem_key_exists(idempotency_key):
            return tx.get_cached_result(idempotency_key)
        charge = stripe.charge(amount)
        tx.insert_idem_key(idempotency_key, charge)
        tx.insert_payment_record(charge)
    # All or nothing. Crash rolls back everything.
In practice this is harder when operations span multiple systems (a database and an external API). The external API call cannot be rolled back. The solution is to make the external call idempotent independently (using the external service's own idempotency key) and record the result atomically with the dedup key.
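A sketch of that combination, with a fake payment provider standing in for the external API (Stripe's real API does accept a caller-supplied idempotency key; the helper names here are illustrative):

```python
class FakeStripe:
    """Stand-in for an external payment API that accepts a caller-supplied
    idempotency key, as Stripe's real API does."""
    def __init__(self):
        self.charges = {}

    def charge(self, amount, idempotency_key):
        # The provider dedupes on its side: retrying with the same key
        # returns the original charge instead of creating a new one.
        if idempotency_key not in self.charges:
            charge_id = f"ch_{len(self.charges) + 1}"
            self.charges[idempotency_key] = {"id": charge_id, "amount": amount}
        return self.charges[idempotency_key]

def process_payment(store, stripe, idempotency_key, amount):
    if idempotency_key in store:  # local dedup record
        return store[idempotency_key]
    # The same key goes to the external service, so a crash after the
    # charge but before the local record is written stays safe: the retry
    # re-hits the same charge instead of creating a second one.
    charge = stripe.charge(amount, idempotency_key=idempotency_key)
    store[idempotency_key] = charge  # record the result with the key
    return charge

stripe = FakeStripe()
store = {}
first = process_payment(store, stripe, "pay-123", 50)
store.clear()  # simulate a crash before the local record survived
retry = process_payment(store, stripe, "pay-123", 50)
assert retry == first
assert len(stripe.charges) == 1  # the customer was charged exactly once
```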
Why Most Systems Get It Wrong
Idempotency is conceptually simple and operationally difficult. Teams know they need it. They add idempotency keys to their API layer. Then they forget about:
- Event consumers that process queue messages without dedup
- Workflow tasks that create resources without checking for existing ones
- AI agents that retry tool calls without tracking previous executions
- Pipeline stages that append rather than replace
- Background jobs and webhook handlers that run without checking previous results
Each of these is a retry boundary. Each needs idempotency guarantees. The system is only as reliable as its weakest boundary.
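A webhook handler, for example, can dedupe on the delivery ID most providers attach to each event. An illustrative sketch; in production the seen-set would be a unique-keyed table written in the same transaction as the side effects:

```python
# Illustrative only: process memory loses state on restart, so a real
# implementation persists the seen IDs alongside the side effects.
processed = set()

def handle_webhook(event):
    """Dedupe on the provider's event ID before doing any side effects."""
    if event["id"] in processed:
        return "already handled"  # redelivery: no-op
    processed.add(event["id"])
    # ... side effects: create the ticket, send the email, etc. ...
    return "handled"

assert handle_webhook({"id": "evt_1"}) == "handled"
assert handle_webhook({"id": "evt_1"}) == "already handled"  # redelivered
```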
In production, duplicate execution is rarely obvious.
The first signal is often subtle: an email sent twice, a support ticket created twice, a webhook delivered twice. Teams often treat these as isolated bugs.
They are not. They are symptoms of missing idempotency guarantees.
As more work moves to automated platforms and agent-driven workflows, the surface area for duplicate execution grows. Every tool call an agent makes, every task a workflow retries, every event a consumer reprocesses is a place where the system must guarantee exactly-once effect.
Distributed systems retry by default.
Distributed systems do not guarantee actions run once. They guarantee actions may run many times.
Queues redeliver. Workflow engines retry. Background jobs restart. Agents repeat tool calls.
Systems that assume actions run once eventually corrupt data.
Reliable systems assume every action might run twice.
Then design so it does not matter.
References
- Designing robust and predictable APIs with idempotency · Stripe Engineering Blog
- Implementing Stripe-like Idempotency Keys in Postgres · Brandur Leach
- How Temporal Works · Temporal
- Exactly-once Semantics is Possible · Confluent Blog
- Exponential Backoff And Jitter · AWS Architecture Blog