Designing Idempotent Payment Systems
Payment systems receive the same request more than once all the time. Networks drop. Clients retry. Load balancers reroute. If your system cannot handle duplicate requests gracefully, you will eventually charge someone twice.
Idempotency is not a feature. It is a correctness requirement for any system that processes financial transactions.
The Problem
Consider a simple payment authorization flow. A client sends a request to authorize a charge. The server processes it, debits the account, and returns a success response. But the response never reaches the client due to a network timeout.
The client retries. Without idempotency, the server processes the request again. The customer is now charged twice.
This is not a theoretical concern. At scale, duplicate transactions are inevitable. The question is whether your system handles them correctly.
Idempotency Keys
The standard approach is idempotency keys. The client generates a unique key for each logical operation and includes it with every request. The server uses this key to detect duplicates.
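On the client side, the important detail is that the key is generated once per logical operation and reused on every retry. The following is a minimal sketch of that behavior, not a production client. `FlakyTransport` and `authorize_with_retries` are hypothetical names invented for illustration, and the transport simply simulates lost responses.

```python
import uuid


class FlakyTransport:
    """Hypothetical transport that loses the first `failures` responses."""

    def __init__(self, failures):
        self.failures = failures
        self.seen_keys = []  # every key the "server" received, for inspection

    def send(self, payload, idempotency_key):
        self.seen_keys.append(idempotency_key)
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("response lost")
        return {"status": "authorized", "amount": payload["amount"]}


def authorize_with_retries(transport, payload, max_attempts=3):
    # Generate the key once, outside the retry loop, so that every
    # attempt for this logical operation carries the same key.
    key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return transport.send(payload, idempotency_key=key)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure to the caller
```

Generating the key inside the loop would defeat the purpose, because each retry would look like a new operation. This covers only the client's half; detecting the duplicate is the server's job.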
The implementation has three critical components:
- Key storage with atomic check-and-set semantics
- Request deduplication before any side effects
- Cached response replay for duplicate requests
The key insight is that deduplication must happen before any state mutation. If you debit an account and then check for duplicates, you have already caused harm.
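The three components and their ordering can be sketched in a few lines. This is an illustrative, single-process sketch only: `IdempotentProcessor` is a hypothetical name, the dictionary stands in for a durable key store, and the lock stands in for the store's atomic check-and-set.

```python
import threading


class IdempotentProcessor:
    """In-memory sketch; a real system would use a durable, shared store."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}   # key -> {"status": ..., "response": ...}
        self.balance = 1000  # hypothetical account balance

    def charge(self, key, amount):
        # 1. Atomic check-and-set: look up and claim the key under one lock.
        with self._lock:
            record = self._records.get(key)
            if record is None:
                self._records[key] = {"status": "processing", "response": None}

        # 2. Deduplicate before any side effect happens.
        if record is not None:
            if record["status"] == "complete":
                return record["response"]  # 3. replay the cached response
            return {"error": "request in progress, retry later"}

        # Only the request that claimed the key reaches the mutation.
        self.balance -= amount
        response = {"status": "charged", "amount": amount}
        with self._lock:
            self._records[key] = {"status": "complete", "response": response}
        return response
```

Calling `charge("k1", 100)` twice returns the same response both times and debits the balance once.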
Storage Design
Idempotency keys need a storage layer that supports atomic operations. A typical approach uses a dedicated table:
CREATE TABLE idempotency_keys (
    key        TEXT PRIMARY KEY,
    request    JSONB,
    response   JSONB,
    status     TEXT CHECK (status IN ('processing', 'complete', 'error')),
    created_at TIMESTAMP
);
When a request arrives, the system attempts an atomic insert. If the insert succeeds, the request is new. If it fails due to a conflict, the request is a duplicate.
For duplicates, the system checks the status. If complete, it replays the cached response. If still processing, it returns a conflict error telling the client to retry later.
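The insert-then-branch flow described above might look like the following sketch. It uses Python's built-in `sqlite3` module as a stand-in for the production database, so the JSON columns are stored as text. The function names `open_store`, `begin_request`, and `complete_request` are hypothetical, and the handling of the `error` status is an assumption, since no policy for it is specified here.

```python
import json
import sqlite3


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE idempotency_keys (
               key TEXT PRIMARY KEY,
               request TEXT,
               response TEXT,
               status TEXT CHECK (status IN ('processing', 'complete', 'error')),
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def begin_request(conn, key, request):
    """Return ("new", None), ("replay", response), ("conflict", None) or ("error", None)."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO idempotency_keys (key, request, status) "
                "VALUES (?, ?, 'processing')",
                (key, json.dumps(request)),
            )
        return ("new", None)
    except sqlite3.IntegrityError:
        # The primary key rejected the insert, so this key was seen before.
        status, response = conn.execute(
            "SELECT status, response FROM idempotency_keys WHERE key = ?", (key,)
        ).fetchone()
        if status == "complete":
            return ("replay", json.loads(response))
        if status == "processing":
            return ("conflict", None)
        return ("error", None)  # assumption: surfaced to the caller to decide


def complete_request(conn, key, response):
    with conn:
        conn.execute(
            "UPDATE idempotency_keys SET status = 'complete', response = ? WHERE key = ?",
            (json.dumps(response), key),
        )
```

The primary key constraint does the deduplication, so there is no window between a check and a write in which two requests could both appear new.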
Distributed Considerations
In a distributed system, idempotency becomes more complex. Requests may arrive at different nodes. The idempotency store must be centralized or use distributed consensus.
Most production systems use a centralized store like Redis or PostgreSQL for idempotency keys, with careful attention to:
- TTL policies to prevent unbounded growth
- Partition strategies for high throughput
- Consistency guarantees under node failures
The store itself becomes a critical dependency. If it goes down, you face a choice: reject all requests or risk duplicates. Most payment systems choose to reject, because correctness matters more than availability for financial transactions.
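Choosing to reject can be made explicit in the request handler. The sketch below is one way to fail closed, under the assumption that the store client raises a dedicated exception when it cannot be reached. Every name in it (`StoreUnavailable`, `ServiceUnavailable`, `MemoryStore`, `DownStore`, `handle_payment`) is hypothetical.

```python
class StoreUnavailable(Exception):
    """Raised by a store client when the idempotency store cannot be reached."""


class ServiceUnavailable(Exception):
    """Returned to the caller, for example as an HTTP 503."""


class MemoryStore:
    """Stand-in for a healthy store."""

    def __init__(self):
        self.keys = set()

    def claim(self, key):
        if key in self.keys:
            return False
        self.keys.add(key)
        return True


class DownStore:
    """Stand-in for a store that is unreachable."""

    def claim(self, key):
        raise StoreUnavailable("connection refused")


def handle_payment(store, key, process):
    try:
        is_new = store.claim(key)
    except StoreUnavailable as exc:
        # Fail closed: without the store we cannot rule out a duplicate,
        # so the payment is rejected rather than processed.
        raise ServiceUnavailable("idempotency store unavailable; retry later") from exc
    if not is_new:
        return "duplicate"
    return process()
```

The business logic in `process` never runs unless the key was claimed successfully, which is the property that makes rejection safe for the client to retry.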
Beyond Simple Deduplication
True idempotency goes beyond deduplication. The system must ensure that a retried request leaves the same observable state as a single execution of the original, with no additional side effects.
This means:
- Downstream API calls must also be idempotent or wrapped in idempotent layers
- Database operations must be designed for safe replay
- Event emissions must be deduplicated or consumers must handle duplicates
The entire transaction pipeline, from ingress to the last side effect, must be idempotent. A single non-idempotent step breaks the guarantee.
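Two common techniques for extending the guarantee down the pipeline are deriving a stable key for each downstream call from the original key, and having event consumers remember which event IDs they have already handled. The sketch below illustrates both under simplifying assumptions: `derive_key` and `DedupingConsumer` are hypothetical names, and the consumer keeps its seen set in memory rather than in durable storage.

```python
import hashlib


def derive_key(parent_key, step):
    """Deterministic key for one downstream step of a parent operation.

    A retry of the parent request produces the same child key, so the
    downstream service can deduplicate the call as well.
    """
    return hashlib.sha256(f"{parent_key}:{step}".encode()).hexdigest()


class DedupingConsumer:
    """Event consumer that ignores events it has already handled."""

    def __init__(self):
        self.seen = set()   # event IDs already processed
        self.handled = []   # events that were actually acted on

    def on_event(self, event):
        if event["event_id"] in self.seen:
            return False    # duplicate delivery, nothing to do
        self.seen.add(event["event_id"])
        self.handled.append(event)
        return True
```

For example, `derive_key("k1", "ledger")` and `derive_key("k1", "email")` give different keys for the two steps, while each stays the same across retries of `k1`.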
Failure Modes
The most dangerous failure mode is a partial completion. The system debits the account but crashes before writing the idempotency key. On retry, the key does not exist, so the system processes the request again.
The solution is to write the idempotency key and perform the state mutation in the same atomic transaction. In practice, this means using database transactions that include both the idempotency key insert and the business logic mutations.
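A small sketch can show why a single transaction closes the gap. It uses Python's built-in `sqlite3` module as a stand-in for the production database, with hypothetical table and function names, and a `crash` flag that simulates a failure after the mutation but before commit.

```python
import sqlite3


def setup():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.execute("CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute("INSERT INTO accounts VALUES ('acct_1', 1000)")
    conn.commit()
    return conn


def debit_once(conn, key, account_id, amount, crash=False):
    try:
        # One transaction: commit if the block succeeds, roll back on any exception.
        with conn:
            conn.execute(
                "INSERT INTO idempotency_keys (key, status) VALUES (?, 'complete')",
                (key,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, account_id),
            )
            if crash:
                raise RuntimeError("simulated crash before commit")
    except sqlite3.IntegrityError:
        return "duplicate"  # key already committed together with its debit
    return "applied"


def balance(conn, account_id):
    return conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()[0]
```

When the simulated crash occurs, both the key and the debit are rolled back, so the retry is processed as new and the account is debited once. There is no state in which the debit exists without its key.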
For operations that span multiple services, you need distributed transaction patterns like sagas with compensation logic, ensuring that partial failures can be safely rolled back or completed.
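The core of a saga is a list of steps, each paired with a compensating action that undoes it. The following is a deliberately small sketch of that idea, not a full orchestration framework: `run_saga` is a hypothetical helper, it runs in one process, and it does not persist progress, which a real implementation would need in order to survive crashes.

```python
def run_saga(steps):
    """Run (name, action, compensate) steps in order.

    If a step raises, the compensations of all previously completed
    steps run in reverse order and the failing step's name is returned.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo completed work, newest first. Compensations should
            # themselves be idempotent, since they may also be retried.
            for _, undo in reversed(completed):
                undo()
            return ("rolled_back", name)
        completed.append((name, compensate))
    return ("completed", None)
```

For instance, if a "reserve funds" step succeeds and a later "charge processor" step raises, the reservation's compensation runs and the saga reports `("rolled_back", "charge processor")`.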
Practical Recommendations
After building payment processing systems that handle millions of transactions, these are the practices that matter most:
- Always require idempotency keys for mutating operations
- Store keys with a TTL that outlasts the client's retry window, typically 24 to 48 hours
- Use atomic check-and-set, never check-then-set
- Design every downstream call to be idempotent
- Monitor duplicate rates as a system health metric
- Test failure modes explicitly, including partial completions
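Two of these recommendations, key expiry and duplicate-rate monitoring, are easy to prototype together. The sketch below is a single-threaded illustration with hypothetical names (`KeyStore`, `claim`, `purge_expired`, `duplicate_rate`). It takes an injectable clock so that expiry can be tested without waiting, and it is not safe for concurrent use as written.

```python
import time


class KeyStore:
    def __init__(self, ttl_seconds=24 * 3600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.keys = {}       # key -> creation time in seconds
        self.total = 0       # all claim attempts
        self.duplicates = 0  # attempts that hit a live key

    def claim(self, key):
        """Return True if the key is new or expired, False if it is a duplicate."""
        now = self.clock()
        self.total += 1
        created = self.keys.get(key)
        if created is not None and now - created < self.ttl:
            self.duplicates += 1
            return False
        self.keys[key] = now
        return True

    def purge_expired(self):
        """Delete keys older than the TTL and return how many were removed."""
        now = self.clock()
        expired = [k for k, created in self.keys.items() if now - created >= self.ttl]
        for k in expired:
            del self.keys[k]
        return len(expired)

    def duplicate_rate(self):
        return self.duplicates / self.total if self.total else 0.0
```

A sudden rise in `duplicate_rate()` often points at a client retry storm or a timeout misconfiguration, which is why it is worth tracking alongside error rates.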
Idempotency is foundational infrastructure. Get it wrong and you lose money and trust. Get it right and your system can handle the messy reality of distributed networks without compromising correctness.
References
- Designing robust and predictable APIs with idempotency - Stripe Engineering Blog
- Implementing Stripe-like Idempotency Keys in Postgres - Brandur Leach
- Sagas - Temporal Blog
- Avoiding Double Payments in a Distributed Payments System - Airbnb Engineering