Retry Systems in Distributed Infrastructure
Every distributed system fails. Networks partition. Services crash. Databases hit capacity limits. The question is not whether failures happen but how the system responds when they do.
Retry logic is the first line of defense. It is also one of the most common sources of cascading failures when implemented poorly.
The Naive Approach
The simplest retry implementation is an immediate retry loop. When a request fails, send it again. When that fails, send it again. Continue until success or exhaustion.
This approach fails catastrophically under load. If a downstream service is struggling with capacity, immediate retries amplify the load. A service handling 1000 requests per second that fails 50% of them receives 1500 requests per second after a single round of retries, and the amplification compounds with each additional attempt. The failure rate increases. More retries follow. The service collapses.
This is a retry storm, and it is one of the most common causes of cascading outages in distributed systems.
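The amplification arithmetic above follows a geometric series: with base load B and per-attempt failure rate f, attempt k contributes B * f^k requests, so unlimited immediate retries drive the offered load toward B / (1 - f). A small sketch (the function name is illustrative):

```python
def offered_load(base_rps: float, failure_rate: float, max_attempts: int) -> float:
    """Offered load when every failure is retried immediately.

    Each attempt fails independently with probability `failure_rate`,
    so attempt k contributes base_rps * failure_rate**k requests/sec.
    """
    return sum(base_rps * failure_rate ** k for k in range(max_attempts))

# 1000 rps with a 50% failure rate:
#   2 attempts (one retry)   -> 1500 rps
#   many attempts approaches base / (1 - f) = 2000 rps
```

At a 50% failure rate the worst case is a doubling of load, but as the failure rate climbs toward 100%, the amplification from unbounded retries grows without limit.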
Exponential Backoff
The standard mitigation is exponential backoff. Instead of retrying immediately, the client waits an increasing duration between attempts, typically doubling each time and capped at a maximum: first retry after 100ms, then 200ms, then 400ms, and so on.
Exponential backoff reduces load on a struggling service, giving it time to recover. But it has a subtle problem: if many clients start retrying at the same time, their backoff schedules synchronize. They all retry at the same intervals, creating periodic load spikes.
Jitter
Adding randomness to the backoff interval, known as jitter, breaks this synchronization. Instead of waiting exactly 400ms before the third retry, the client waits a random duration between 0 and 400ms.
Full jitter provides the best load distribution. Each retry occurs at a uniformly random time within the backoff window. This smooths the retry load across time and prevents the thundering herd problem.
The combination of exponential backoff with full jitter is the baseline for any production retry system.
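That baseline can be sketched in a few lines. This is a minimal illustration, not a production library; the function names and the 100ms base / 10s cap are assumptions chosen to match the schedule described above:

```python
import random
import time

def backoff_with_full_jitter(attempt: int,
                             base_ms: float = 100.0,
                             cap_ms: float = 10_000.0) -> float:
    """Delay in milliseconds before retry `attempt` (1-indexed).

    Full jitter: pick uniformly in [0, min(cap, base * 2**(attempt-1))],
    so synchronized clients spread their retries across the window.
    """
    window = min(cap_ms, base_ms * 2 ** (attempt - 1))
    return random.uniform(0.0, window)

def call_with_retries(operation, max_attempts: int = 3):
    """Run `operation`, sleeping with backoff-plus-jitter between failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted: surface the last error to the caller
            time.sleep(backoff_with_full_jitter(attempt) / 1000.0)
```

Note that the jittered wait is drawn from the full window rather than added to a fixed delay; that is what gives full jitter its smoothing effect.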
Circuit Breakers
Retries help with transient failures but make sustained failures worse. If a service is down for an extended period, retries waste resources and delay error propagation.
Circuit breakers address this. The circuit breaker monitors failure rates. When failures exceed a threshold, it opens the circuit and immediately fails subsequent requests without attempting the call. After a cooldown period, it allows a limited number of probe requests through. If those succeed, the circuit closes and normal traffic resumes.
The state machine is simple:
- Closed: requests flow normally, failures are counted
- Open: requests fail immediately, no downstream calls
- Half-open: limited probe requests test recovery
Circuit breakers protect both the calling service, by failing fast, and the downstream service, by reducing load during recovery.
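The three-state machine above can be sketched as a small class. This is an illustrative skeleton under assumed defaults (5 failures to trip, 30s cooldown, one probe at a time), not a production implementation, which would also need thread safety and time-windowed failure counting:

```python
import time

class CircuitBreaker:
    """Minimal closed / open / half-open circuit breaker sketch."""

    def __init__(self, failure_threshold: int = 5,
                 cooldown_s: float = 30.0, probe_limit: int = 1):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.probe_limit = probe_limit
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.probes_in_flight = 0

    def allow_request(self) -> bool:
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooldown_s:
                self.state = "half-open"   # cooldown elapsed: probe recovery
                self.probes_in_flight = 0
            else:
                return False               # fail fast, no downstream call
        if self.state == "half-open":
            if self.probes_in_flight >= self.probe_limit:
                return False               # only a few probes at a time
            self.probes_in_flight += 1
        return True

    def record_success(self):
        self.state = "closed"              # recovery confirmed
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"            # trip: stop calling downstream
            self.opened_at = time.monotonic()
            self.failures = 0
```

The caller checks allow_request() before each call and reports the outcome with record_success() or record_failure(); a failed probe in the half-open state re-opens the circuit immediately.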
Retry Budgets
A more sophisticated approach is retry budgets. Instead of configuring retry behavior per call, the system allocates a global retry budget as a percentage of total requests.
For example, a retry budget of 10% means that retries cannot exceed 10% of the total request volume. If the system is sending 1000 requests per second and 100 retries per second, the budget is exhausted. Additional failures are not retried.
Retry budgets provide a global guarantee on retry amplification: regardless of failure rates, a 10% budget caps the extra load from retries at 10% of baseline traffic.
Distinguishing Failure Types
Not all failures deserve retries. A timeout might resolve on retry. A 400 Bad Request will fail every time. A 429 Too Many Requests tells you to back off.
Classify failures into:
- Retryable: timeouts, connection errors, 503 Service Unavailable
- Non-retryable: client errors, validation failures, authentication errors
- Throttled: rate-limit responses that include Retry-After headers
Retrying non-retryable errors wastes resources and delays error reporting to the caller. Always classify before retrying.
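A classification step like this is typically a small pure function in front of the retry loop. The sketch below is illustrative; the exact set of retryable status codes is an assumption that depends on the semantics of the service being called:

```python
from enum import Enum
from typing import Optional

class RetryClass(Enum):
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non_retryable"
    THROTTLED = "throttled"

def classify_http_failure(status: Optional[int],
                          is_timeout: bool = False,
                          is_connection_error: bool = False) -> RetryClass:
    """Map a failed HTTP call to a retry decision (illustrative)."""
    if is_timeout or is_connection_error:
        return RetryClass.RETRYABLE        # transient transport failures
    if status == 429:
        return RetryClass.THROTTLED        # back off per the Retry-After header
    if status in (502, 503, 504):
        return RetryClass.RETRYABLE        # upstream overload or restart
    return RetryClass.NON_RETRYABLE        # 4xx and everything else: surface it
```

Defaulting unknown failures to non-retryable is the conservative choice: an unclassified error that fails every time should reach the caller quickly rather than burn retry budget.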
Observability
Retry systems need dedicated observability. Track:
- Retry rates by service and endpoint
- Retry success rates (how often retries actually help)
- Circuit breaker state transitions
- Retry budget utilization
- Latency impact of retries on end-to-end request duration
A high retry rate combined with a low retry success rate indicates a sustained failure that retries cannot resolve; in that situation a circuit breaker should be opening. If retry rates are consistently high, the underlying reliability problem needs attention, not more retries.
Practical Guidelines
After operating retry systems across payment and transaction infrastructure:
- Start with exponential backoff and full jitter
- Set maximum retry counts conservatively, typically 2 to 3 attempts
- Implement circuit breakers for every external dependency
- Use retry budgets to cap amplification globally
- Never retry non-idempotent operations without idempotency keys
- Monitor retry metrics as primary system health indicators
- Test retry behavior under sustained failures, not just transient ones
Retry logic is simple to implement and difficult to get right. The difference between a system that recovers gracefully and one that collapses under failure is usually in these details.
References
- Exponential Backoff and Jitter, AWS Architecture Blog
- Circuit Breaker Pattern, Martin Fowler
- Handling Retries with Durable Execution, Temporal Blog
- Keeping Netflix Reliable Using Prioritized Load Shedding, Netflix Technology Blog
- Handling Overload, Google SRE Book