Why Resilience Patterns?
In a microservices architecture, service A calls service B which calls service C. If C is slow or failing: B’s threads block waiting for C. B’s thread pool fills up. A’s calls to B start failing too. The cascade continues up the chain: one failing downstream service takes down the entire system. Resilience patterns prevent this cascade. The three key patterns: Circuit Breaker (stop calling a failing service), Retry with backoff (handle transient failures), and Bulkhead (isolate failures to a pool).
Circuit Breaker
A circuit breaker wraps calls to an external service. Three states:

- CLOSED: normal operation; calls pass through.
- OPEN: failing; calls are rejected immediately without attempting the downstream call.
- HALF-OPEN: testing recovery; a limited number of probe requests are allowed through.

State transitions:

- CLOSED → OPEN: when the failure rate exceeds a threshold (e.g., 50% failures in the last 60 seconds, with a minimum of 20 requests).
- OPEN → HALF-OPEN: after a timeout (e.g., 30 seconds).
- HALF-OPEN → CLOSED: if probe requests succeed.
- HALF-OPEN → OPEN: if probe requests fail again.
```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    # Simplified: counts failures over the breaker's lifetime rather than over
    # a sliding time window, which a production implementation would use.
    def __init__(self, failure_threshold=0.5, timeout=30, min_requests=20):
        self.state = "CLOSED"
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.failure_threshold = failure_threshold  # e.g., 0.5 = trip at 50% failures
        self.timeout = timeout                      # seconds to stay OPEN
        self.min_requests = min_requests            # minimum sample size before tripping

    def call(self, fn, *args):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "HALF_OPEN"  # let a probe request through
            else:
                raise CircuitOpenError("Circuit is open")
        try:
            result = fn(*args)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "HALF_OPEN":
            # Probe succeeded: close the circuit and reset the counts.
            self.state = "CLOSED"
            self.failure_count = 0
            self.success_count = 0
        self.success_count += 1

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == "HALF_OPEN":
            # Probe failed: reopen immediately.
            self.state = "OPEN"
            return
        total = self.failure_count + self.success_count
        if total >= self.min_requests and self.failure_count / total >= self.failure_threshold:
            self.state = "OPEN"
```
Retry with Exponential Backoff and Jitter
Retry handles transient failures (a momentary network blip, a brief service restart). Naive retry, i.e. retrying immediately on failure, has a problem: if the service is overloaded, immediate retries add more load, a stampeding herd effect.

- Exponential backoff: wait 1s, then 2s, then 4s, then 8s (base * 2^attempt). This reduces load on the recovering service.
- Jitter: add random variance to the backoff, e.g. actual_wait = backoff * (0.5 + random() * 0.5). Without jitter, all clients back off by the same amount and retry simultaneously (a synchronized thundering herd); jitter spreads retries out.
- Retry only on retriable errors (5xx, timeouts). Never retry non-idempotent operations (e.g., POST payment) without idempotency keys.
- Cap retries at 3-5 attempts. After max retries: fail fast and return an error.
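The rules above can be sketched as a small retry helper. This is a minimal illustration, not a library API: `RetriableError` is a hypothetical exception standing in for your HTTP client's timeout and 5xx errors, and `sleep`/`rng` are injectable only to make the function easy to test.

```python
import random
import time


class RetriableError(Exception):
    """Hypothetical stand-in for a timeout or 5xx error."""


def retry_with_backoff(fn, max_retries=3, base_delay=1.0,
                       sleep=time.sleep, rng=random.random):
    """Call fn, retrying RetriableError with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RetriableError:
            if attempt == max_retries:
                raise  # out of retries: fail fast
            backoff = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            wait = backoff * (0.5 + rng() * 0.5)   # jitter: 50-100% of backoff
            sleep(wait)
```

Note that non-retriable errors propagate immediately, and the final failure is re-raised rather than swallowed, so the caller sees a real error after the retry budget is exhausted.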
Bulkhead Pattern
A bulkhead isolates failures to a limited resource pool, preventing a failing call from consuming all shared resources.

- Thread pool bulkhead: assign a dedicated thread pool to each downstream service. Service A’s calls to Service B use Pool B (10 threads); calls to Service C use Pool C (5 threads). If Service B hangs and all 10 threads in Pool B are blocked, calls to Service C still work. Without bulkheads, a single hung downstream service consumes every thread in the shared pool, blocking all other calls.
- Semaphore bulkhead: limit concurrent calls to a downstream service to N. If N calls are already in flight, reject new calls with a fallback response.

Hystrix (Netflix) popularized bulkheads; modern alternatives include Resilience4j (Java), Polly (.NET), and Envoy proxy (at the service mesh level).
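A semaphore bulkhead is straightforward to sketch. The class and error names below are illustrative, not from any particular library; the key detail is the non-blocking acquire, which rejects excess calls immediately instead of queueing them behind a hung service.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the bulkhead has no capacity left."""


class SemaphoreBulkhead:
    """Limit concurrent calls to a downstream service to max_concurrent."""

    def __init__(self, max_concurrent=5):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, fn, *args, fallback=None):
        # Non-blocking acquire: if the pool is exhausted, fail fast
        # (or return the fallback) rather than waiting for a permit.
        if not self._sem.acquire(blocking=False):
            if fallback is not None:
                return fallback()
            raise BulkheadFullError("Too many concurrent calls in flight")
        try:
            return fn(*args)
        finally:
            self._sem.release()
```

One bulkhead instance per downstream service gives the isolation described above: exhausting Service B's bulkhead has no effect on calls guarded by Service C's.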
Timeout
Every external call must have a timeout. Without a timeout, a hung downstream call blocks a thread indefinitely.

- Timeout values: start from the P99 response time of the downstream service * 1.5. Too short: too many false timeouts. Too long: resources stay blocked too long on real failures.
- Cascading timeouts: if A calls B which calls C, set the A→B timeout greater than the B→C timeout. This ensures B can respond to A with an error before A’s timer fires.
- Async timeouts: for async operations (e.g., message queue consumers), use a deadline propagated via message headers. If the message is processed after the deadline, discard the result.
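One common way to implement the cascading rule is to propagate an absolute deadline and give each hop a slightly smaller budget. A minimal sketch, assuming the upstream caller passed its deadline as epoch seconds (the `send_request` callable and `safety_margin` parameter are hypothetical):

```python
import time


def call_downstream(send_request, deadline, safety_margin=0.2):
    """Call a downstream service with whatever time budget remains.

    deadline is an absolute time (epoch seconds) received from the caller.
    The safety margin reserves time for this service to return an error
    to its own caller before the caller's timer fires.
    """
    budget = deadline - time.time() - safety_margin
    if budget <= 0:
        # Deadline already spent: fail fast instead of doing useless work.
        raise TimeoutError("Deadline exceeded before downstream call")
    return send_request(timeout=budget)
```

Each hop repeats this, so timeouts shrink naturally down the chain (A→B > B→C) without hand-tuning every pair.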
Fallback Strategies
When a circuit is open or a call times out, a fallback provides a degraded but functional response:

- return cached data (the last successful response for this request),
- return a default value (empty list, zero count),
- serve a static response (a pre-rendered HTML fallback page), or
- queue the request for later processing.

Fallback selection depends on the use case: a product recommendation fallback might return bestsellers instead of personalized recommendations, while a payment service fallback should never be a silent no-op: fail loudly. Design for partial functionality: if the review service is down, show the product page without reviews rather than an error page for the entire site.
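A fallback chain can be expressed as a small wrapper. This is an illustrative sketch (the helper name and the recommendation functions are made up for the example); note it re-raises when every level fails, honoring the "fail loudly" rule above rather than silently returning nothing.

```python
def with_fallback(primary, *fallbacks):
    """Try primary; on failure, try each fallback in order.

    Each argument is a zero-argument callable. If every level fails,
    the last exception is re-raised so the failure stays visible.
    """
    last_exc = None
    for fn in (primary, *fallbacks):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
    raise last_exc


def personalized():
    # Stand-in for a call through an open circuit.
    raise ConnectionError("recommendation service down")


def bestsellers():
    # Degraded but functional: generic results from a cache.
    return ["bestseller-1", "bestseller-2"]


result = with_fallback(personalized, bestsellers, lambda: [])
```

Here the empty list is the final default value, used only if both the personalized and bestseller sources fail.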