System Design: Circuit Breaker, Retry, and Bulkhead — Resilience Patterns for Microservices

Why Resilience Patterns?

In a microservices architecture, service A calls service B, which calls service C. If C is slow or failing, B's threads block waiting on C, B's thread pool fills up, and A's calls to B start failing too. The cascade continues up the chain: one failing downstream service takes down the entire system. Resilience patterns prevent this cascade. The three key patterns are Circuit Breaker (stop calling a failing service), Retry with backoff (handle transient failures), and Bulkhead (isolate failures to a dedicated resource pool).

Circuit Breaker

A circuit breaker wraps calls to an external service and tracks their outcomes. It has three states:

- CLOSED (normal operation): calls pass through.
- OPEN (failing): calls are rejected immediately, without attempting the downstream call.
- HALF-OPEN (testing recovery): a limited number of probe requests are allowed through.

State transitions:

- CLOSED → OPEN: when the failure rate exceeds a threshold (e.g., 50% failures in the last 60 seconds, with a minimum of 20 requests).
- OPEN → HALF-OPEN: after a timeout (e.g., 30 seconds).
- HALF-OPEN → CLOSED: if the probe requests succeed.
- HALF-OPEN → OPEN: if the probe requests fail again.

import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, timeout=30, min_requests=20):
        self.state = "CLOSED"
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.failure_threshold = failure_threshold  # failure rate that trips the breaker
        self.timeout = timeout                      # seconds to stay OPEN before probing
        self.min_requests = min_requests            # minimum sample size before tripping

    def call(self, fn, *args):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "HALF_OPEN"  # timeout elapsed: allow a probe through
            else:
                raise CircuitOpenError("Circuit is open")
        try:
            result = fn(*args)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "HALF_OPEN":
            # Probe succeeded: close the circuit and reset the counters.
            self.state = "CLOSED"
            self.failure_count = 0
            self.success_count = 0
        self.success_count += 1

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == "HALF_OPEN":
            # Probe failed: reopen immediately.
            self.state = "OPEN"
            return
        # Simplified: cumulative counts, not the true 60-second sliding
        # window described above.
        total = self.failure_count + self.success_count
        if total >= self.min_requests and self.failure_count / total >= self.failure_threshold:
            self.state = "OPEN"

Retry with Exponential Backoff and Jitter

Retry handles transient failures (a momentary network blip, a brief service restart). Naive retry retries immediately on failure; the problem is that if the service is overloaded, immediate retries add more load (a stampeding-herd effect). Exponential backoff waits 1s, then 2s, then 4s, then 8s (base * 2^attempt), reducing load on the recovering service. Jitter adds random variance to the backoff: actual_wait = backoff * (0.5 + random() * 0.5). Without jitter, all clients back off by the same amount and retry simultaneously (a synchronized thundering herd); jitter spreads retries out. Retry only on retriable errors (5xx, timeouts). Never retry non-idempotent operations (e.g., POST payment) without idempotency keys. Cap retries at 3-5; after that, fail fast and return an error.
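The retry policy above can be sketched in a few lines. This is a minimal illustration, not a production library: the function name `retry_with_backoff` and the choice of retriable exception types are assumptions for the example.

```python
import random
import time


def retry_with_backoff(fn, max_retries=3, base=1.0, cap=30.0,
                       retriable=(TimeoutError, ConnectionError)):
    """Call fn(), retrying retriable errors with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retriable:
            if attempt == max_retries:
                raise  # retries exhausted: fail fast
            backoff = min(cap, base * 2 ** attempt)          # 1s, 2s, 4s, ...
            wait = backoff * (0.5 + random.random() * 0.5)   # jitter: 50-100% of backoff
            time.sleep(wait)
```

Note that only the listed exception types are retried; anything else (a 4xx-equivalent, a programming error) propagates immediately, matching the "retry only on retriable errors" rule.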

Bulkhead Pattern

A bulkhead isolates failures to a limited resource pool, preventing a failing call from consuming all shared resources.

Thread pool bulkhead: assign a dedicated thread pool to each downstream service. Service A's calls to Service B use Pool B (10 threads); calls to Service C use Pool C (5 threads). If Service B hangs and all 10 threads in Pool B are blocked, calls to Service C still work, because they use Pool C. Without bulkheads, a single hung downstream service consumes every thread in the shared pool, blocking all other calls.

Semaphore bulkhead: limit concurrent calls to a downstream service to N. If N calls are already in flight, reject new calls with a fallback response.

Hystrix (Netflix) popularized bulkheads; modern alternatives include Resilience4j (Java), Polly (.NET), and Envoy proxy (at the service-mesh level).
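A semaphore bulkhead is simple enough to sketch directly. This is an illustrative, in-process version; the class name `SemaphoreBulkhead` and the fallback-callable interface are assumptions for the example.

```python
import threading


class SemaphoreBulkhead:
    """Cap in-flight calls to a downstream service at max_concurrent."""

    def __init__(self, max_concurrent):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, fn, *args, fallback=None):
        # Non-blocking acquire: if the bulkhead is full, reject immediately
        # instead of queueing the caller behind a possibly hung service.
        if not self._sem.acquire(blocking=False):
            if fallback is not None:
                return fallback()
            raise RuntimeError("Bulkhead full: too many concurrent calls")
        try:
            return fn(*args)
        finally:
            self._sem.release()
```

Rejecting rather than queueing is the point of the pattern: queued callers would still tie up the caller's own threads, which is exactly the resource exhaustion the bulkhead exists to prevent.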

Timeout

Every external call must have a timeout; without one, a hung downstream call blocks a thread indefinitely. A reasonable starting value is the downstream service's P99 response time * 1.5. Too short, and you get too many false timeouts; too long, and resources stay blocked on real failures. Cascading timeouts: if A calls B, which calls C, set the A→B timeout greater than the B→C timeout. This ensures B can respond to A with an error before A's timer fires. Async timeouts: for async operations (e.g., message queue consumers), propagate a deadline via message headers; if the message is processed after the deadline, discard the result.
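The cascading-timeout idea can be expressed as deadline propagation: each hop derives its per-call timeout from a shared absolute deadline, reserving a small margin for its own error handling. A minimal sketch, assuming the deadline is an epoch timestamp carried in a request or message header, and `fn` is any client call accepting a `timeout` parameter:

```python
import time


def call_with_deadline(fn, deadline, margin=0.05):
    """Call fn with a timeout derived from a propagated absolute deadline.

    The margin reserves time for this service to return an error upstream
    before the caller's own timer fires (keeping A->B timeout > B->C timeout).
    """
    budget = (deadline - time.time()) - margin
    if budget <= 0:
        # Deadline already passed: don't waste a downstream call whose
        # result the caller can no longer use.
        raise TimeoutError("Deadline exceeded before downstream call")
    return fn(timeout=budget)
```

Each service in the chain applies the same rule, so timeouts shrink naturally at every hop instead of being hand-tuned pairwise.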

Fallback Strategies

When a circuit is open or a call times out, a fallback provides a degraded but functional response: return cached data (last successful response for this request), return a default value (empty list, zero count), serve a static response (pre-rendered HTML fallback page), or queue the request for later processing. Fallback selection depends on the use case: a product recommendation fallback might return bestsellers instead of personalized recommendations. A payment service fallback should never be a silent no-op — fail loudly. Design for partial functionality: if the review service is down, show the product page without reviews rather than showing an error page for the entire site.
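The "return cached data" fallback can be sketched as a decorator that remembers the last successful response per argument set. This is an illustrative sketch only: the decorator name is an assumption, the cache is an unbounded in-process dict, and production code would want a bounded cache with TTLs and staleness metrics.

```python
def with_cached_fallback(fn):
    """On failure, serve the last successful response for the same arguments."""
    cache = {}

    def wrapper(*args):
        try:
            result = fn(*args)
            cache[args] = result  # remember the last good response
            return result
        except Exception:
            if args in cache:
                return cache[args]  # degraded: possibly stale, but functional
            raise  # no cached value: fail loudly rather than invent data

    return wrapper
```

Note the last branch: when there is nothing sensible to fall back to, the error propagates, which is exactly the "never a silent no-op" rule for payment-like operations.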

