Advanced Distributed Systems Concepts
Lesson 6.5
How to design for failure — circuit breakers and bulkheads
circuit breaker pattern, closed/open/half-open states, bulkhead isolation, retry with backoff, timeout configuration, cascading failure prevention
Why Services Fail Each Other
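In microservices, a slow downstream service causes upstream callers to pile up waiting for responses. Each waiting request holds a thread, so the caller's thread pool is eventually exhausted and the caller itself stops responding, which in turn stalls its own callers. This is a cascading failure. Circuit breakers stop it by failing fast instead of waiting.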
Circuit Breaker States
- Closed: requests flow normally. Track error rate.
- Open: error rate exceeded threshold. Requests fail immediately without calling the downstream service. Caller gets fast failure.
- Half-Open: after a timeout, allow a probe request. If it succeeds, return to Closed. If it fails, back to Open.
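The three states above can be sketched as a small state machine. This is a minimal illustration, not how any particular library implements it. It assumes a count of consecutive failures rather than a true error rate, it is not thread-safe, and the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `reset_timeout`, and `clock` are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it (hypothetical name)."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the downstream service.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let this call through as the probe.
            self.state = "half_open"
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # A success (including a successful probe) closes the circuit.
        self.state = "closed"
        self.failures = 0

    def _on_failure(self):
        self.failures += 1
        # A failed probe, or too many failures while closed, opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

A caller would wrap each downstream call, for example `breaker.call(lambda: client.get_balance())` where `client` is whatever object talks to the downstream service, and treat `CircuitOpenError` as the signal to return a fallback.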
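```java
// Circuit breaker with Resilience4j (Java); Try is from Vavr, processPayment() is assumed to return a String
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.vavr.control.Try;
import java.util.function.Supplier;

CircuitBreaker cb = CircuitBreaker.ofDefaults("payment-service");
Supplier<String> decoratedSupplier = CircuitBreaker
    .decorateSupplier(cb, paymentService::processPayment);
// While the breaker is open, calls are rejected with CallNotPermittedException
Try<String> result = Try.ofSupplier(decoratedSupplier)
    .recover(CallNotPermittedException.class, ex -> "fallback-response");
```

Bulkhead Pattern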
Isolate resources between services. Give each downstream dependency its own thread pool. If the payment service is slow and fills its thread pool, the search service's thread pool is unaffected.
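As a rough sketch of that isolation, the snippet below gives each dependency its own bounded `ThreadPoolExecutor` from Python's standard library. The dependency names, the pool sizes, and the `submit` helper are illustrative assumptions, not part of any framework.

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per downstream dependency (names and sizes are illustrative).
pools = {
    "payment": ThreadPoolExecutor(max_workers=2, thread_name_prefix="payment"),
    "search": ThreadPoolExecutor(max_workers=2, thread_name_prefix="search"),
}


def submit(dependency, fn, *args):
    """Run fn on the pool reserved for this dependency and return its Future."""
    return pools[dependency].submit(fn, *args)
```

If every `payment` worker is blocked on a slow call, `submit("search", ...)` still runs immediately because it uses different threads. Note that `ThreadPoolExecutor` queues excess work without a limit, so a fuller bulkhead would probably also cap the queue or reject work when the pool is busy, and callers should still pass a timeout to `Future.result()`.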
Retry with Exponential Backoff
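```python
import time

class TransientError(Exception):
    """A failure worth retrying, such as a timeout."""

class MaxRetriesExceeded(Exception):
    """Raised when every attempt has failed."""

def retry(fn, max_attempts=3, base_delay=0.1):
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt < max_attempts - 1:  # no point sleeping after the last attempt
                time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
    raise MaxRetriesExceeded()
```

Add jitter to backoff to prevent retry storms when many clients fail simultaneously.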
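One common way to add jitter, often called "full jitter", is to sleep for a random time between zero and the exponential delay, so clients that failed together do not retry together. The helper below is a sketch. `backoff_delay`, `max_delay`, and `rng` are hypothetical names, and the cap on the delay is an added assumption rather than something the lesson specifies.

```python
import random


def backoff_delay(attempt, base_delay=0.1, max_delay=5.0, rng=random):
    """Return a random delay in [0, min(max_delay, base_delay * 2**attempt)]."""
    cap = min(max_delay, base_delay * (2 ** attempt))
    return rng.uniform(0, cap)
```

In a retry loop, `time.sleep(backoff_delay(attempt))` would replace the fixed `base_delay * (2 ** attempt)` sleep.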
