SkillHub

Reliability and Resilience: Circuit Breakers, Retries, SLOs, and Failure Modes

Reliability isn’t about preventing failures — it’s about building systems that fail gracefully, recover quickly, and maintain user trust even when things go wrong. This post covers the patterns that keep systems running under degraded conditions.


The Resilience Toolkit

Timeout

Set a maximum time to wait for any external call. Without timeouts, a slow dependency causes your threads to pile up waiting, eventually exhausting your thread pool.

  • Connection timeout: how long to wait to establish a connection
  • Read timeout: how long to wait for data once connected
  • Overall timeout: max end-to-end time (often the most important)

Common mistake: Setting timeouts too tight causes spurious failures; too loose defeats the purpose. Start with 2× the dependency’s p99 latency, then tune based on observed behavior.
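A minimal sketch of enforcing an overall deadline by running the call in a worker thread (`call_with_timeout` and `slow_dependency` are illustrative names, not a library API):

```python
import concurrent.futures
import time

def call_with_timeout(fn, timeout_s, *args):
    """Run fn in a worker thread and enforce an overall deadline."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        # Raises concurrent.futures.TimeoutError if the deadline is exceeded
        return future.result(timeout=timeout_s)

def slow_dependency():
    time.sleep(0.2)  # stands in for a slow external call
    return "ok"

try:
    call_with_timeout(slow_dependency, 0.05)
except concurrent.futures.TimeoutError:
    print("timed out")
```

Note the caveat this sketch shares with many timeout mechanisms: the abandoned call keeps running in the background until it finishes on its own, so a timeout protects the caller, not the dependency.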

Retry

Automatically retry failed requests. Handles transient failures (network glitch, brief overload) without user visibility.

Retry only:

  • Idempotent operations — retrying GET /users/123 is safe; retrying POST /payments is not (unless you have idempotency keys)
  • Transient failures (500, 503, timeouts) — not client errors (400, 401, 404)

Retry with exponential backoff + jitter:

attempt 1: fail → wait 100ms
attempt 2: fail → wait 200ms + random(0-50ms)
attempt 3: fail → wait 400ms + random(0-100ms)
attempt 4: give up

Jitter prevents the “thundering herd” — all failed requests retrying simultaneously and hammering the recovering service.
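The schedule above can be sketched as a small retry loop (`TransientError` is a hypothetical stand-in for a 500/503/timeout from the dependency):

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retriable failure: 500, 503, or a timeout."""

def retry_with_backoff(call, max_attempts=4, base_delay=0.1):
    """Retry on transient failures with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise  # final attempt: give up and surface the failure
            delay = base_delay * (2 ** (attempt - 1))         # 100ms, 200ms, 400ms...
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter spreads the herd
```

Non-transient errors (400, 401, 404) are deliberately not caught here — they propagate immediately rather than being retried.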

Circuit Breaker

Tracks the failure rate of calls to a dependency. When failures exceed a threshold, “opens” the circuit — subsequent calls fail fast without hitting the dependency. After a cooldown period, allows a probe request through. If it succeeds, the circuit “closes.”

CLOSED (normal): calls pass through, failure rate tracked
  ↓ failure rate > threshold (e.g., 50% in 10s window)
OPEN (degraded): calls fail immediately, no network I/O
  ↓ after cooldown (e.g., 30s)
HALF-OPEN: one probe request allowed through
  ↓ probe succeeds → CLOSED
  ↓ probe fails → OPEN (reset cooldown)

Why it matters: Without a circuit breaker, calls to a failed dependency keep trying, consuming threads and resources. The circuit breaker provides fast failure, which allows the calling service to handle the failure gracefully (fallback, error to user) rather than hanging.

Resilience4j is the standard Java implementation. Configurable via Spring Boot starters.
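The state machine above can be sketched in a few dozen lines. This is a simplified, single-threaded version: it opens on consecutive failures rather than a failure-rate window, and allows exactly one probe in HALF-OPEN:

```python
import time

class CircuitBreaker:
    """Minimal sketch of the CLOSED -> OPEN -> HALF-OPEN cycle."""

    def __init__(self, failure_threshold=5, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"  # cooldown elapsed: allow one probe
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self.failures = 0          # probe or normal call succeeded
        self.state = "CLOSED"
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"    # trip (or re-trip after a failed probe)
            self.opened_at = time.monotonic()
            self.failures = 0
```

A production implementation like Resilience4j adds the sliding failure-rate window, thread safety, and metrics on top of this core cycle.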

Bulkhead

Isolates failures to a limited scope. Named after ship compartments that contain flooding.

Thread pool bulkhead: Each external dependency gets its own thread pool. If calls to the inventory service hang and fill its thread pool, calls to the user service still have their threads available.

Semaphore bulkhead: Limits concurrent calls to a dependency. Simpler than thread pools; less isolation but lower overhead.

Kubernetes resource limits: At the infrastructure level, setting resource requests/limits per service ensures one service’s memory leak doesn’t starve others.
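The semaphore variant can be sketched as follows — note that it rejects excess calls outright instead of queueing them, which is the point (illustrative class name, not a library API):

```python
import threading

class SemaphoreBulkhead:
    """Cap concurrent calls to one dependency; shed load instead of piling up."""

    def __init__(self, max_concurrent):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn):
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn()
        finally:
            self._sem.release()
```

Each dependency gets its own bulkhead instance, so a hang in one dependency exhausts only that dependency’s permits.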

Rate Limiting

Limit how many requests a caller can make within a time window. Protects services from being overwhelmed.

Apply at:

  • API gateway: Rate limit per API key, per IP, per user
  • Service level: Rate limit incoming requests before processing
  • Client level: The calling service respects rate limits from dependencies
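One common way to implement this is a token bucket, sketched here: tokens refill at a steady rate, requests spend a token, and bursts are bounded by the bucket capacity:

```python
import time

class TokenBucket:
    """Allow roughly `rate` requests/second with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should return 429 or queue
```

At a gateway you would keep one bucket per API key, IP, or user; the core accept/reject decision is the same.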

How Retries Make Outages Worse

This is the most important resilience failure mode to understand.

Scenario: Service B is slow (taking 5s per request instead of 100ms). Service A calls B with a 1s timeout and 3 retries.

  • A’s request takes 1s → timeout → retry → 1s → timeout → retry → 1s → final timeout
  • Each of A’s requests now consumes three attempts’ worth of B’s capacity instead of one
  • B is now receiving 3x its normal request volume
  • B gets slower (overloaded), A retries more, B gets slower…

This is a retry storm (or retry amplification). The retry behavior of clients under load amplifies the overload rather than relieving it.

Prevention:

  1. Exponential backoff + jitter — spread retry timing, reduce simultaneous retry bursts
  2. Circuit breaker — once failure rate is high, stop retrying and fail fast
  3. Max concurrency limits — a bulkhead prevents retry storms from consuming all available threads
  4. Retry budgets — at the system level, bound total retry volume (10% of calls may be retries; beyond that, fail)
  5. Idempotency + deduplication at the server — retries are safe because the server handles duplicates
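A retry budget (item 4) can be sketched with simple running counters — a real implementation would use a sliding window, but the accept/deny logic is the same:

```python
class RetryBudget:
    """System-level bound on retry volume: retries may be at most
    `ratio` of observed calls. Simplified: running counters, no window."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.calls = 0
        self.retries = 0

    def record_call(self):
        self.calls += 1

    def can_retry(self):
        # Deny once retries would exceed the budgeted fraction of calls
        return self.retries < self.ratio * max(self.calls, 1)

    def record_retry(self):
        self.retries += 1
```

The effect: under normal load, every transient failure gets retried; under system-wide overload, retries are capped and excess failures surface immediately instead of amplifying the outage.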

SLOs and Error Budgets

SLO (Service Level Objective): A target reliability level for your service. “99.9% of requests complete in < 200ms” or “99.5% availability per month.”

SLI (Service Level Indicator): The measurement that tracks whether you’re meeting the SLO. The actual latency or error rate.

SLA (Service Level Agreement): A contractual commitment, usually with financial consequences. SLOs are internal targets; SLAs are external commitments.

Error budget: The inverse of the SLO. If SLO is 99.9%, the error budget is 0.1% — the amount of “bad” time or requests you’re allowed per period.
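The arithmetic is worth internalizing — for an availability SLO, the budget over a period is:

```python
def error_budget_minutes(slo, days=30):
    """Downtime allowed per period for an availability SLO."""
    return (1 - slo) * days * 24 * 60

# 99.9% over 30 days allows (1 - 0.999) * 43200 = 43.2 minutes of downtime
print(round(error_budget_minutes(0.999), 1))   # 43.2
print(round(error_budget_minutes(0.995), 1))   # 216.0
```

That is the concrete stake in the conversation below: 43 minutes a month is the entire allowance for deploys gone wrong, dependency outages, and experiments combined.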

Why error budgets change behavior:

  • When the error budget is healthy → teams can ship faster (spending budget on experiments)
  • When the error budget is depleted → reliability work takes priority over features
  • This creates an automatic, objective-driven conversation between product and engineering. The SLO is the shared goal; the error budget is the operational dashboard.

Setting SLOs: Start with user-observable outcomes. “Can the user complete checkout?” is a meaningful SLO. “Is the recommendation service responding?” is a component metric, not a user-facing SLO. Aggregate from user journeys down to components.


Graceful Degradation

When a dependency fails, the system should degrade gracefully rather than fail completely.

Pattern: For each dependency, define what “no dependency” behavior looks like:

  • Recommendations service is down → show popular items (static fallback)
  • Personalization service is down → show generic content
  • Inventory service is slow → proceed with order, validate inventory async (accept the risk)
  • Auth cache is unavailable → route to auth service directly (slower, not broken)

Feature flags for dependencies: If a dependency is unreliable, wrap its calls in a feature flag. When it degrades, disable the flag — users don’t see the feature, but the core system stays up.
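The fallback pattern above is often implemented as a wrapper around each dependency call. A minimal sketch (the recommendation functions are hypothetical examples, matching the first bullet above):

```python
def with_fallback(primary, fallback):
    """Call the dependency; on failure, degrade to the fallback instead of erroring."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            return fallback(*args, **kwargs)
    return wrapped

# Hypothetical: recommendations service is down, degrade to popular items
def fetch_recommendations(user_id):
    raise ConnectionError("recommendations service unavailable")

def popular_items(user_id):
    return ["item-1", "item-2", "item-3"]  # static, cached fallback

get_recommendations = with_fallback(fetch_recommendations, popular_items)
```

Circuit breaker libraries typically offer the same hook (Resilience4j calls it a fallback), so the breaker’s fast failure and the degraded response compose naturally.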


Poison Message Handling

A “poison message” is a message in a queue that causes the consumer to fail every time it processes it. Without handling, the consumer retries indefinitely, blocking all subsequent messages.

Solution: Dead Letter Queue (DLQ)

Configure a maximum number of delivery attempts (e.g., 5). After 5 failures, move the message to a DLQ. The main consumer processes normally; the DLQ holds messages for investigation.
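The consumer-side logic can be sketched as below. Real brokers (SQS, RabbitMQ, Kafka with a DLQ topic) track delivery attempts and do the move for you; this just shows the shape, with the DLQ as a plain list:

```python
MAX_ATTEMPTS = 5

def consume(message, handler, dlq):
    """Process a message; after MAX_ATTEMPTS failures, park it on the DLQ
    with error context so the rest of the queue keeps flowing."""
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return handler(message)
        except Exception as exc:
            last_error = repr(exc)  # keep context for the DLQ entry
    dlq.append({"message": message, "attempts": MAX_ATTEMPTS, "error": last_error})
```

The crucial property: the function returns after parking the message instead of raising, so a poison message never blocks its successors.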

Required practices:

  1. Alert on DLQ depth — a non-empty DLQ is always worth investigating
  2. Inspect and replay or discard from DLQ deliberately
  3. Include correlation IDs and error context in the DLQ message
  4. Audit — “what messages have we failed to process?” has compliance implications

What to investigate when a message is in the DLQ:

  • Bug in the consumer code (most common — schema change broke deserialization)
  • Invalid data in the message (upstream published a malformed event)
  • Transient dependency failure that became permanent (the DB it needed is gone)

Active-Active vs Active-Passive Multi-Region

Active-Passive:

  • One region handles all traffic (active)
  • Second region is on standby, ready to take over
  • Failover requires: detecting failure, promoting the passive region (DNS change, routing update), warming up caches
  • Simpler to operate, but failover takes minutes
  • Stale data in passive region if replication lag exists

Active-Active:

  • Both regions handle traffic simultaneously
  • Users routed to their nearest region
  • Writes must replicate between regions — consistency challenge
  • Any write in region A must be visible to region B readers within an acceptable window
  • Conflict resolution needed if both regions write the same record simultaneously

When active-active is worth it:

  • Global user base where cross-region latency hurts (US + EU + APAC)
  • Zero-downtime requirement — any single region failure is instantly absorbed by others
  • Compliance: data residency requirements may require certain users’ data to stay in a region

When it’s not:

  • You have predominantly single-region users
  • The consistency complexity (conflict resolution, replication lag handling) outweighs the availability benefit
  • Most teams overestimate the availability gap between a well-run active-passive and active-active

Middle ground: Active-passive with pre-warmed standby (cache primed, DB replica ready, smoke tests running) and automated failover < 2 minutes. This handles 95% of DR requirements without the consistency complexity of active-active.