
Microservices Patterns: Saga, CQRS, Event Sourcing, BFF, and More

Microservices patterns are the vocabulary of distributed systems design. Knowing when to apply each one — and when not to — separates an architect who reads pattern books from one who’s shipped production systems.


Saga Pattern

Problem: A business transaction spans multiple services, each with its own database. You can’t use a distributed ACID transaction.

Solution: A saga is a sequence of local transactions. Each step publishes an event or triggers the next step. If a step fails, compensating transactions undo previous steps.

Choreography-based saga: Services react to events — no central coordinator.

1. OrderService: creates order → publishes OrderCreated
2. InventoryService: listens → reserves stock → publishes StockReserved
3. PaymentService: listens → charges card → publishes PaymentCompleted
4. OrderService: listens → confirms order

Failure at step 3:
3. PaymentService: charge fails → publishes PaymentFailed
2. InventoryService: listens → releases reservation → publishes StockReleased
1. OrderService: listens → cancels order

Orchestration-based saga: A saga orchestrator (a service or workflow engine) explicitly coordinates each step.

SagaOrchestrator:
  step 1: call InventoryService.reserve() → success
  step 2: call PaymentService.charge()   → fails
  step 3: call InventoryService.release() (compensate)
  → return failure
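The orchestrator above can be sketched in Python. The service classes and method names here are hypothetical stand-ins; a real orchestrator would also persist saga state so it can resume after a crash:

```python
class SagaFailed(Exception):
    pass

def run_order_saga(inventory, payment, order_id):
    """Run each step; on failure, run compensations in reverse order."""
    compensations = []
    try:
        inventory.reserve(order_id)
        compensations.append(lambda: inventory.release(order_id))
        payment.charge(order_id)
        compensations.append(lambda: payment.refund(order_id))
        return "COMPLETED"
    except SagaFailed:
        for undo in reversed(compensations):
            undo()  # compensations must be idempotent: events may be redelivered
        return "ROLLED_BACK"

# Stub services for illustration only.
class Inventory:
    def __init__(self):
        self.reserved = set()
    def reserve(self, oid):
        self.reserved.add(oid)
    def release(self, oid):
        self.reserved.discard(oid)  # discard, not remove: safe to re-run

class Payment:
    def charge(self, oid):
        raise SagaFailed("card declined")
    def refund(self, oid):
        pass

inv = Inventory()
print(run_order_saga(inv, Payment(), 123))  # ROLLED_BACK, reservation released
```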

When to use which:

  • Choreography: fewer services, loose coupling desired, simple failure paths
  • Orchestration: many services, complex failure compensation, need visibility into saga state

Real pitfalls:

  • Compensating transactions must be idempotent. The network might redeliver a compensation event.
  • Partial failures are hard to reason about. What if the compensation itself fails?
  • Visibility: Where is the saga in its lifecycle? Orchestration is much easier to observe.
  • Saga state must be persisted — if the orchestrator crashes mid-saga, it must be resumable.

Tooling: Temporal.io, AWS Step Functions, Axon Framework (Java), Saga state machines in your DB.


Outbox Pattern

Problem: Service A writes to its database AND publishes an event to Kafka. If the DB write succeeds but Kafka publish fails (or vice versa), you have inconsistency.

Solution: Write the event to an outbox table in the same database transaction as the business data. A separate relay process reads unprocessed outbox rows and publishes them.

BEGIN;
  INSERT INTO orders (id, status) VALUES (123, 'PLACED');
  INSERT INTO outbox (event_type, payload, processed) 
    VALUES ('ORDER_CREATED', '{"id": 123}', false);
COMMIT;
-- Both committed atomically, or neither committed

-- Separate process (or Debezium via CDC):
SELECT * FROM outbox WHERE processed = false ORDER BY created_at;
-- For each row: publish to Kafka, then mark processed = true

Key properties:

  • The business write and event publication are atomic
  • At-least-once delivery — if the relay crashes after publishing but before marking processed, it publishes again. Consumers must be idempotent.
  • CDC (Debezium) reading the outbox table eliminates the polling relay process — Debezium reacts to the DB change immediately
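A runnable sketch of the pattern using SQLite as the service database; the `published` list is a hypothetical stand-in for a Kafka producer:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event_type TEXT, payload TEXT,
                         processed INTEGER DEFAULT 0);
""")

def place_order(order_id):
    # Business write and outbox write commit in ONE transaction.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'PLACED')", (order_id,))
        db.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                   ("ORDER_CREATED", '{"id": %d}' % order_id))

published = []  # stand-in for a Kafka topic

def relay_once():
    # Publish each unprocessed row, THEN mark it processed. A crash between
    # the two steps causes a re-publish on restart: at-least-once delivery.
    rows = db.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE processed = 0 ORDER BY id").fetchall()
    for row_id, event_type, payload in rows:
        published.append((event_type, payload))  # "publish to Kafka"
        with db:
            db.execute("UPDATE outbox SET processed = 1 WHERE id = ?", (row_id,))

place_order(123)
relay_once()  # published now holds ("ORDER_CREATED", '{"id": 123}')
```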

When to use: Any time you need to reliably publish events that correspond to database changes. Critical for event sourcing, notification systems, and service integration.


CQRS (Command Query Responsibility Segregation)

Problem: The data model optimized for writes (normalized, transactional) is not optimal for reads (denormalized, pre-aggregated). Complex reporting queries are slow on the write model.

Solution: Separate the write model (command side) from the read model (query side). They can use different data stores, different schemas, even different technologies.

Write side:          Read side:
Commands →           Events from write side →
  OrderService    →    OrderReadModel (projected view)
  (Postgres)           (Elasticsearch or separate Postgres table)

Query: "All orders for user X with product details"
→ hits denormalized read model → fast, no joins

CQRS doesn’t require event sourcing, though they’re often used together. CQRS just means: the model you write to is different from the model you read from.
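A toy in-process sketch, with dicts standing in for the write store and the read store; in production the projection runs asynchronously, which is where the eventual consistency comes from:

```python
# Write side: command handler writes the normalized model and emits events.
orders = {}   # write model (would be Postgres)
events = []   # event feed consumed by the read side

def place_order(order_id, user, items):
    orders[order_id] = {"user": user, "items": items}
    events.append(("OrderPlaced", order_id, user, items))

# Read side: denormalized projection, pre-grouped by user, no joins needed.
orders_by_user = {}  # read model (would be Elasticsearch / a separate table)

def project(event):
    kind, order_id, user, items = event
    orders_by_user.setdefault(user, []).append(
        {"order_id": order_id, "items": items})

place_order(1, "alice", ["SKU-1"])
place_order(2, "alice", ["SKU-2"])
for e in events:   # in production this lags behind: eventual consistency
    project(e)

# "All orders for user X" hits the read model directly.
print(len(orders_by_user["alice"]))  # 2
```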

When to use:

  • Complex domain with significantly different read and write patterns
  • Read performance requirements can’t be met with the write model
  • Multiple read representations needed (same data, different views for different consumers)
  • Audit/history requirements (pair with event sourcing)

The cost: Eventual consistency between write and read models. When you write, the read model is updated asynchronously — reads may see slightly stale data. Also: two models to maintain, synchronization logic to build and monitor.

CQRS is not the default. Most CRUD applications don’t need it. Introduce it when the read/write impedance mismatch is causing real problems.


Event Sourcing

Problem: Traditional systems store current state. You lose history — “how did we get here?” can’t be answered.

Solution: Store the sequence of events that led to the current state. Current state is derived by replaying events.

Events (the source of truth):
  1. OrderCreated { id: 1, items: [...] }
  2. ItemAdded    { item: "SKU-999" }
  3. CouponApplied { code: "SAVE20" }
  4. OrderPlaced  { total: 80.00 }

Current state (derived by replaying events 1–4):
  Order { id: 1, status: PLACED, total: 80.00, coupon: "SAVE20", ... }
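The replay above can be sketched as a fold over the event list. The event names come from the example; the `apply` function is illustrative:

```python
# Events are the source of truth; current state is derived by folding them.
events = [
    ("OrderCreated",  {"id": 1, "items": []}),
    ("ItemAdded",     {"item": "SKU-999"}),
    ("CouponApplied", {"code": "SAVE20"}),
    ("OrderPlaced",   {"total": 80.00}),
]

def apply(state, event):
    kind, data = event
    if kind == "OrderCreated":
        return {"id": data["id"], "items": list(data["items"]), "status": "NEW"}
    if kind == "ItemAdded":
        return {**state, "items": state["items"] + [data["item"]]}
    if kind == "CouponApplied":
        return {**state, "coupon": data["code"]}
    if kind == "OrderPlaced":
        return {**state, "status": "PLACED", "total": data["total"]}
    return state  # unknown events are ignored, easing schema evolution

def replay(events):
    state = None
    for e in events:
        state = apply(state, e)
    return state

order = replay(events)
# order == {"id": 1, "items": ["SKU-999"], "status": "PLACED",
#           "coupon": "SAVE20", "total": 80.0}
```

Replaying to any prefix of the list (`replay(events[:2])`) is the "time travel" described below; a snapshot is just a cached `state` plus the index it was taken at.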

What event sourcing gives you:

  • Complete audit trail — not just current state, but every change and why
  • Time travel — replay to any point in time
  • Event replay for new consumers — add a new read model (analytics, cache) by replaying history
  • Debugging — reproduce any production issue by replaying events
  • Decoupling — consumers subscribe to events, not state changes

The costs:

  • Complexity. Querying current state requires event replay or maintaining snapshots. Simple “SELECT * FROM orders” doesn’t work.
  • Snapshots needed for large event histories — replaying 100,000 events to get current state is slow. Snapshots checkpoint state at intervals.
  • Schema evolution is hard. An event in the log from 3 years ago must still be interpretable today. Event upcasting required.
  • Not for everything. Most services don’t need this. Use it for domains where history, auditability, and replayability are first-class requirements (financial ledgers, order management, healthcare records).

API Gateway Pattern

Problem: Clients need to call multiple backend services. Logic for auth, rate limiting, routing, and request aggregation is duplicated across services.

Solution: A single entry point that handles cross-cutting concerns and routes to backend services.

Responsibilities:

  • Authentication and authorization (validate JWT, check scopes)
  • Rate limiting per client/API key
  • SSL termination
  • Request routing and load balancing
  • Response caching for GET requests
  • Protocol translation (REST to gRPC)
  • Request/response transformation
  • Observability (access logs, metrics per endpoint)
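A minimal sketch of the first three responsibilities in Python. The routes, API keys, and limits here are made up; real gateways are configured, not hand-coded:

```python
import time

ROUTES = {"/orders": "order-service", "/users": "user-service"}  # hypothetical
API_KEYS = {"key-123": {"limit": 2, "window": 60}}               # hypothetical

calls = {}  # api_key -> timestamps of recent calls

def gateway(path, api_key):
    """Cross-cutting concerns only: auth, rate limiting, routing."""
    now = time.time()
    # 1. Authentication
    policy = API_KEYS.get(api_key)
    if policy is None:
        return 401, "unauthorized"
    # 2. Rate limiting (sliding window per API key)
    window_start = now - policy["window"]
    recent = [t for t in calls.get(api_key, []) if t > window_start]
    if len(recent) >= policy["limit"]:
        return 429, "rate limit exceeded"
    calls[api_key] = recent + [now]
    # 3. Routing by path prefix; no business logic lives here
    for prefix, service in ROUTES.items():
        if path.startswith(prefix):
            return 200, "routed to " + service
    return 404, "no route"
```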

Tools: AWS API Gateway, Kong, Nginx, Envoy, Spring Cloud Gateway, Traefik.

Gotcha: Don’t put business logic in the API Gateway. It should be routing + cross-cutting concerns. If you’re writing conditional logic based on request body content in the gateway, that logic belongs in a service.


BFF (Backend for Frontend)

Problem: A mobile app and a web app have different data needs. The web app needs rich data; the mobile app needs lightweight responses. Building one API that serves both leads to over-fetching on mobile or under-fetching on web.

Solution: A dedicated backend service per frontend type — a BFF. Each BFF aggregates and shapes data from downstream services specifically for its frontend.

Mobile App → Mobile BFF → UserService, OrderService (aggregated, optimized for mobile)
Web App    → Web BFF   → UserService, OrderService, RecommendationService (rich, desktop-optimized)
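A sketch of the two BFFs over the same hypothetical downstream services, showing how each shapes the payload for its client:

```python
# Hypothetical downstream service responses.
def user_service(user_id):
    return {"id": user_id, "name": "Alice", "bio": "...", "avatar_url": "..."}

def order_service(user_id):
    return [{"id": 1, "items": ["SKU-1"], "history": ["..."]}]

def mobile_bff(user_id):
    """One round trip, trimmed payload: only what the mobile screen renders."""
    user = user_service(user_id)
    orders = order_service(user_id)
    return {"name": user["name"], "order_count": len(orders)}

def web_bff(user_id):
    """Rich payload: full profile plus full order objects for the desktop UI."""
    return {"user": user_service(user_id), "orders": order_service(user_id)}
```

The downstream services are identical in both paths; only the aggregation and shaping differ, which is exactly the part the frontend team owns.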

The BFF is owned by the frontend team. They understand their data needs and can evolve their BFF independently. The backend services remain stable.

When BFF makes sense:

  • Meaningfully different data requirements across client types
  • Mobile performance is critical (minimize payload, reduce round trips)
  • Frontend team velocity is blocked by backend team changes

When it’s overkill:

  • The clients have nearly identical data needs
  • You lack the team budget to own N BFF services (each BFF is an additional service to maintain)

Strangler Fig Pattern

Problem: You need to replace a legacy system (the “monolith”) but can’t do a big-bang rewrite.

Solution: Progressively route traffic for specific features from the old system to the new one. The old system is “strangled” as more functionality moves out.

Phase 1: All traffic → Monolith
Phase 2: User auth traffic → New Auth Service; rest → Monolith
Phase 3: Order creation → New Order Service; rest → Monolith
...
Phase N: Monolith retired

Implementation: A facade layer (proxy, API gateway, or feature flag router) sits in front of both systems and routes based on the path, header, or user cohort.
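The facade can be sketched as a prefix router over hypothetical backends; migrating a feature means adding its prefix, and rolling back means removing it:

```python
# Hypothetical backends.
def monolith(path):      return "monolith handled " + path
def auth_service(path):  return "auth-service handled " + path
def order_service(path): return "order-service handled " + path

# Routes migrated so far; everything else falls through to the monolith.
MIGRATED = {
    "/auth":   auth_service,   # phase 2
    "/orders": order_service,  # phase 3
}

def facade(path):
    for prefix, handler in MIGRATED.items():
        if path.startswith(prefix):
            return handler(path)
    return monolith(path)  # rollback = delete the prefix from MIGRATED
```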

Why it works: Each piece is a small, bounded migration that can be tested and validated independently. Rollback is as simple as flipping the router back. No big-bang cutover risk.


Sidecar / Service Mesh

Problem: Cross-cutting concerns (service discovery, mTLS, retries, metrics) are implemented in every service, in every language. Changing the retry policy requires updating 50 services.

Solution: A sidecar proxy runs alongside each service container. The proxy intercepts all network traffic and handles cross-cutting concerns transparently.

[Service Pod]
  ├── App container  (your code)
  └── Envoy sidecar  (handles mTLS, retries, circuit breaking, telemetry)

Service mesh (Istio, Linkerd): Orchestrates all sidecars with a control plane. Policy changes propagate to all sidecars without application deployments.

What services gain: mTLS, distributed tracing, circuit breaking, load balancing — all without a single line of application code.

The cost: Sidecar adds latency (~5ms per hop), memory (~50MB per pod), and operational complexity. Worth it at scale; may not be worth it for 3 services.


Bulkhead Pattern

Problem: A slow downstream dependency consumes all your threads or connections, starving other downstream calls.

Solution: Isolate each dependency into its own resource pool (thread pool or connection pool). A slow dependency only affects its own pool.

Without bulkhead:
  All 200 threads shared → SlowService consumes all 200 → FastService gets none → everything fails

With bulkhead:
  50 threads for SlowService → 150 threads for FastService
  SlowService degrades → FastService unaffected
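A semaphore-based bulkhead can be sketched in a few lines of Python. The pool sizes follow the example above; Resilience4j's semaphore bulkhead behaves similarly:

```python
import threading

class Bulkhead:
    """Semaphore bulkhead: caps concurrent calls to one dependency."""
    def __init__(self, max_concurrent):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, fn, *args):
        # Fail fast instead of queueing when the pool is exhausted,
        # so a slow dependency cannot tie up the caller's threads.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full")
        try:
            return fn(*args)
        finally:
            self._sem.release()

slow_pool = Bulkhead(max_concurrent=50)   # SlowService gets its own pool
fast_pool = Bulkhead(max_concurrent=150)  # FastService is isolated from it
```

Calls to the slow dependency go through `slow_pool.call(...)`; once 50 are in flight, further calls fail immediately instead of starving `fast_pool`.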

In Java/Spring: Resilience4j @Bulkhead — configure semaphore or thread pool bulkhead per downstream service. Hystrix (deprecated) called these “thread pools.”

Combined with circuit breaker: Bulkhead limits concurrent calls; circuit breaker stops calls when failure rate is high. Used together, they prevent a failing dependency from cascading.