Microservices Patterns: Saga, CQRS, Event Sourcing, BFF, and More

Lakshay Jawa

Microservices patterns are the vocabulary of distributed systems design. Knowing when to apply each one — and when not to — separates an architect who reads pattern books from one who’s shipped production systems.


Saga Pattern

Problem: A business transaction spans multiple services, each with its own database. A distributed ACID transaction (two-phase commit) across all of them is impractical: it couples the services tightly and most modern datastores and brokers don't support it.

Solution: A saga is a sequence of local transactions. Each step publishes an event or triggers the next step. If a step fails, compensating transactions undo previous steps.

Choreography-based saga: Services react to events — no central coordinator.

1. OrderService: creates order → publishes OrderCreated
2. InventoryService: listens → reserves stock → publishes StockReserved
3. PaymentService: listens → charges card → publishes PaymentCompleted
4. OrderService: listens → confirms order

Failure at step 3:
3. PaymentService: charge fails → publishes PaymentFailed
2. InventoryService: listens → releases reservation → publishes StockReleased
1. OrderService: listens → cancels order

Orchestration-based saga: A saga orchestrator (a service or workflow engine) explicitly coordinates each step.

SagaOrchestrator:
  step 1: call InventoryService.reserve() → success
  step 2: call PaymentService.charge()   → fails
  step 3: call InventoryService.release() (compensate)
  → return failure
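In Java terms (the stack this blog focuses on), the orchestrator loop above can be sketched as a plain step/compensate runner. This is a minimal sketch with a hypothetical `Step` interface — no persistence, retries, or workflow engine:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal orchestration-saga sketch. The Step interface is a hypothetical
// stand-in for calls like InventoryService.reserve() / .release().
public class SagaOrchestrator {

    interface Step {
        boolean execute();   // the local transaction
        void compensate();   // undoes it if a later step fails
    }

    /** Runs steps in order; on failure, compensates completed steps in reverse. */
    public static boolean run(Step... steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            if (!step.execute()) {
                while (!completed.isEmpty()) {
                    completed.pop().compensate(); // reverse order
                }
                return false; // saga failed, completed work undone
            }
            completed.push(step);
        }
        return true; // saga committed
    }
}
```

A real orchestrator would also persist its position after every step so the saga is resumable after a crash — exactly the pitfall listed below.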

When to use which:

  • Choreography: fewer services, loose coupling desired, simple failure paths
  • Orchestration: many services, complex failure compensation, need visibility into saga state

Real pitfalls:

  • Compensating transactions must be idempotent. The network might redeliver a compensation event.
  • Partial failures are hard to reason about. What if the compensation itself fails?
  • Visibility: Where is the saga in its lifecycle? Orchestration is much easier to observe.
  • Saga state must be persisted — if the orchestrator crashes mid-saga, it must be resumable.

Tooling: Temporal.io, AWS Step Functions, Axon Framework (Java), Saga state machines in your DB.


Outbox Pattern

Problem: Service A writes to its database AND publishes an event to Kafka. If the DB write succeeds but Kafka publish fails (or vice versa), you have inconsistency.

Solution: Write the event to an outbox table in the same database transaction as the business data. A separate relay process reads unprocessed outbox rows and publishes them.

BEGIN;
  INSERT INTO orders (id, status) VALUES (123, 'PLACED');
  INSERT INTO outbox (event_type, payload, processed) 
    VALUES ('ORDER_CREATED', '{"id": 123}', false);
COMMIT;
-- Both committed atomically, or neither committed

-- Separate process (or Debezium via CDC):
SELECT * FROM outbox WHERE processed = false ORDER BY created_at;
-- For each row: publish to Kafka, then mark processed = true
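The relay's loop maps directly to code. A minimal in-memory sketch, where `OutboxRow` stands in for the outbox table and the `Consumer` for a Kafka producer:

```java
import java.util.List;
import java.util.function.Consumer;

// Outbox relay sketch. OutboxRow and the publisher are hypothetical
// stand-ins for the outbox table and the message broker client.
public class OutboxRelay {

    static class OutboxRow {
        final String eventType;
        final String payload;
        boolean processed;
        OutboxRow(String eventType, String payload) {
            this.eventType = eventType;
            this.payload = payload;
        }
    }

    /**
     * Publishes each unprocessed row, then marks it processed. A crash
     * between publish and mark means the row is published again on restart:
     * at-least-once delivery, so consumers must be idempotent.
     */
    public static int drain(List<OutboxRow> outbox, Consumer<OutboxRow> publish) {
        int published = 0;
        for (OutboxRow row : outbox) {
            if (row.processed) continue;   // SELECT ... WHERE processed = false
            publish.accept(row);           // publish to the broker
            row.processed = true;          // UPDATE ... SET processed = true
            published++;
        }
        return published;
    }
}
```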

Key properties:

  • The business write and event publication are atomic
  • At-least-once delivery — if the relay crashes after publishing but before marking processed, it publishes again. Consumers must be idempotent.
  • CDC (Debezium) tailing the database’s transaction log eliminates the polling relay — outbox inserts are streamed to Kafka as soon as they commit

When to use: Any time you need to reliably publish events that correspond to database changes. Critical for event sourcing, notification systems, and service integration.


CQRS (Command Query Responsibility Segregation)

Problem: The data model optimized for writes (normalized, transactional) is not optimal for reads (denormalized, pre-aggregated). Complex reporting queries are slow on the write model.

Solution: Separate the write model (command side) from the read model (query side). They can use different data stores, different schemas, even different technologies.

Write side:          Read side:
Commands →           Events from write side →
  OrderService    →    OrderReadModel (projected view)
  (Postgres)           (Elasticsearch or separate Postgres table)

Query: "All orders for user X with product details"
→ hits denormalized read model → fast, no joins

CQRS doesn’t require event sourcing, though they’re often used together. CQRS just means: the model you write to is different from the model you read from.
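A minimal projection sketch: the read side folds write-side events into a denormalized view keyed by user. Event and field names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// CQRS read-model projection sketch. OrderPlaced is a hypothetical
// write-side event; the Map is a stand-in for the read store.
public class OrderProjection {

    record OrderPlaced(long orderId, String userId, double total) {}

    /** Builds a denormalized view (userId -> order summaries): fast reads, no joins. */
    public static Map<String, List<String>> project(List<OrderPlaced> events) {
        Map<String, List<String>> byUser = new HashMap<>();
        for (OrderPlaced e : events) {
            byUser.computeIfAbsent(e.userId(), k -> new ArrayList<>())
                  .add("order " + e.orderId() + " ($" + e.total() + ")");
        }
        return byUser;
    }
}
```

In production the projector runs continuously, consuming the event stream and updating the read store incrementally — which is exactly where the eventual consistency discussed below comes from.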

When to use:

  • Complex domain with significantly different read and write patterns
  • Read performance requirements can’t be met with the write model
  • Multiple read representations needed (same data, different views for different consumers)
  • Audit/history requirements (pair with event sourcing)

The cost: Eventual consistency between write and read models. When you write, the read model is updated asynchronously — reads may see slightly stale data. Also: two models to maintain, synchronization logic to build and monitor.

CQRS is not the default. Most CRUD applications don’t need it. Introduce it when the read/write impedance mismatch is causing real problems.


Event Sourcing

Problem: Traditional systems store current state. You lose history — “how did we get here?” can’t be answered.

Solution: Store the sequence of events that led to the current state. Current state is derived by replaying events.

Events (the source of truth):
  1. OrderCreated { id: 1, items: [...] }
  2. ItemAdded    { item: "SKU-999" }
  3. CouponApplied { code: "SAVE20" }
  4. OrderPlaced  { total: 80.00 }

Current state (derived by replaying events 1–4):
  Order { id: 1, status: PLACED, total: 80.00, coupon: "SAVE20", ... }
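Replay is just a left fold over the log. A minimal Java sketch using the events above (the field names are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Event-sourcing replay sketch: current state is derived by applying
// each event in order. Event fields are hypothetical.
public class OrderAggregate {

    sealed interface Event permits OrderCreated, ItemAdded, CouponApplied, OrderPlaced {}
    record OrderCreated(long id) implements Event {}
    record ItemAdded(String sku) implements Event {}
    record CouponApplied(String code) implements Event {}
    record OrderPlaced(double total) implements Event {}

    record State(long id, String status, double total, String coupon, List<String> items) {}

    /** Current state = left fold over the event log. */
    public static State replay(List<Event> log) {
        long id = 0; String status = "NEW"; double total = 0; String coupon = null;
        List<String> items = new ArrayList<>();
        for (Event e : log) {
            if (e instanceof OrderCreated c) id = c.id();
            else if (e instanceof ItemAdded a) items.add(a.sku());
            else if (e instanceof CouponApplied c) coupon = c.code();
            else if (e instanceof OrderPlaced p) { total = p.total(); status = "PLACED"; }
        }
        return new State(id, status, total, coupon, items);
    }
}
```

A snapshot is just a saved `State` plus the log position it was taken at; replay then starts from the snapshot instead of event 1.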

What event sourcing gives you:

  • Complete audit trail — not just current state, but every change and why
  • Time travel — replay to any point in time
  • Event replay for new consumers — add a new read model (analytics, cache) by replaying history
  • Debugging — reproduce any production issue by replaying events
  • Decoupling — consumers subscribe to events, not state changes

The costs:

  • Complexity. Querying current state requires event replay or maintaining snapshots. Simple “SELECT * FROM orders” doesn’t work.
  • Snapshots needed for large event histories — replaying 100,000 events to get current state is slow. Snapshots checkpoint state at intervals.
  • Schema evolution is hard. An event in the log from 3 years ago must still be interpretable today. Event upcasting required.
  • Not for everything. Most services don’t need this. Use it for domains where history, auditability, and replayability are first-class requirements (financial ledgers, order management, healthcare records).

API Gateway Pattern

Problem: Clients need to call multiple backend services. Logic for auth, rate limiting, routing, and request aggregation is duplicated across services.

Solution: A single entry point that handles cross-cutting concerns and routes to backend services.

Responsibilities:

  • Authentication and authorization (validate JWT, check scopes)
  • Rate limiting per client/API key
  • SSL termination
  • Request routing and load balancing
  • Response caching for GET requests
  • Protocol translation (REST to gRPC)
  • Request/response transformation
  • Observability (access logs, metrics per endpoint)

Tools: AWS API Gateway, Kong, Nginx, Envoy, Spring Cloud Gateway, Traefik.

Gotcha: Don’t put business logic in the API Gateway. It should be routing + cross-cutting concerns. If you’re writing conditional logic based on request body content in the gateway, that logic belongs in a service.


BFF (Backend for Frontend)

Problem: A mobile app and a web app have different data needs. The web app needs rich data; the mobile app needs lightweight responses. Building one API that serves both leads to over-fetching on mobile or under-fetching on web.

Solution: A dedicated backend service per frontend type — a BFF. Each BFF aggregates and shapes data from downstream services specifically for its frontend.

Mobile App → Mobile BFF → UserService, OrderService (aggregated, optimized for mobile)
Web App    → Web BFF   → UserService, OrderService, RecommendationService (rich, desktop-optimized)

The BFF is owned by the frontend team. They understand their data needs and can evolve their BFF independently. The backend services remain stable.
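A minimal aggregation sketch: the mobile BFF combines hypothetical downstream responses and returns only what the mobile client needs, dropping everything else:

```java
import java.util.List;
import java.util.Map;

// BFF aggregation sketch. User and Order are hypothetical stand-ins for
// responses from UserService and OrderService.
public class MobileBff {

    record User(String id, String name, String email, String address) {}
    record Order(long id, String status) {}

    /** Shapes two downstream responses into one mobile-sized payload. */
    public static Map<String, Object> profile(User user, List<Order> orders) {
        // Mobile needs only the display name and an order count; email and
        // address are dropped. A web BFF might return everything, plus
        // recommendations from a third service.
        return Map.of("name", user.name(), "orderCount", orders.size());
    }
}
```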

When BFF makes sense:

  • Meaningfully different data requirements across client types
  • Mobile performance is critical (minimize payload, reduce round trips)
  • Frontend team velocity is blocked by backend team changes

When it’s overkill:

  • The clients have nearly identical data needs
  • You don’t have the team budget to own N BFF services (each BFF is an additional service to maintain)

Strangler Fig Pattern

Problem: You need to replace a legacy system (the “monolith”) but can’t do a big-bang rewrite.

Solution: Progressively route traffic for specific features from the old system to the new one. The old system is “strangled” as more functionality moves out.

Phase 1: All traffic → Monolith
Phase 2: User auth traffic → New Auth Service; rest → Monolith
Phase 3: Order creation → New Order Service; rest → Monolith
...
Phase N: Monolith retired

Implementation: A facade layer (proxy, API gateway, or feature flag router) sits in front of both systems and routes based on the path, header, or user cohort.
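A minimal path-prefix sketch of that facade router (prefixes and backend names are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Strangler-fig facade sketch: path prefixes migrated so far route to new
// services; everything else falls through to the monolith.
public class StranglerRouter {

    private final Map<String, String> migrated = new LinkedHashMap<>();
    private final String fallback;

    public StranglerRouter(String fallback) { this.fallback = fallback; }

    /** Phase N of the migration: claim one more path prefix. */
    public void migrate(String pathPrefix, String backend) {
        migrated.put(pathPrefix, backend);
    }

    public String route(String path) {
        for (var e : migrated.entrySet()) {
            if (path.startsWith(e.getKey())) return e.getValue();
        }
        return fallback; // still the monolith's job
    }
}
```

Rollback is removing the prefix entry again; traffic instantly flows back to the monolith.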

Why it works: Each piece is a small, bounded migration that can be tested and validated independently. Rollback is as simple as flipping the router back. There is no big-bang cutover risk.


Sidecar / Service Mesh

Problem: Cross-cutting concerns (service discovery, mTLS, retries, metrics) are implemented in every service, in every language. Changing the retry policy requires updating 50 services.

Solution: A sidecar proxy runs alongside each service container. The proxy intercepts all network traffic and handles cross-cutting concerns transparently.

[Service Pod]
  ├── App container  (your code)
  └── Envoy sidecar  (handles mTLS, retries, circuit breaking, telemetry)

Service mesh (Istio, Linkerd): Orchestrates all sidecars with a control plane. Policy changes propagate to all sidecars without application deployments.

What services gain: mTLS, distributed tracing, circuit breaking, load balancing — all without a single line of application code.

The cost: Sidecar adds latency (~5ms per hop), memory (~50MB per pod), and operational complexity. Worth it at scale; may not be worth it for 3 services.


Bulkhead Pattern

Problem: A slow downstream dependency consumes all your threads or connections, starving other downstream calls.

Solution: Isolate each dependency into its own resource pool (thread pool or connection pool). A slow dependency only affects its own pool.

Without bulkhead:
  All 200 threads shared → SlowService consumes all 200 → FastService gets none → everything fails

With bulkhead:
  50 threads for SlowService → 150 threads for FastService
  SlowService degrades → FastService unaffected

In Java/Spring: Resilience4j @Bulkhead — configure a semaphore or thread-pool bulkhead per downstream service. Hystrix (now in maintenance mode) implemented the same idea as thread-pool isolation.
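Conceptually, a semaphore bulkhead is a few lines of plain Java. This is a minimal sketch of the idea, not Resilience4j's actual implementation:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Semaphore-bulkhead sketch: at most maxConcurrent calls reach the
// dependency; excess calls fail fast instead of queueing behind it.
public class Bulkhead {

    private final Semaphore permits;

    public Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the call if a permit is free; otherwise rejects immediately. */
    public <T> T execute(Supplier<T> call, T rejectedFallback) {
        if (!permits.tryAcquire()) {
            return rejectedFallback; // pool exhausted: fail fast
        }
        try {
            return call.get();
        } finally {
            permits.release();
        }
    }
}
```

You would create one `Bulkhead` per downstream dependency — e.g. 50 permits for SlowService, 150 for FastService — so one saturated pool can never starve the other.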

Combined with circuit breaker: Bulkhead limits concurrent calls; circuit breaker stops calls when failure rate is high. Used together, they prevent a failing dependency from cascading.