
Messaging and Event-Driven Architecture: Kafka vs RabbitMQ vs SQS

The choice between a message queue and an event streaming platform shapes your architecture more than almost any other infrastructure decision. Getting it wrong means rebuilding — not reconfiguring. Here’s how to think through it.


Message Queue vs Event Streaming: The Fundamental Distinction

This distinction matters before you pick a product.

Message queue (RabbitMQ, SQS, ActiveMQ):

  • A message is a task or command for a consumer
  • Typically consumed once — it’s deleted after successful processing
  • Consumer drives the pace — pull or push, but once processed, it’s gone
  • Good for: work distribution, background job processing, decoupled command execution

Event streaming (Kafka, Kinesis, Google Pub/Sub):

  • An event is a fact — something that happened. It’s retained on the log.
  • Multiple independent consumers can read the same events at their own pace
  • The log is append-only and retained (configurable, but can be days/weeks/forever)
  • Good for: audit trail, replayability, multiple consumers with different read positions, event sourcing, CDC

The test question: “Do you need to replay events? Do multiple independent consumers need to process the same event for different purposes?” If yes, you need event streaming. If it’s just task distribution, a queue is simpler and sufficient.


Kafka: When to Use It

Kafka is the dominant event streaming platform. It’s designed for high-throughput, ordered, durable, replayable event logs.

Kafka wins when:

  • You have high write volume (millions of events/second)
  • Multiple consumers need to process the same events independently (analytics + order processing + fraud scoring all from the same order event)
  • You need replay — re-process historical events for a new consumer, replay after bug fix, backfill a new data store
  • You need strict ordering of related events (Kafka guarantees order within a partition, so events sharing a key stay ordered)
  • Event sourcing — your system’s state is derived from the event log
  • CDC pipeline — database changes published as events

Kafka’s costs:

  • Operational complexity — Zookeeper (pre-3.3) or KRaft, broker sizing, partition count decisions, consumer group management, rebalancing, lag monitoring
  • Not a queue — consumer state (offset) is managed by the consumer. At-least-once delivery is the norm. Exactly-once is possible but requires transactional producers and idempotent consumers.
  • Partition count is effectively fixed at topic creation: you can add partitions later, but keys then hash to different partitions, breaking per-key ordering (and you can never shrink a topic)
  • End-to-end latency is typically a few milliseconds at best; not designed for ultra-low-latency (sub-millisecond) use cases
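The partition-count pitfall above can be illustrated with a toy key-to-partition mapping. This is a simplified stand-in (Kafka actually applies murmur2 to the key bytes), but the failure mode is the same: change the partition count and the same key can land on a different partition.

```python
# Simplified sketch of key-to-partition mapping. Kafka uses murmur2 on
# the key bytes; a deterministic byte-sum stands in for it here.
def partition_for(key: str, num_partitions: int) -> int:
    return sum(key.encode()) % num_partitions

# With 4 partitions, every event keyed "user-7" lands on one partition,
# so per-key ordering holds.
p_before = partition_for("user-7", 4)

# After growing the topic to 6 partitions, the same key maps elsewhere:
# old events for "user-7" sit on the old partition while new ones go to
# a different one, so ordering across the resize is lost.
p_after = partition_for("user-7", 6)
```

The same arithmetic explains why partition count deserves up-front thought: overshooting costs some broker overhead, but undershooting forces a resize that breaks key ordering.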

“Your team wants to introduce Kafka — what questions do you ask?”

  1. What problem is Kafka solving that a simple queue or synchronous call doesn’t solve?
  2. Who will operate it? Do we have Kafka expertise or budget for managed Kafka (Confluent Cloud, MSK)?
  3. Do we need replayability / multiple consumers / high throughput, or just decoupling?
  4. What’s the schema evolution strategy for event payloads? (Avro + Schema Registry, Protobuf, JSON with versioning?)
  5. How will we monitor consumer lag and set alerts?
  6. What’s the data retention requirement?
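For question 5, consumer lag is simply the gap between the log end offset and the group's committed offset, per partition. A rough sketch (the offset numbers and threshold are invented; real deployments read them via Kafka's admin API or a lag exporter):

```python
# Toy consumer-lag calculation: lag = log end offset - committed offset,
# computed per partition. The numbers below are illustrative only.
end_offsets = {0: 1200, 1: 980, 2: 1500}   # latest offset per partition
committed   = {0: 1150, 1: 980, 2: 900}    # consumer group's committed offsets

lag = {p: end_offsets[p] - committed[p] for p in end_offsets}
total_lag = sum(lag.values())

# Alert on any partition whose lag exceeds a threshold (value is arbitrary;
# in practice you'd also alert on lag *growth rate*, not just absolute lag).
ALERT_THRESHOLD = 500
alerts = [p for p, l in lag.items() if l > ALERT_THRESHOLD]
```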

RabbitMQ: When It’s the Right Tool

RabbitMQ is a traditional message broker: AMQP protocol, exchanges, queues, routing. Simpler to operate than Kafka, well-suited for work distribution.

RabbitMQ wins when:

  • You need sophisticated message routing (topic exchanges, header-based routing, dead letter queues)
  • You need per-message TTL and priority queues
  • Consumer-driven acknowledgement model is important (consume → process → ack/nack)
  • Lower throughput requirements (thousands/second, not millions)
  • You need complex queuing topologies
  • Work distribution where each message goes to exactly one consumer (competing consumers pattern)
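The ack/nack and competing-consumers points can be sketched with an in-memory queue. This is a toy model, not the AMQP or pika API: the point is that a nacked message is requeued and becomes available to any competing consumer, and a consumed-and-acked message is simply gone.

```python
from collections import deque

# In-memory sketch of the competing-consumers pattern with ack/nack
# semantics (loosely mirroring RabbitMQ's basic.ack / basic.nack+requeue).
class WorkQueue:
    def __init__(self):
        self._messages = deque()

    def publish(self, msg):
        self._messages.append(msg)

    def consume(self):
        # Each message is delivered to exactly one consumer.
        return self._messages.popleft() if self._messages else None

    def nack(self, msg):
        # Negative acknowledgement: requeue for redelivery.
        self._messages.append(msg)

q = WorkQueue()
q.publish("resize-image-1")
q.publish("resize-image-2")

msg = q.consume()   # consumer A takes the first task
q.nack(msg)         # processing failed: requeue it
# The requeued task is now available to a competing consumer B;
# a successful consumer would ack, and the message would not return.
```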

RabbitMQ vs Kafka:

                 RabbitMQ                        Kafka
Model            Message queue                   Event log
Consumers        One consumer per message        Multiple independent consumers
Replay           No                              Yes
Throughput       Thousands/sec                   Millions/sec
Retention        Until consumed                  Configurable (time or size)
Routing          Flexible (exchanges)            Partition-based
Ops complexity   Lower                           Higher
Best for         Task distribution, work queues  Event streaming, CDC, audit

SQS and SNS: The AWS Default

If you’re on AWS and don’t have strong reasons for self-hosted Kafka or RabbitMQ, SQS + SNS is the path of least resistance.

SQS Standard: At-least-once delivery, best-effort ordering. Simplest, highest throughput.

SQS FIFO: Exactly-once processing, strict ordering (within a message group). Max 3,000 messages/second per queue (with batching). Use when order matters (financial transactions, user command sequences).

SNS + SQS fan-out: SNS topic → multiple SQS queues. One event, multiple independent consumers. Approximates Kafka’s multi-consumer model for lower throughput cases.
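The fan-out shape can be modeled in a few lines of pure Python (this is a conceptual model, not boto3; the topic and queue names are illustrative). The key behavior: the topic copies each event into every subscribed queue, so consumers proceed independently, but once a queue's copy is consumed it is gone.

```python
# Minimal model of SNS -> SQS fan-out: the topic delivers a copy of each
# published event to every subscribed queue.
class Topic:
    def __init__(self):
        self.queues = []

    def subscribe(self, queue: list):
        self.queues.append(queue)

    def publish(self, event: dict):
        for q in self.queues:
            q.append(dict(event))  # each queue gets its own independent copy

orders = Topic()
billing_q, email_q, fraud_q = [], [], []
for q in (billing_q, email_q, fraud_q):
    orders.subscribe(q)

orders.publish({"type": "order.placed", "order_id": "A1"})
# Each consumer drains its own queue at its own pace. Unlike a Kafka log,
# a consumed copy is deleted: there is no shared history to replay.
```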

Limitations vs Kafka:

  • No replay — messages are deleted after consumption (even in FIFO)
  • Max retention 14 days
  • No consumer read positions: there is no offset concept, so a consumer can't rewind or re-read
  • Fan-out requires SNS topic + queue per consumer (more infrastructure)

When SQS is enough: Your use case is background jobs, async processing, simple work distribution, and you don’t need replay or multiple consumers reading the same event history.


Exactly-Once Semantics: Do You Actually Need It?

“Exactly-once” is often misunderstood. There are two levels:

  1. Exactly-once delivery: The message is delivered to the consumer exactly once. Kafka supports this within a Kafka pipeline via enable.idempotence=true plus transactional producers (transactional.id) and consumers reading with isolation.level=read_committed.

  2. Exactly-once processing (end-to-end): The downstream effect of the message happens exactly once. This requires idempotent consumers — the same message processed twice produces the same result.

The honest answer: Exactly-once delivery is achievable. Exactly-once end-to-end semantics require idempotent consumers, which is a design requirement on your business logic. You cannot guarantee exactly-once without idempotent processing on the consumer side.

Practical approach: Design consumers to be idempotent (deduplicate by event ID), accept at-least-once delivery, and handle duplicates gracefully. This is simpler and more reliable than relying on transactional exactly-once, which has significant throughput overhead and operational complexity.
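A sketch of the dedup-by-event-ID approach. In production the seen-ID set would live in a durable store, ideally updated in the same transaction as the business effect; an in-memory set and dict stand in here.

```python
# Idempotent consumer sketch: deduplicate by event ID so that
# at-least-once delivery still produces exactly-once *effects*.
processed_ids = set()          # stand-in for a durable dedup store
balance = {"acct-1": 0}        # stand-in for business state

def handle(event: dict):
    if event["event_id"] in processed_ids:
        return  # duplicate delivery: safely ignore
    # In production, apply the effect and record the event ID in the
    # same transaction, so a crash between the two can't cause a double-apply.
    balance[event["account"]] += event["amount"]
    processed_ids.add(event["event_id"])

evt = {"event_id": "e-100", "account": "acct-1", "amount": 50}
handle(evt)
handle(evt)  # redelivered duplicate has no further effect
```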


Synchronous REST vs Async Messaging: The Decision

This comes up for every service interaction. The framework:

Use synchronous REST/gRPC when:

  • The caller needs an immediate response with the result
  • The operation is quick (< a few hundred ms)
  • Failure should be surfaced immediately to the caller
  • The client needs to know if the operation succeeded before continuing
  • Example: “Is this user authorized?” — you need the answer now

Use async messaging when:

  • The operation is long-running or the caller doesn’t need immediate confirmation of completion
  • You want to decouple services so a downstream slowdown doesn’t propagate upstream
  • Multiple services need to react to the same event
  • The operation can be retried without user-visible impact
  • Example: “Order placed — trigger inventory reservation, email confirmation, fraud check” — all can happen async

Hybrid pattern (command + event): Accept a request synchronously (validate and persist), return a correlation ID, and process asynchronously. Client polls or receives a callback/webhook. Used in payment processing, video encoding, document generation.
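The hybrid pattern can be sketched as follows (the function names and in-memory job store are invented for illustration; a real system would persist jobs durably and hand processing to a queue-driven worker):

```python
import uuid

# Hybrid command + event sketch: accept and persist the request
# synchronously, return a correlation ID, complete the work later.
jobs = {}  # correlation_id -> job record (stand-in for a durable store)

def accept_request(payload: dict) -> str:
    correlation_id = str(uuid.uuid4())
    jobs[correlation_id] = {"status": "pending", "payload": payload}
    # Return immediately; the client polls (or receives a webhook)
    # using this ID.
    return correlation_id

def worker_process(correlation_id: str):
    # Long-running work (encoding, payment capture, ...) happens here,
    # off the request path, and can be retried on failure.
    jobs[correlation_id]["status"] = "done"

cid = accept_request({"video": "intro.mp4"})
# At this point the client already has its correlation ID while the
# job is still "pending"; the worker finishes asynchronously.
worker_process(cid)
```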


Schema Evolution in Event Payloads

Events accumulate technical debt. A schema you can’t change without breaking consumers is a serious problem. Strategies:

1. Avro + Schema Registry (Confluent/Apicurio): Binary serialization with a central schema registry. Producers/consumers validate compatibility before publishing. Schema evolution rules enforced at write time: backward compatible (add optional fields), forward compatible (remove optional fields), fully compatible.

2. Protobuf: Binary, backward/forward compatible by design if you follow the rules (don’t reuse field numbers, mark removed fields reserved). Good if you already use gRPC.

3. JSON with versioning: Include a version or schemaVersion field. Consumers check and handle accordingly. Flexible but requires discipline — no enforcement at publish time.

4. Event versioning patterns:

  • Same topic, versioned field: { "version": 2, ... }. Simple but consumers must handle multiple versions.
  • Separate topics per version: orders-v1, orders-v2. Clean isolation but proliferates topics.
  • Upcasting: Consumer converts v1 events to v2 format at read time. Good for replay scenarios.
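An upcasting sketch, assuming an invented v1-to-v2 change in which v2 split a name field in two and added a currency with a default. Consumers then only ever handle the latest version, which is what makes this pattern pleasant for replay.

```python
# Upcasting sketch: convert a v1 event to the v2 shape at read time.
# The fields and the v1->v2 change are invented for illustration.
def upcast(event: dict) -> dict:
    if event.get("version", 1) == 1:
        # v2 split "name" into first/last and added a default currency.
        first, _, last = event["name"].partition(" ")
        event = {
            "version": 2,
            "first_name": first,
            "last_name": last,
            "currency": "USD",  # assumed default for pre-v2 events
        }
    return event  # v2 (and later) events pass through unchanged

v1 = {"version": 1, "name": "Ada Lovelace"}
v2 = upcast(v1)
```

Chaining upcasters (v1→v2, v2→v3, ...) keeps each migration small and lets a new consumer replay the full history through the chain.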

EM stance: Enforce schema compatibility programmatically from day one. An ad-hoc JSON schema without enforcement will break consumers within 6 months of the first “quick change.”