Messaging and Event-Driven Architecture: Kafka vs RabbitMQ vs SQS
The choice between a message queue and an event streaming platform shapes your architecture more than almost any other infrastructure decision. Getting it wrong means rebuilding — not reconfiguring. Here’s how to think through it.
Start with the distinction between the two models; it matters before you pick a product.
Message queue (RabbitMQ, SQS, ActiveMQ):
- A message is a task or command for a consumer
- Typically consumed once — it’s deleted after successful processing
- Consumer drives the pace — pull or push, but once processed, it’s gone
- Good for: work distribution, background job processing, decoupled command execution
Event streaming (Kafka, Kinesis, Google Pub/Sub):
- An event is a fact — something that happened. It’s retained on the log.
- Multiple independent consumers can read the same events at their own pace
- The log is append-only and retained (configurable, but can be days/weeks/forever)
- Good for: audit trail, replayability, multiple consumers with different read positions, event sourcing, CDC
The test question: “Do you need to replay events? Do multiple independent consumers need to process the same event for different purposes?” If yes, you need event streaming. If it’s just task distribution, a queue is simpler and sufficient.
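The two models can be contrasted with a minimal in-memory sketch (illustrative classes only, not a real broker): a queue deletes on consume, while a log keeps every event and lets each consumer track its own offset.

```python
from collections import deque

class WorkQueue:
    """Queue semantics: a message is consumed once, then gone."""
    def __init__(self):
        self._messages = deque()

    def publish(self, msg):
        self._messages.append(msg)

    def consume(self):
        # Delivered to one consumer, deleted on consumption.
        return self._messages.popleft() if self._messages else None

class EventLog:
    """Log semantics: append-only; each consumer tracks its own offset."""
    def __init__(self):
        self._log = []
        self._offsets = {}  # consumer name -> next offset to read

    def publish(self, event):
        self._log.append(event)

    def consume(self, consumer):
        offset = self._offsets.get(consumer, 0)
        if offset >= len(self._log):
            return None
        self._offsets[consumer] = offset + 1
        return self._log[offset]

queue = WorkQueue()
queue.publish("resize-image-42")
first = queue.consume()   # the task
second = queue.consume()  # None: the message is gone

log = EventLog()
log.publish("order-placed-7")
a = log.consume("analytics")      # the event
b = log.consume("fraud-scoring")  # same event, independent offset
```

The key difference is visible in the state: the queue's message disappears after one read, while both log consumers see the same event because each keeps its own position.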
Kafka is the dominant event streaming platform. It’s designed for high-throughput, ordered, durable, replayable event logs.
Kafka wins when:
- You have high write volume (millions of events/second)
- Multiple consumers need to process the same events independently (analytics + order processing + fraud scoring all from the same order event)
- You need replay — re-process historical events for a new consumer, replay after bug fix, backfill a new data store
- You need strictly ordered processing within a partition
- Event sourcing — your system’s state is derived from the event log
- CDC pipeline — database changes published as events
Kafka’s costs:
- Operational complexity — Zookeeper (pre-3.3) or KRaft, broker sizing, partition count decisions, consumer group management, rebalancing, lag monitoring
- Not a queue — consumer state (offset) is managed by the consumer. At-least-once delivery is the norm. Exactly-once is possible but requires transactional producers and idempotent consumers.
- Partition count is set at topic creation. You can add partitions later, but that remaps keys and breaks per-key ordering, and you can never shrink the count
- Latency floor is ~5ms; not designed for ultra-low-latency use cases
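Both the per-partition ordering guarantee and the partition-count caveat come down to key-based partition assignment. A rough sketch (CRC32 here stands in for Kafka's murmur2 hash; the principle is the same):

```python
import zlib

# Illustrative partitioner: CRC32 standing in for Kafka's murmur2.
def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode()) % num_partitions

NUM_PARTITIONS = 12
p1 = partition_for("customer-123", NUM_PARTITIONS)
p2 = partition_for("customer-123", NUM_PARTITIONS)
# Same key, same partition: all events for customer-123 are totally
# ordered relative to each other.

# Changing the partition count can send the same key elsewhere, which
# is why adding partitions later disrupts per-key ordering.
p3 = partition_for("customer-123", NUM_PARTITIONS + 1)
```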
“Your team wants to introduce Kafka — what questions do you ask?”
- What problem is Kafka solving that a simple queue or synchronous call doesn’t solve?
- Who will operate it? Do we have Kafka expertise or budget for managed Kafka (Confluent Cloud, MSK)?
- Do we need replayability / multiple consumers / high throughput, or just decoupling?
- What’s the schema evolution strategy for event payloads? (Avro + Schema Registry, Protobuf, JSON with versioning?)
- How will we monitor consumer lag and set alerts?
- What’s the data retention requirement?
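Consumer lag, the metric behind the monitoring question above, is just the gap between a partition's newest offset and the group's committed offset. A toy calculation with made-up numbers:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag for one consumer group."""
    return {
        partition: log_end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in log_end_offsets
    }

log_end = {0: 1_000, 1: 1_200, 2: 950}  # broker's latest offsets
committed = {0: 990, 1: 700, 2: 950}    # consumer group's progress
lag = consumer_lag(log_end, committed)
# Alert on any partition falling more than (say) 100 messages behind.
alerting = [p for p, l in lag.items() if l > 100]
```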
RabbitMQ is a traditional message broker: AMQP protocol, exchanges, queues, routing. Simpler to operate than Kafka, well-suited for work distribution.
RabbitMQ wins when:
- You need sophisticated message routing (topic exchanges, header-based routing, dead letter queues)
- You need per-message TTL and priority queues
- Consumer-driven acknowledgement model is important (consume → process → ack/nack)
- Lower throughput requirements (thousands/second, not millions)
- You need complex queuing topologies
- Work distribution where each message goes to exactly one consumer (competing consumers pattern)
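The competing-consumers pattern can be sketched with a plain in-memory queue and threads (illustrative only; `task_done` plays the role of the ack):

```python
import queue
import threading

jobs = queue.Queue()
results = []
lock = threading.Lock()

def worker(name):
    while True:
        job = jobs.get()
        if job is None:          # sentinel: shut down
            jobs.task_done()
            return
        with lock:
            results.append((name, job))
        jobs.task_done()         # the "ack": job is done and gone

for i in range(6):
    jobs.put(f"job-{i}")

workers = [threading.Thread(target=worker, args=(f"w{n}",)) for n in range(3)]
for w in workers:
    w.start()
jobs.join()                      # wait until every job is acked
for _ in workers:
    jobs.put(None)
for w in workers:
    w.join()

# Every job was processed exactly once, by some worker.
processed = sorted(job for _, job in results)
```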
RabbitMQ vs Kafka:
| | RabbitMQ | Kafka |
|---|---|---|
| Model | Message queue | Event log |
| Consumers | One consumer per message | Multiple independent consumers |
| Replay | No | Yes |
| Throughput | Thousands/sec | Millions/sec |
| Retention | Until consumed | Configurable (time or size) |
| Routing | Flexible (exchanges) | Partition-based |
| Ops complexity | Lower | Higher |
| Best for | Task distribution, work queues | Event streaming, CDC, audit |
If you’re on AWS and don’t have strong reasons for self-hosted Kafka or RabbitMQ, SQS + SNS is the path of least resistance.
SQS Standard: At-least-once delivery, best-effort ordering. Simplest, highest throughput.
SQS FIFO: Exactly-once processing, strict ordering (within a message group). Max 3,000 messages/second per queue (with batching). Use when order matters (financial transactions, user command sequences).
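FIFO ordering semantics in miniature (hypothetical messages; strict order within a message group, no ordering guarantee across groups):

```python
from collections import defaultdict

messages = [
    {"group": "account-A", "body": "debit 10"},
    {"group": "account-B", "body": "credit 5"},
    {"group": "account-A", "body": "debit 20"},
]

# Each message group can be handed to a different consumer in
# parallel, but bodies within one group keep their publish order.
by_group = defaultdict(list)
for msg in messages:
    by_group[msg["group"]].append(msg["body"])

order_a = by_group["account-A"]
```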
SNS + SQS fan-out: SNS topic → multiple SQS queues. One event, multiple independent consumers. Approximates Kafka’s multi-consumer model for lower throughput cases.
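The fan-out shape, reduced to an in-memory sketch (queue names are illustrative): one publish, one copy per subscribed queue, each consumed independently.

```python
from collections import deque

# One SNS topic, three subscribed SQS queues (names invented).
subscriptions = {
    "email-queue": deque(),
    "inventory-queue": deque(),
    "fraud-queue": deque(),
}

def publish(event):
    # SNS delivers a copy of the event to every subscribed queue.
    for q in subscriptions.values():
        q.append(event)

publish({"type": "order.placed", "order_id": 42})
email_event = subscriptions["email-queue"].popleft()
fraud_event = subscriptions["fraud-queue"].popleft()
# Both consumers see the same event; consuming one copy does not
# affect the others.
```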
Limitations vs Kafka:
- No replay — messages are deleted after consumption (even in FIFO)
- Max retention 14 days
- No consumer offset management
- Fan-out requires SNS topic + queue per consumer (more infrastructure)
When SQS is enough: Your use case is background jobs, async processing, simple work distribution, and you don’t need replay or multiple consumers reading the same event history.
“Exactly-once” is often misunderstood. There are two levels:
- Exactly-once delivery: the message is delivered from the broker to the consumer exactly once. Kafka supports this with `enable.idempotence=true` plus a `transactional.id`.
- Exactly-once processing (end-to-end): the downstream effect of the message happens exactly once. This requires idempotent consumers: the same message processed twice produces the same result.
The honest answer: Exactly-once delivery is achievable. Exactly-once end-to-end semantics require idempotent consumers, which is a design requirement on your business logic. You cannot guarantee exactly-once without idempotent processing on the consumer side.
Practical approach: Design consumers to be idempotent (deduplicate by event ID), accept at-least-once delivery, and handle duplicates gracefully. This is simpler and more reliable than relying on transactional exactly-once, which has significant throughput overhead and operational complexity.
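A minimal dedup sketch, assuming each event carries a unique ID (a real consumer would persist seen IDs, e.g. in a table with a unique constraint, rather than an in-process set):

```python
# In-memory stand-ins for durable state.
seen_ids = set()
balance = 0

def handle(event):
    """Idempotent consumer: applying the same event twice is a no-op."""
    global balance
    if event["id"] in seen_ids:
        return "duplicate-ignored"
    seen_ids.add(event["id"])
    balance += event["amount"]
    return "applied"

event = {"id": "evt-1", "amount": 100}
first = handle(event)   # applied
second = handle(event)  # at-least-once redelivery: ignored
# balance ends at 100, not 200: effectively exactly-once processing
```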
This comes up for every service interaction. The framework:
Use synchronous REST/gRPC when:
- The caller needs an immediate response with the result
- The operation is quick (< a few hundred ms)
- Failure should be surfaced immediately to the caller
- The client needs to know if the operation succeeded before continuing
- Example: “Is this user authorized?” — you need the answer now
Use async messaging when:
- The operation is long-running or the caller doesn’t need immediate confirmation of completion
- You want to decouple services so a downstream slowdown doesn’t propagate upstream
- Multiple services need to react to the same event
- The operation can be retried without user-visible impact
- Example: “Order placed — trigger inventory reservation, email confirmation, fraud check” — all can happen async
Hybrid pattern (command + event): Accept a request synchronously (validate and persist), return a correlation ID, and process asynchronously. Client polls or receives a callback/webhook. Used in payment processing, video encoding, document generation.
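The hybrid pattern in miniature (function and field names are illustrative; a real system would persist status in a database and hand the work to a queue):

```python
import uuid

statuses = {}  # correlation_id -> status

def accept_request(payload):
    """Synchronous part: validate, persist as PENDING, return an ID."""
    if "document" not in payload:
        raise ValueError("invalid request")
    correlation_id = str(uuid.uuid4())
    statuses[correlation_id] = "PENDING"
    return correlation_id

def process_async(correlation_id):
    """Asynchronous part: a background worker picks this up later."""
    statuses[correlation_id] = "COMPLETED"

def poll(correlation_id):
    """What the client's polling endpoint would return."""
    return statuses.get(correlation_id, "UNKNOWN")

cid = accept_request({"document": "invoice.pdf"})
before = poll(cid)    # client gets an immediate answer: PENDING
process_async(cid)    # later, the worker finishes the job
after = poll(cid)     # COMPLETED
```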
Events accumulate technical debt. A schema you can’t change without breaking consumers is a serious problem. Strategies:
1. Avro + Schema Registry (Confluent/Apicurio): Binary serialization with a central schema registry. Producers/consumers validate compatibility before publishing. Schema evolution rules enforced at write time: backward compatible (add optional fields), forward compatible (remove optional fields), fully compatible.
2. Protobuf: Binary, backward/forward compatible by design if you follow the rules (don’t reuse field numbers, mark removed fields reserved). Good if you already use gRPC.
3. JSON with versioning: Include a version or schemaVersion field. Consumers check and handle accordingly. Flexible but requires discipline — no enforcement at publish time.
4. Event versioning patterns:
- Same topic, versioned field: `{ "version": 2, ... }`. Simple, but consumers must handle multiple versions.
- Separate topics per version: `orders-v1`, `orders-v2`. Clean isolation but proliferates topics.
- Upcasting: the consumer converts v1 events to v2 format at read time. Good for replay scenarios.
EM stance: Enforce schema compatibility programmatically from day one. An ad-hoc JSON schema without enforcement will break consumers within 6 months of the first “quick change.”