Messaging and Event-Driven Architecture: Kafka vs RabbitMQ vs SQS
The choice between a message queue and an event streaming platform shapes your architecture more than almost any other infrastructure decision. Getting it wrong means rebuilding — not reconfiguring. Here’s how to think through it.
Start with the distinction between the two models; it matters before you pick a product.
Message queue (RabbitMQ, SQS, ActiveMQ):
- A message is a task or command for a consumer
- Typically consumed once — it’s deleted after successful processing
- Consumer drives the pace — pull or push, but once processed, it’s gone
- Good for: work distribution, background job processing, decoupled command execution
Event streaming (Kafka, Kinesis, Google Pub/Sub):
- An event is a fact — something that happened. It’s retained on the log.
- Multiple independent consumers can read the same events at their own pace
- The log is append-only and retained (configurable, but can be days/weeks/forever)
- Good for: audit trail, replayability, multiple consumers with different read positions, event sourcing, CDC
The test question: “Do you need to replay events? Do multiple independent consumers need to process the same event for different purposes?” If yes, you need event streaming. If it’s just task distribution, a queue is simpler and sufficient.
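The two models can be contrasted with a minimal in-memory sketch (illustrative classes only, not a real broker): a queue deletes on consume, while a log keeps every event and lets each consumer track its own offset.

```python
from collections import deque

class WorkQueue:
    """Queue semantics: a message is consumed once, then gone."""
    def __init__(self):
        self._messages = deque()

    def publish(self, msg):
        self._messages.append(msg)

    def consume(self):
        # Delivered to one consumer, deleted on consumption.
        return self._messages.popleft() if self._messages else None

class EventLog:
    """Log semantics: append-only; each consumer tracks its own offset."""
    def __init__(self):
        self._log = []
        self._offsets = {}  # consumer name -> next offset to read

    def publish(self, event):
        self._log.append(event)

    def consume(self, consumer):
        offset = self._offsets.get(consumer, 0)
        if offset >= len(self._log):
            return None
        self._offsets[consumer] = offset + 1
        return self._log[offset]

queue = WorkQueue()
queue.publish("resize-image-42")
first = queue.consume()   # the task
second = queue.consume()  # None: the message is gone

log = EventLog()
log.publish("order-placed-7")
a = log.consume("analytics")      # the event
b = log.consume("fraud-scoring")  # same event, independent offset
```

The key difference is visible in the state: the queue's message disappears after one read, while both log consumers see the same event because each keeps its own position.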
Kafka is the dominant event streaming platform. It’s designed for high-throughput, ordered, durable, replayable event logs.
Kafka wins when:
- You have high write volume (millions of events/second)
- Multiple consumers need to process the same events independently (analytics + order processing + fraud scoring all from the same order event)
- You need replay — re-process historical events for a new consumer, replay after bug fix, backfill a new data store
- You need strictly ordered processing within a partition
- Event sourcing — your system’s state is derived from the event log
- CDC pipeline — database changes published as events
Kafka’s costs:
- Operational complexity — Zookeeper (pre-3.3) or KRaft, broker sizing, partition count decisions, consumer group management, rebalancing, lag monitoring
- Not a queue — consumer state (offset) is managed by the consumer. At-least-once delivery is the norm. Exactly-once is possible but requires transactional producers and idempotent consumers.
- Partition count is set at topic creation. You can add partitions later, but that remaps keys and breaks per-key ordering, and you can never shrink the count
- Latency floor is ~5ms; not designed for ultra-low-latency use cases
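Both the per-partition ordering guarantee and the partition-count caveat come down to key-based partition assignment. A rough sketch (CRC32 here stands in for Kafka's murmur2 hash; the principle is the same):

```python
import zlib

# Illustrative partitioner: CRC32 standing in for Kafka's murmur2.
def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode()) % num_partitions

NUM_PARTITIONS = 12
p1 = partition_for("customer-123", NUM_PARTITIONS)
p2 = partition_for("customer-123", NUM_PARTITIONS)
# Same key, same partition: all events for customer-123 are totally
# ordered relative to each other.

# Changing the partition count can send the same key elsewhere, which
# is why adding partitions later disrupts per-key ordering.
p3 = partition_for("customer-123", NUM_PARTITIONS + 1)
```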
“Your team wants to introduce Kafka — what questions do you ask?”
- What problem is Kafka solving that a simple queue or synchronous call doesn’t solve?
- Who will operate it? Do we have Kafka expertise or budget for managed Kafka (Confluent Cloud, MSK)?
- Do we need replayability / multiple consumers / high throughput, or just decoupling?
- What’s the schema evolution strategy for event payloads? (Avro + Schema Registry, Protobuf, JSON with versioning?)
- How will we monitor consumer lag and set alerts?
- What’s the data retention requirement?
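Consumer lag, the metric behind the monitoring question above, is just the gap between a partition's newest offset and the group's committed offset. A toy calculation with made-up numbers:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag for one consumer group."""
    return {
        partition: log_end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in log_end_offsets
    }

log_end = {0: 1_000, 1: 1_200, 2: 950}  # broker's latest offsets
committed = {0: 990, 1: 700, 2: 950}    # consumer group's progress
lag = consumer_lag(log_end, committed)
# Alert on any partition falling more than (say) 100 messages behind.
alerting = [p for p, l in lag.items() if l > 100]
```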
RabbitMQ is a traditional message broker: AMQP protocol, exchanges, queues, routing. Simpler to operate than Kafka, well-suited for work distribution.
RabbitMQ wins when:
- You need sophisticated message routing (topic exchanges, header-based routing, dead letter queues)
- You need per-message TTL and priority queues
- Consumer-driven acknowledgement model is important (consume → process → ack/nack)
- Lower throughput requirements (thousands/second, not millions)
- You need complex queuing topologies
- Work distribution where each message goes to exactly one consumer (competing consumers pattern)
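The competing-consumers pattern can be sketched with a plain in-memory queue and threads (illustrative only; `task_done` plays the role of the ack):

```python
import queue
import threading

jobs = queue.Queue()
results = []
lock = threading.Lock()

def worker(name):
    while True:
        job = jobs.get()
        if job is None:          # sentinel: shut down
            jobs.task_done()
            return
        with lock:
            results.append((name, job))
        jobs.task_done()         # the "ack": job is done and gone

for i in range(6):
    jobs.put(f"job-{i}")

workers = [threading.Thread(target=worker, args=(f"w{n}",)) for n in range(3)]
for w in workers:
    w.start()
jobs.join()                      # wait until every job is acked
for _ in workers:
    jobs.put(None)
for w in workers:
    w.join()

# Every job was processed exactly once, by some worker.
processed = sorted(job for _, job in results)
```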
RabbitMQ vs Kafka:
| | RabbitMQ | Kafka |
|---|---|---|
| Model | Message queue | Event log |
| Consumers | One consumer per message | Multiple independent consumers |
| Replay | No | Yes |
| Throughput | Thousands/sec | Millions/sec |
| Retention | Until consumed | Configurable (time or size) |
| Routing | Flexible (exchanges) | Partition-based |
| Ops complexity | Lower | Higher |
| Best for | Task distribution, work queues | Event streaming, CDC, audit |
If you’re on AWS and don’t have strong reasons for self-hosted Kafka or RabbitMQ, SQS + SNS is the path of least resistance.
SQS Standard: At-least-once delivery, best-effort ordering. Simplest, highest throughput.
SQS FIFO: Exactly-once processing, strict ordering (within a message group). Max 3,000 messages/second per queue (with batching). Use when order matters (financial transactions, user command sequences).
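FIFO ordering semantics in miniature (hypothetical messages; strict order within a message group, no ordering guarantee across groups):

```python
from collections import defaultdict

messages = [
    {"group": "account-A", "body": "debit 10"},
    {"group": "account-B", "body": "credit 5"},
    {"group": "account-A", "body": "debit 20"},
]

# Each message group can be handed to a different consumer in
# parallel, but bodies within one group keep their publish order.
by_group = defaultdict(list)
for msg in messages:
    by_group[msg["group"]].append(msg["body"])

order_a = by_group["account-A"]
```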
SNS + SQS fan-out: SNS topic → multiple SQS queues. One event, multiple independent consumers. Approximates Kafka’s multi-consumer model for lower throughput cases.
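The fan-out shape, reduced to an in-memory sketch (queue names are illustrative): one publish, one copy per subscribed queue, each consumed independently.

```python
from collections import deque

# One SNS topic, three subscribed SQS queues (names invented).
subscriptions = {
    "email-queue": deque(),
    "inventory-queue": deque(),
    "fraud-queue": deque(),
}

def publish(event):
    # SNS delivers a copy of the event to every subscribed queue.
    for q in subscriptions.values():
        q.append(event)

publish({"type": "order.placed", "order_id": 42})
email_event = subscriptions["email-queue"].popleft()
fraud_event = subscriptions["fraud-queue"].popleft()
# Both consumers see the same event; consuming one copy does not
# affect the others.
```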
Limitations vs Kafka:
- No replay — messages are deleted after consumption (even in FIFO)
- Max retention 14 days
- No consumer offset management
- Fan-out requires SNS topic + queue per consumer (more infrastructure)
When SQS is enough: Your use case is background jobs, async processing, simple work distribution, and you don’t need replay or multiple consumers reading the same event history.
“Exactly-once” is often misunderstood. There are two levels:
- Exactly-once delivery: the message is delivered from the broker to the consumer exactly once. Kafka supports this with `enable.idempotence=true` plus a `transactional.id`.
- Exactly-once processing (end-to-end): the downstream effect of the message happens exactly once. This requires idempotent consumers: the same message processed twice produces the same result.
The honest answer: Exactly-once delivery is achievable. Exactly-once end-to-end semantics require idempotent consumers, which is a design requirement on your business logic. You cannot guarantee exactly-once without idempotent processing on the consumer side.
Practical approach: Design consumers to be idempotent (deduplicate by event ID), accept at-least-once delivery, and handle duplicates gracefully. This is simpler and more reliable than relying on transactional exactly-once, which has significant throughput overhead and operational complexity.
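A minimal dedup sketch, assuming each event carries a unique ID (a real consumer would persist seen IDs, e.g. in a table with a unique constraint, rather than an in-process set):

```python
# In-memory stand-ins for durable state.
seen_ids = set()
balance = 0

def handle(event):
    """Idempotent consumer: applying the same event twice is a no-op."""
    global balance
    if event["id"] in seen_ids:
        return "duplicate-ignored"
    seen_ids.add(event["id"])
    balance += event["amount"]
    return "applied"

event = {"id": "evt-1", "amount": 100}
first = handle(event)   # applied
second = handle(event)  # at-least-once redelivery: ignored
# balance ends at 100, not 200: effectively exactly-once processing
```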
This comes up for every service interaction. The framework:
Use synchronous REST/gRPC when:
- The caller needs an immediate response with the result
- The operation is quick (< a few hundred ms)
- Failure should be surfaced immediately to the caller
- The client needs to know if the operation succeeded before continuing
- Example: “Is this user authorized?” — you need the answer now
Use async messaging when:
- The operation is long-running or the caller doesn’t need immediate confirmation of completion
- You want to decouple services so a downstream slowdown doesn’t propagate upstream
- Multiple services need to react to the same event
- The operation can be retried without user-visible impact
- Example: “Order placed — trigger inventory reservation, email confirmation, fraud check” — all can happen async
Hybrid pattern (command + event): Accept a request synchronously (validate and persist), return a correlation ID, and process asynchronously. Client polls or receives a callback/webhook. Used in payment processing, video encoding, document generation.
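The hybrid pattern in miniature (function and field names are illustrative; a real system would persist status in a database and hand the work to a queue):

```python
import uuid

statuses = {}  # correlation_id -> status

def accept_request(payload):
    """Synchronous part: validate, persist as PENDING, return an ID."""
    if "document" not in payload:
        raise ValueError("invalid request")
    correlation_id = str(uuid.uuid4())
    statuses[correlation_id] = "PENDING"
    return correlation_id

def process_async(correlation_id):
    """Asynchronous part: a background worker picks this up later."""
    statuses[correlation_id] = "COMPLETED"

def poll(correlation_id):
    """What the client's polling endpoint would return."""
    return statuses.get(correlation_id, "UNKNOWN")

cid = accept_request({"document": "invoice.pdf"})
before = poll(cid)    # client gets an immediate answer: PENDING
process_async(cid)    # later, the worker finishes the job
after = poll(cid)     # COMPLETED
```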
Events accumulate technical debt. A schema you can’t change without breaking consumers is a serious problem. Strategies:
1. Avro + Schema Registry (Confluent/Apicurio): Binary serialization with a central schema registry. Producers/consumers validate compatibility before publishing. Schema evolution rules enforced at write time: backward compatible (add optional fields), forward compatible (remove optional fields), fully compatible.
2. Protobuf: Binary, backward/forward compatible by design if you follow the rules (don’t reuse field numbers, mark removed fields reserved). Good if you already use gRPC.
3. JSON with versioning: Include a version or schemaVersion field. Consumers check and handle accordingly. Flexible but requires discipline — no enforcement at publish time.
4. Event versioning patterns:
- Same topic, versioned field: `{ "version": 2, ... }`. Simple, but consumers must handle multiple versions.
- Separate topics per version: `orders-v1`, `orders-v2`. Clean isolation but proliferates topics.
- Upcasting: the consumer converts v1 events to v2 format at read time. Good for replay scenarios.
EM stance: Enforce schema compatibility programmatically from day one. An ad-hoc JSON schema without enforcement will break consumers within 6 months of the first “quick change.”