Data Pipeline and Analytics: OLTP vs OLAP, Batch vs Streaming, CDC
As systems grow, the gap between operational data (what your application uses to run) and analytical data (what your business uses to make decisions) becomes significant. Understanding how to design data pipelines that bridge this gap is an EM-level concern.
OLTP (Online Transaction Processing):
- Handles operational workload — your application’s reads and writes
- Optimized for: fast, low-latency reads and writes on individual rows or small sets
- Schema design: normalized (3NF) to minimize write anomalies
- Example queries: “Get user #12345”, “Insert new order”, “Update inventory for SKU ABC”
- Database: PostgreSQL, MySQL, DynamoDB
OLAP (Online Analytical Processing):
- Handles analytical workload — reporting, BI dashboards, data science
- Optimized for: fast reads across large datasets (millions/billions of rows), aggregations, GROUP BY, JOINs across large tables
- Schema design: denormalized (star schema, wide tables) to minimize JOIN cost at query time
- Example queries: “Revenue by country by week for the last 2 years”, “Cohort retention analysis”, “Funnel conversion rates”
- Database: BigQuery, Snowflake, Redshift, Databricks, ClickHouse
Why they don’t mix:
A complex analytics query (SELECT country, SUM(revenue) FROM orders JOIN users ... GROUP BY country) running on your OLTP database will hold locks, saturate I/O, and compete with your transactional workload. Running analytical queries on your production DB is a common early-stage pattern that breaks as the system scales.
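To make the query shape concrete, here is that kind of analytical aggregation run against an in-memory SQLite database standing in for the OLTP store. The schema and sample rows are illustrative; the point is the full-table JOIN + GROUP BY shape:

```python
import sqlite3

# In-memory SQLite as a stand-in for an OLTP database; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, revenue REAL);
    INSERT INTO users  VALUES (1, 'DE'), (2, 'US'), (3, 'DE');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 2, 50.0), (12, 3, 25.0);
""")

# The analytical shape: join and aggregate over entire tables. On a real OLTP
# database with millions of rows this scans everything, which is why the query
# belongs in a warehouse, not on the transactional path.
rows = conn.execute("""
    SELECT u.country, SUM(o.revenue) AS revenue
    FROM orders o JOIN users u ON u.id = o.user_id
    GROUP BY u.country
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('DE', 125.0), ('US', 50.0)]
```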
Production read replica — acceptable when:
- Team is small, data volume is manageable (< tens of millions of rows)
- Analytical queries are infrequent and run off-hours
- The replica runs on a separate instance from the primary, so heavy analytical scans don't compete with production traffic
- Query complexity is moderate — no multi-minute scans
Data warehouse needed when:
- Analytical queries take minutes and are run frequently (by multiple analysts/BI tools)
- You need to join data from multiple systems (orders from Postgres + events from Kafka + CRM from Salesforce)
- Historical data exceeds what fits efficiently in the OLTP database
- You need isolation — analytics should never touch production infrastructure
- Data must be transformed before use (cleansing, enrichment, aggregation)
The data warehouse as a separate system: Data is extracted from operational systems, transformed, and loaded (ETL) or loaded then transformed (ELT). The warehouse has its own schema optimized for analytics. Analysts and BI tools query the warehouse, never production.
Batch processing: Process a large dataset in bulk, on a schedule. ETL jobs that run nightly, weekly aggregations, end-of-day reports.
When batch is right:
- The business insight doesn’t require real-time freshness (daily reports, weekly metrics)
- Processing is too expensive to run continuously (complex ML feature computation)
- The data volume is too large to process incrementally without windowing
- The job can be made idempotent, so a failed run is simply re-run
Tools: Spark, Flink (batch mode), dbt (SQL transforms), Airflow/Prefect for orchestration.
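The idempotency bullet can be sketched with the delete-then-insert partition pattern, here using stdlib sqlite3 as a stand-in; table and column names are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, day TEXT, revenue REAL);
    CREATE TABLE daily_revenue (day TEXT, revenue REAL);
    INSERT INTO orders VALUES
        (1, '2024-05-01', 10.0), (2, '2024-05-01', 15.0), (3, '2024-05-02', 7.0);
""")

def run_daily_rollup(conn, day):
    """Recompute one day's aggregate. Deleting the partition before inserting
    makes a re-run after failure safe: it cannot double-count."""
    with conn:  # single transaction: delete + insert commit together
        conn.execute("DELETE FROM daily_revenue WHERE day = ?", (day,))
        conn.execute(
            "INSERT INTO daily_revenue "
            "SELECT day, SUM(revenue) FROM orders WHERE day = ? GROUP BY day",
            (day,),
        )

run_daily_rollup(conn, "2024-05-01")
run_daily_rollup(conn, "2024-05-01")  # safe to re-run: still exactly one row
print(conn.execute("SELECT * FROM daily_revenue").fetchall())  # [('2024-05-01', 25.0)]
```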
Streaming processing: Process events as they arrive. A Kafka consumer reads events, applies logic, outputs results — latency measured in seconds, not hours.
When streaming is right:
- Real-time dashboards (fraud alerts, system monitoring, live metrics)
- Event-driven business logic that must react quickly (inventory reservation, fraud detection, real-time recommendations)
- Continuous aggregations (rolling window metrics: “orders in the last 5 minutes”)
- Notification/alerting systems
Tools: Apache Flink, Kafka Streams, Spark Structured Streaming, Apache Samza.
The streaming complexity cost: Exactly-once semantics, stateful stream processing, out-of-order event handling, watermarking for late events, checkpoint/state management — streaming requires expertise that batch doesn’t. Don’t use streaming “because it’s modern.” Use it when freshness requirements genuinely justify the complexity.
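To ground the windowing and watermark vocabulary, here is a toy tumbling-window counter in plain Python, a drastically simplified model of what Flink or Kafka Streams manage for you; the window size, lateness bound, and event stream are all invented for illustration:

```python
from collections import defaultdict

WINDOW = 300            # 5-minute tumbling windows, in seconds
ALLOWED_LATENESS = 60   # watermark trails the max event time seen by 60s

counts = defaultdict(int)   # window start time -> order count
max_event_time = 0
late_dropped = 0

def process(event_time):
    """Assign an event to its tumbling window, dropping it if it arrives
    behind the watermark (a simplified late-event policy)."""
    global max_event_time, late_dropped
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS
    if event_time < watermark:
        late_dropped += 1   # a real engine might route these to a side output
        return
    counts[event_time // WINDOW * WINDOW] += 1

# An out-of-order stream of event timestamps (seconds since some epoch).
for t in [10, 305, 20, 700, 30]:
    process(t)   # 20 and 30 arrive behind the watermark and are dropped

print(dict(counts), late_dropped)  # {0: 1, 300: 1, 600: 1} 2
```

Even this toy version has to choose a lateness bound and decide what to do with dropped events; multiply that by checkpointing and exactly-once delivery to see where the complexity cost comes from.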
Lambda architecture (batch + streaming): Run both a batch layer (high accuracy, complete historical data) and a speed layer (real-time approximation). Results are merged. The goal: accuracy of batch, freshness of streaming. The cost: you maintain two systems, two code paths. Kappa architecture (streaming only) reduces this by making streaming the sole layer, reprocessing historical data via replay.
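The Kappa idea, one code path for both replay and live traffic, can be sketched in a few lines; the event shape and aggregation are illustrative:

```python
def process(state, event):
    """Single code path: the same function handles historical replay and
    live events, so there is no second batch implementation to maintain."""
    state[event["country"]] = state.get(event["country"], 0) + event["revenue"]
    return state

historical = [{"country": "DE", "revenue": 100}, {"country": "US", "revenue": 50}]
live       = [{"country": "DE", "revenue": 25}]

# Rebuild state by replaying the retained log, then keep consuming live events.
state = {}
for event in historical + live:
    state = process(state, event)

print(state)  # {'DE': 125, 'US': 50}
```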
CDC (Change Data Capture): CDC captures the changes in a database (INSERT, UPDATE, DELETE) and publishes them as a stream of events. Instead of polling the database for changes, you receive them in real time via the transaction log.
How it works (Postgres example):
- Postgres writes every change to its Write-Ahead Log (WAL)
- Debezium (the most popular CDC tool) reads the WAL via replication slot
- Changes are published as events to Kafka
- Consumers read from Kafka and react to the changes
Postgres transaction → WAL → Debezium → Kafka Topic → Consumer
INSERT INTO orders → → → {"op":"c", "after": {"id":1, "status":"PLACED"}}
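A consumer of these events typically dispatches on the operation type. The envelope fields (`op`, `before`, `after`) follow Debezium's change-event format; the handler and the sample events below are illustrative:

```python
def handle_change(event):
    """Dispatch a Debezium-style change event. "op" codes: "c"=create,
    "u"=update, "d"=delete (and "r" for snapshot reads)."""
    op = event["op"]
    if op == "c":
        return ("insert", event["after"])
    if op == "u":
        return ("update", event["after"])
    if op == "d":
        return ("delete", event["before"])   # deletes carry only "before"
    return ("skip", None)                    # e.g. "r" snapshot rows

events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "PLACED"}},
    {"op": "u", "before": {"id": 1, "status": "PLACED"},
                "after":  {"id": 1, "status": "SHIPPED"}},
    {"op": "d", "before": {"id": 1, "status": "SHIPPED"}, "after": None},
]
for e in events:
    print(handle_change(e))
```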
Why CDC instead of dual-write (writing to both DB and Kafka)? Dual-write has a race condition — the DB write and the Kafka publish are not atomic. If the app crashes between them, you get inconsistency. CDC derives the event from the committed DB change — it only fires after the transaction commits. Guaranteed consistency.
CDC use cases:
- Event-driven microservices: Service B reacts to changes in Service A’s database without Service A sending explicit events. Reduces coupling.
- Data replication: Sync data from Postgres to Elasticsearch for search, Redis for cache, BigQuery for analytics — all via CDC pipeline.
- Audit trail: Every change to important entities captured without modifying application code.
- Cache invalidation: When a DB row changes, publish an event → cache consumer invalidates or updates the cache entry. Solves the dual-write invalidation problem.
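A minimal sketch of the cache-invalidation use case, with a plain dict standing in for Redis and an illustrative Debezium-style event:

```python
# Cache keyed by row id; a dict stands in for Redis here.
cache = {1: {"id": 1, "status": "PLACED"}}

def invalidate_on_change(event):
    """Drop the cached entry when the underlying row changes or is deleted.
    The next read repopulates it from the database (cache-aside)."""
    row = event["after"] or event["before"]   # deletes only carry "before"
    cache.pop(row["id"], None)

invalidate_on_change({"op": "u",
                      "before": {"id": 1, "status": "PLACED"},
                      "after":  {"id": 1, "status": "SHIPPED"}})
print(cache)  # {}
```

Because the event is derived from the committed transaction, the invalidation cannot be lost the way a dual-write invalidation can.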
Operational considerations:
- Replication slot backlog risk: if Debezium is down, Postgres must retain WAL for the slot until it is consumed. A large backlog can fill the disk.
- Schema evolution: when you add a column in Postgres, the CDC event schema must adapt. Avro with a schema registry handles this well.
- Ordering guarantees: events are ordered within a Kafka partition, but not across partitions. Design consumers to tolerate out-of-order events for different entities.
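One way to tolerate cross-partition reordering is to track the last applied change per entity and drop stale events. The `version` field here is illustrative; in practice you might compare source LSNs or offsets carried in the event:

```python
last_applied = {}  # entity id -> highest version applied so far

def apply_if_newer(event):
    """Apply a change only if it is newer than what we have already applied
    for that entity; older changes arriving late are dropped as stale."""
    eid, version = event["id"], event["version"]
    if version <= last_applied.get(eid, -1):
        return False               # stale: an older change arrived late
    last_applied[eid] = version
    return True

assert apply_if_newer({"id": 1, "version": 1}) is True
assert apply_if_newer({"id": 1, "version": 3}) is True
assert apply_if_newer({"id": 1, "version": 2}) is False  # out-of-order, dropped
```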
Alternatives to Debezium: AWS DMS (for RDS to Kafka/Kinesis), Google Datastream (GCP), Striim.
For context, the modern data engineering stack looks like:
Operational DBs (Postgres, MySQL, DynamoDB)
↓ CDC (Debezium) or batch extract (Airbyte, Fivetran)
Kafka / Event Stream
↓
Data Warehouse (BigQuery, Snowflake, Redshift)
↓ Transform (dbt — SQL-based transformations)
BI Layer (Looker, Metabase, Mode)
↓
Dashboards / Reports
EM-level framing: When a product manager asks “why don’t we have this analytics report?” the answer often involves one of these layers. Was the data never captured? Is it in the OLTP DB but not the warehouse? Is it in the warehouse but not transformed? Is it transformed but not surfaced in the BI tool? Understanding the stack helps you diagnose data availability problems and have informed conversations with data engineering teams.