
NoSQL Families: Choosing the Right Tool

NoSQL isn’t a single thing — it’s five different database families with fundamentally different data models, consistency guarantees, and use cases. Using the wrong family (or the wrong database within a family) is a common and costly mistake. Here’s how to think through each one.


Document Stores: MongoDB, DynamoDB, Firestore

Data model: Each record is a self-contained JSON-like document. Documents are grouped into collections, and each document can carry its own structure.

Strengths:

  • Natural fit for entities with variable structure (product catalog, CMS content, user profiles with optional fields)
  • Efficient reads when you need the whole entity (no joins — everything is in one document)
  • Flexible schema for rapid iteration
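To make the "self-contained entity" point concrete, here is a minimal sketch (illustrative data, not a real driver) showing two documents in one collection where optional fields simply appear or don't, and one key lookup returns the whole entity with no joins:

```python
import json

# Hypothetical product documents in one collection: each is self-contained,
# and optional fields (e.g. "specs", "tags") appear only where they apply.
products = [
    {
        "_id": "sku-001",
        "name": "Espresso Machine",
        "price": 349.00,
        "specs": {"pressure_bar": 15, "tank_liters": 1.8},
        "tags": ["kitchen", "coffee"],
    },
    {
        "_id": "sku-002",
        "name": "Gift Card",
        "price": 50.00,
        # no "specs" or "tags" -- digital items simply omit those fields
        "delivery": "email",
    },
]

# One read returns the whole entity: nested fields travel with the document.
by_id = {p["_id"]: p for p in products}
doc = by_id["sku-001"]
print(json.dumps(doc["specs"]))
```

In a relational model the same catalog would need nullable columns or an EAV side table; here each document just describes itself.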

MongoDB:

  • Rich query language — you can query on any field, including nested fields
  • Aggregation pipeline for complex queries
  • Atlas Search for full-text search
  • ACID transactions across multiple documents (with overhead)
  • Good fit: content management, product catalogs, user profiles, applications needing flexible schema and rich querying

DynamoDB:

  • Fully managed, serverless, infinite scale with no ops
  • Single-digit millisecond latency at any scale
  • Massive limitation: you must design your access patterns upfront. You get a primary key + optional sort key, and Global Secondary Indexes (GSIs). Ad-hoc queries across arbitrary fields are painful or impossible.
  • Good fit: high-scale applications with well-defined, limited access patterns — session storage, leaderboards, IoT event data, gaming
  • The EM interview question: “Would you use DynamoDB for user account management?” — Depends on the queries. If it’s always “get user by ID,” fine. If you need “find all users who signed up in the last 30 days with email_verified = false,” you’re fighting DynamoDB.
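The access-pattern constraint is easier to see in a toy model. This in-memory sketch (plain Python, not boto3; key names are illustrative) mimics how DynamoDB makes key lookups cheap while arbitrary-field queries degenerate into scanning everything:

```python
# Toy model of DynamoDB-style access: items live under a
# (partition key, sort key) pair, and the only efficient lookup is by key.
table = {
    ("USER#1", "PROFILE"): {"email_verified": True,  "signup_day": 290},
    ("USER#2", "PROFILE"): {"email_verified": False, "signup_day": 355},
    ("USER#3", "PROFILE"): {"email_verified": False, "signup_day": 200},
}

def get_item(pk, sk):
    """O(1) key lookup -- the access pattern DynamoDB is built for."""
    return table.get((pk, sk))

def scan(predicate):
    """Arbitrary-field query: touches every item (expensive at scale)."""
    return [item for item in table.values() if predicate(item)]

# "Get user by ID" is trivial; "recently signed up and unverified" is a scan.
user = get_item("USER#2", "PROFILE")
recent_unverified = scan(
    lambda it: not it["email_verified"] and it["signup_day"] > 330
)
```

In real DynamoDB you'd precompute a GSI for each such query pattern; anything you didn't anticipate becomes a `Scan`, with cost proportional to table size.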

MongoDB vs DynamoDB:

  • Need rich querying on arbitrary fields → MongoDB
  • Need infinite scale with no ops overhead + access patterns are known + AWS-native → DynamoDB
  • Need multi-region active-active with minimal ops → DynamoDB (Global Tables)

Key-Value Stores: Redis, DynamoDB (KV mode), Memcached

Data model: Pure lookup by key → value. The simplest possible model.

Redis:

  • In-memory with persistence options (RDB snapshots, AOF log)
  • Rich data structures: strings, hashes, lists, sets, sorted sets, streams, bitmaps, HyperLogLog
  • Sorted sets are the power feature: leaderboards, time-series, range queries, rate limiting
  • Pub/sub, Lua scripting, atomic operations (INCR, GETSET)
  • Redis Streams for event sourcing / lightweight message queue
  • Good fit: caching, session storage, rate limiting, leaderboards, real-time analytics, distributed locks, pub/sub
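The sorted-set rate-limiting pattern mentioned above can be sketched in memory (a sliding-window limiter; with real Redis you'd use ZADD, ZREMRANGEBYSCORE, and ZCARD on a per-user key — the class and names here are illustrative):

```python
import time
from bisect import bisect_left, insort

class SlidingWindowLimiter:
    """In-memory sketch of the Redis sorted-set sliding-window pattern."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.hits = {}  # key -> sorted timestamps (stand-in for a sorted set)

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        stamps = self.hits.setdefault(key, [])
        # Drop entries older than the window (ZREMRANGEBYSCORE equivalent).
        del stamps[:bisect_left(stamps, now - self.window)]
        if len(stamps) >= self.limit:  # ZCARD equivalent
            return False
        insort(stamps, now)            # ZADD equivalent
        return True

limiter = SlidingWindowLimiter(limit=3, window_seconds=60)
results = [limiter.allow("user-1", now=t) for t in (0, 10, 20, 30)]
# First three requests pass; the fourth exceeds the limit within the window.
```

The same sorted-set primitive backs leaderboards (`ZRANGE` by score) and time-series range reads, which is why it's called out as the power feature.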

Memcached:

  • Pure LRU cache. No persistence, no rich types, simpler.
  • Slightly faster than Redis for pure cache workloads at extreme scale
  • Multi-threaded by design (Redis executes commands on a single thread; Redis 6 added multi-threaded I/O)
  • The honest truth: almost no new projects should choose Memcached over Redis. Redis does everything Memcached does and more.

Wide-Column Stores: Cassandra, ScyllaDB, HBase

Data model: Tables with rows identified by a partition key. Within a partition, rows are sorted by a clustering key. Partitions distribute across nodes.

Key properties:

  • Designed for extreme write throughput — writes are appended to commit log + memtable (sequential I/O, very fast)
  • Linear horizontal scalability — add nodes, get proportional throughput
  • Tunable consistency — each read or write can require acknowledgement from ONE, QUORUM, or ALL replicas
  • No joins, no transactions across partitions
  • Schema must match your query patterns. You design tables for queries, not for normalization.
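The partition key / clustering key layout can be sketched in a few lines (an in-memory model, not a CQL driver; names are illustrative). The point: rows hash to a partition, stay sorted within it by clustering key, and "last N events for device X" is a cheap in-order slice of one partition:

```python
from bisect import insort
from collections import defaultdict

class WideColumnTable:
    """Sketch of the wide-column layout: sorted rows within partitions."""

    def __init__(self):
        self.partitions = defaultdict(list)  # partition key -> sorted rows

    def insert(self, partition_key, clustering_key, value):
        # Rows stay ordered by clustering key within their partition.
        insort(self.partitions[partition_key], (clustering_key, value))

    def latest(self, partition_key, n):
        """Newest n rows of one partition -- no cross-partition scan."""
        return self.partitions[partition_key][-n:]

events = WideColumnTable()
for ts, reading in [(1, "20C"), (3, "22C"), (2, "21C")]:
    events.insert("sensor-42", ts, reading)
```

A query that can't name a partition key (say, "all readings above 21C across all sensors") has no home partition and would have to touch every node — which is exactly why you design tables per query.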

Cassandra:

  • Write-heavy workloads at scale: time-series data, IoT telemetry, activity logs, audit trails
  • Good for: “write 1M events/second, read the last 100 events for user X”
  • Bad for: ad-hoc queries, aggregations, data with evolving access patterns

ScyllaDB:

  • Drop-in Cassandra replacement written in C++ (vs Java). ~10x higher throughput per node, lower latency, lower operational overhead.
  • If you’re choosing Cassandra, seriously evaluate ScyllaDB first.

Cassandra vs DynamoDB for write-heavy time-series:

  • DynamoDB: no ops, scales automatically, but you pay per WCU and RCU (can get expensive at high volume), less control over data model
  • Cassandra/ScyllaDB: ops overhead but predictable cost at high volume, full control over partitioning strategy
  • At very high write volumes on AWS, DynamoDB becomes expensive faster than running ScyllaDB on EC2

Graph Databases: Neo4j, Amazon Neptune

Data model: Nodes (entities) and edges (relationships), each with properties.

The key insight: Graph databases are for queries where the relationships themselves are the primary data — not just what things are, but how they connect, through how many hops, in what path.

When they win:

  • Fraud detection: “Is this account connected to known fraudulent accounts within 3 hops?”
  • Social networks: “What’s the shortest path between user A and user B? Who do they know in common?”
  • Recommendation engines: “What products did people with similar purchase patterns buy?”
  • Knowledge graphs, dependency mapping, org chart traversal
  • Access control: “Does this user have permission to this resource through any role path?”
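The fraud-detection query above is, at its core, a bounded traversal. Here is a minimal sketch over a toy adjacency list (illustrative account names; in Neo4j this would be a Cypher pattern along the lines of `(a)-[*..3]-(b)`):

```python
from collections import deque

# Toy account graph: "is acct-A within 3 hops of a known-fraud account?"
edges = {
    "acct-A": ["acct-B"],
    "acct-B": ["acct-C", "acct-D"],
    "acct-C": ["acct-FRAUD"],
    "acct-D": [],
}
fraudulent = {"acct-FRAUD"}

def within_hops(start, targets, max_hops):
    """Breadth-first traversal, stopping at max_hops edges from start."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if node in targets:
            return True
        if depth < max_hops:
            for nxt in edges.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return False
```

A graph database stores adjacency directly, so each hop is a pointer chase rather than a join; in a relational schema the same 3-hop question is a triple self-join whose cost grows with every added hop.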

When they lose:

  • Simple entity storage with occasional relationship queries — a relational DB with proper indexes handles this fine
  • High-write-throughput scenarios — graph DBs prioritize relationship traversal, not bulk ingestion
  • Anything where your main query is “give me all nodes of type X” — that’s a table scan, not a graph query

The EM test: If you can frame your key queries as “traverse these relationships” and the relationship depth matters, a graph DB is worth evaluating. If your “graph” queries are just simple joins, stay relational.


Search Engines: Elasticsearch, OpenSearch, Solr

Data model: Inverted index. Documents indexed with full-text analysis, scored by relevance.

What they’re built for:

  • Full-text search with relevance ranking (BM25 algorithm)
  • Faceted search (filter by category AND price range AND brand simultaneously)
  • Aggregations and analytics over large datasets
  • Fuzzy matching, stemming, synonyms, autocomplete
  • Log aggregation and analysis (the “ELK stack” / “EFK stack” for Kubernetes logs)

Elasticsearch as primary store — when it works:

  • Product search where the read pattern is exclusively full-text + faceted search
  • Log/event data where you’re querying recent time windows

Elasticsearch as primary store — the risks:

  • No ACID transactions. Documents become searchable only after the next index refresh (near-real-time, not immediately on write).
  • Not suitable for transactional writes or consistent reads
  • Mappings are largely fixed at index creation — changing an existing field's type requires a full reindex, which is expensive
  • At-scale cluster management is non-trivial (shard sizing, replication, JVM tuning)

The pattern: Use Elasticsearch/OpenSearch as a secondary index synced from your primary database (via CDC or dual-write). Your primary store is Postgres; you index the searchable fields into Elasticsearch for search queries. You lose a small amount of freshness but keep transactional integrity.
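The sync side of that pattern can be sketched with plain dicts standing in for the two stores (illustrative event shape and field names; in practice the events would come from a CDC tool like Debezium):

```python
# Sketch of the secondary-index pattern: the primary store emits change
# events, and a small consumer projects only the searchable fields into
# the search index.
primary = {}       # stands in for Postgres
search_index = {}  # stands in for Elasticsearch/OpenSearch

def apply_change(event):
    """Consume one change event and keep the search index in sync."""
    if event["op"] == "upsert":
        row = event["row"]
        primary[row["id"]] = row
        # Index only the searchable subset, not the whole row.
        search_index[row["id"]] = {"title": row["title"], "tags": row["tags"]}
    elif event["op"] == "delete":
        primary.pop(event["id"], None)
        search_index.pop(event["id"], None)

apply_change({"op": "upsert", "row": {
    "id": 7, "title": "Walking shoes", "tags": ["footwear"],
    "cost_cents": 4999,  # transactional field: lives only in the primary
}})
```

Note the asymmetry: the primary keeps every field and its transactional guarantees, while the index holds a denormalized, slightly stale projection — exactly the trade the paragraph above describes.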

OpenSearch vs Elasticsearch: OpenSearch is the AWS-maintained fork after Elastic changed its license. If you’re on AWS and using managed search, OpenSearch Service is the natural choice. If self-hosting or on GCP, Elasticsearch is fine.


Decision Summary

  • Variable-schema entities, rich queries → MongoDB
  • Infinite scale, known access patterns, AWS-native → DynamoDB
  • Caching, sessions, rate limiting, leaderboards → Redis
  • Extreme write throughput, time-series, append-heavy → Cassandra / ScyllaDB
  • Relationship traversal, fraud detection, social graph → Neo4j / Neptune
  • Full-text search, faceted navigation, log analysis → Elasticsearch / OpenSearch
  • Everything else (default) → PostgreSQL

The most important rule: don’t add a database you don’t need. Every additional store is operational overhead, another thing to monitor, another failure point, another set of runbooks. The default should always be “can Postgres handle this?” The answer is often yes.