System Design: Fraud Detection System
Fraud detection sits at the intersection of real-time systems, machine learning serving, and operational decision-making. The design challenge: you need fast decisions (before the transaction completes) and high accuracy (false positives cost customers, false negatives cost money). Here’s how to design for both.
Functional requirements:
- Score every transaction for fraud risk before authorization (synchronous, < 200ms)
- Async deeper analysis for flagged transactions (minutes to hours)
- Rule-based engine: “block if transaction > $5000 AND new device AND new country”
- ML scoring: multi-feature risk probability score
- Case management: analysts review flagged cases, mark fraud/not fraud
- Feedback loop: analyst decisions feed back into model training
- Account takeover detection (ATO): suspicious login, device fingerprint, velocity
Non-functional requirements:
- Pre-auth decision in < 200ms p99 (synchronous path)
- High availability — if the fraud service is down, do we fail open (allow transactions) or fail closed?
- False positive rate < 0.5% (at most 1 in 200 legitimate transactions blocked)
- False negative rate acceptable up to ~0.1% (fraud loss budget)
If the fraud service is unreachable, does the transaction proceed?
- Fail open (allow): Revenue-first. Users aren’t blocked during outages. Fraud loss increases during downtime.
- Fail closed (block): Safety-first. No fraud during outages. Revenue loss during downtime.
Most e-commerce systems fail open for availability. Accept that a fraud service outage increases fraud loss, rather than stopping all revenue. Use circuit breakers to detect degradation quickly and alert.
Exception: High-risk categories (crypto purchases, large transfers) may fail closed — the fraud cost exceeds the revenue cost of blocking.
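The fail-open behavior pairs naturally with the circuit breaker mentioned above. A minimal sketch — class name, error threshold, and cooldown are illustrative, not a specific library's API:

```python
import time

class FraudServiceBreaker:
    """Fail-open circuit breaker: after too many consecutive errors,
    skip the fraud call entirely and allow transactions for a cooldown."""

    def __init__(self, error_threshold=5, cooldown_s=30.0):
        self.error_threshold = error_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_errors = 0
        self.opened_at = None  # None = circuit closed (normal operation)

    def call(self, score_fn, txn):
        # While the circuit is open, fail open: allow without scoring.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return "ALLOW"          # fail open during the outage
            self.opened_at = None       # cooldown over: try the service again
        try:
            decision = score_fn(txn)
            self.consecutive_errors = 0
            return decision
        except Exception:
            self.consecutive_errors += 1
            if self.consecutive_errors >= self.error_threshold:
                self.opened_at = time.monotonic()  # trip the breaker; alert here
            return "ALLOW"              # an individual failure also fails open
```

A high-risk category would return "BLOCK" instead of "ALLOW" on both failure paths — the same breaker, failing closed.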
Two tiers:
Tier 1 (synchronous, < 200ms): Fast rules + lightweight ML model. Must complete before the payment processor call. Purpose: block obvious fraud immediately.
Tier 2 (async, seconds to minutes): Deep ML analysis, network graph analysis, behavioral analysis. Runs after the transaction is authorized. Purpose: flag transactions for review, trigger holds, initiate disputes for post-settlement fraud.
The split is deliberate: a deep GBM model scoring 300 features in 200ms is possible but expensive. Keep the sync path lightweight; move complexity to the async path.
Transaction Event (from payment service)
│
├─── Sync path (pre-auth, < 200ms):
│ │
│ ┌────▼────────────────────────────────────┐
│ │ Fraud Scoring Service │
│ │ 1. Feature extraction (from cache) │
│ │ 2. Rule engine evaluation │
│ │ 3. Lightweight ML model inference │
│ │ 4. Return: ALLOW / REVIEW / BLOCK │
│ └────────────────────────────────────────┘
│ │
│ Feature Cache (Redis):
│ - User transaction velocity (last 1h, 24h, 7d)
│ - Device history
│ - IP risk score (GeoIP + VPN detection)
│ - Merchant category risk
│
└─── Async path (post-auth):
│
Kafka topic: transaction-events
│
┌────▼────────────────────────────────────┐
│ Deep Analysis Worker │
│ 1. Full ML model (300+ features) │
│ 2. Graph analysis (account network) │
│ 3. Behavioral profiling │
│ 4. Device fingerprint correlation │
└────┬────────────────────────────────────┘
│
┌────▼────────────────┐
│ Case Management │ ← Analyst reviews
│ (flagged txns) │ FRAUD / NOT_FRAUD decision
└────┬────────────────┘
│
Feedback → Model Training Pipeline
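The Tier 1 decision flow in the diagram — rules first, then the lightweight model — can be sketched as follows; the rule matches the example from the requirements, and the 0.9 / 0.6 score thresholds are illustrative:

```python
def score_sync(txn, features, model_score_fn):
    """Tier 1 pre-auth decision: hard rules first, then lightweight ML.
    Returns ALLOW / REVIEW / BLOCK. Thresholds are illustrative."""
    # 1. Deterministic rules catch obvious fraud before the model runs.
    if (txn["amount"] > 5000
            and features["new_device"]
            and features["new_country"]):
        return "BLOCK"
    # 2. Lightweight model on the cached feature vector.
    p_fraud = model_score_fn(features)   # probability in [0, 1]
    if p_fraud >= 0.9:
        return "BLOCK"
    if p_fraud >= 0.6:
        return "REVIEW"  # authorize, but queue for Tier 2 / analyst review
    return "ALLOW"
```

REVIEW transactions proceed through authorization but land on the async path's case-management queue.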
The quality of features determines model accuracy. Critical features:
Velocity features (require fast computation):
- Transaction count in last 1/4/24 hours per user
- Dollar amount in last 1/4/24 hours per user
- Transaction count in last hour per card
- Failed attempts in last hour
Device and location features:
- Is this device seen before for this account?
- Is this IP address associated with a VPN/proxy/Tor?
- Geo-distance from last transaction (impossible travel: two transactions 5000 miles apart in 30 minutes)
- New device + new country combination (very high risk signal)
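The impossible-travel check reduces to great-circle distance over elapsed time. A minimal sketch; the ~600 mph ceiling (roughly commercial-flight speed) is an assumed cutoff:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    r = 3958.8  # Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def impossible_travel(prev, curr, max_mph=600):
    """Flag if the implied speed between consecutive transactions exceeds
    the ceiling. prev/curr are (lat, lon, unix_ts_seconds) tuples."""
    dist = haversine_miles(prev[0], prev[1], curr[0], curr[1])
    hours = max((curr[2] - prev[2]) / 3600.0, 1e-9)
    return dist / hours > max_mph
```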
Behavioral features:
- Transaction at unusual hour for this user (1am when user typically transacts 9am-5pm)
- Merchant category deviation from user’s history
- Amount deviation (user typically spends $20–100, this is $2000)
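Amount deviation can be expressed as a z-score against the user's spending history; a minimal sketch (using the standard library's `statistics` module — the downstream cutoff for "anomalous" is a modeling choice, not fixed here):

```python
import statistics

def amount_deviation(history, amount):
    """How many standard deviations this amount sits above the user's
    historical mean. A user who typically spends $20-100 and suddenly
    spends $2000 produces a very large score."""
    if len(history) < 2:
        return 0.0  # not enough history to judge
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    if std == 0:
        return 0.0 if amount == mean else float("inf")
    return (amount - mean) / std
```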
Network features:
- Is this email address / card / device linked to known fraud accounts?
- Count of accounts sharing this device
- Count of accounts sharing this IP in last 24 hours
Feature computation challenge: Velocity features (count in last hour) can’t be pre-computed cheaply. Use Redis ZADD + ZCOUNT on sorted sets with timestamps to compute rolling windows in O(log N), trimming expired entries with ZREMRANGEBYSCORE so the sets don’t grow unbounded:
ZADD user:123:tx_timestamps <timestamp_ms> <tx_id>
ZCOUNT user:123:tx_timestamps <1h_ago_ms> +inf
ZREMRANGEBYSCORE user:123:tx_timestamps -inf <7d_ago_ms>
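The same rolling-window computation can be sketched in pure Python — an in-memory stand-in for the Redis sorted set, using binary search for the O(log N) range lookup:

```python
import bisect

class VelocityWindow:
    """In-memory rolling-window counter mirroring the Redis sorted-set
    pattern: timestamps kept sorted, counted via binary search."""

    def __init__(self):
        self.timestamps = []  # sorted ascending, one entry per transaction

    def record(self, ts_ms):
        bisect.insort(self.timestamps, ts_ms)            # ~ ZADD

    def count_since(self, cutoff_ms):
        # ~ ZCOUNT key <cutoff_ms> +inf: entries with ts >= cutoff
        return len(self.timestamps) - bisect.bisect_left(self.timestamps, cutoff_ms)

    def trim_before(self, cutoff_ms):
        # ~ ZREMRANGEBYSCORE key -inf <cutoff_ms>: bound memory growth
        self.timestamps = self.timestamps[bisect.bisect_left(self.timestamps, cutoff_ms):]
```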
Model type: Gradient boosting (XGBoost, LightGBM) is the industry standard for tabular fraud features. It typically performs better than deep learning on structured data with this feature profile.
Sync path model: Smaller, faster model. 20–50 features, inference < 20ms. Acceptable AUC 0.85.
Async path model: Full model. 300+ features including graph features. Inference 100ms–1s. Higher AUC.
Serving infrastructure:
- Model stored in object storage (S3), versioned
- Served via TensorFlow Serving, Triton Inference Server, or BentoML
- Models hot-swapped without service restarts
- A/B testing: route X% of traffic to new model, compare fraud rates before full cutover
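The traffic split should be deterministic per user, not random per request, so each user always hits the same model and per-model fraud rates are comparable. A sketch (function name and the 5% default are illustrative):

```python
import hashlib

def model_variant(user_id: str, challenger_pct: int = 5) -> str:
    """Deterministically route a user to champion or challenger.
    Hashing the user id keeps the assignment stable across requests."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < challenger_pct else "champion"
```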
Model drift monitoring: Fraud patterns change (fraudsters adapt). Monitor:
- Feature distribution shift (PSI — Population Stability Index)
- Model score distribution shift
- False positive/negative rate over time
Retrain on a regular cadence + feedback from analyst labels.
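PSI compares the binned distribution of a feature at serving time against its training-time distribution: PSI = Σᵢ (aᵢ − eᵢ) · ln(aᵢ / eᵢ) over bin proportions. A minimal sketch with equal-width bins (a production feature store would typically use quantile bins):

```python
import math

def psi(expected, actual, n_bins=10):
    """Population Stability Index between two samples of one feature.
    Bin edges come from the expected (training-time) sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drifted."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / n_bins for i in range(1, n_bins)]

    def proportions(sample):
        counts = [0] * n_bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1  # bin index via edges
        # Small epsilon avoids log(0) / division by zero for empty bins.
        return [(c + 1e-6) / (len(sample) + 1e-6 * n_bins) for c in counts]

    e_props, a_props = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_props, a_props))
```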
Blocking a legitimate transaction is expensive:
- Customer experience damage (frustrated user, potential churn)
- Support cost (customer calls to dispute the block)
- Revenue loss
The false positive / false negative trade-off:
- Tighter threshold → fewer false negatives (catch more fraud) → more false positives (block legit users)
- Looser threshold → fewer false positives → more false negatives (miss more fraud)
Optimizing the threshold: Set different thresholds per risk segment. High-risk merchants (crypto, gift cards) → tighter. Low-risk merchants → looser. Premium customers with long history → much looser (their fraud rate is lower and the business cost of blocking them is higher).
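Per-segment thresholds amount to a lookup before the comparison; a minimal sketch with illustrative numbers (lower threshold = blocks more):

```python
# Illustrative block thresholds on the model's fraud probability.
BLOCK_THRESHOLDS = {
    "high_risk_merchant": 0.70,   # crypto, gift cards: block earlier
    "default": 0.90,
    "trusted_customer": 0.97,     # long history: much looser
}

def decide(segment: str, p_fraud: float) -> str:
    threshold = BLOCK_THRESHOLDS.get(segment, BLOCK_THRESHOLDS["default"])
    return "BLOCK" if p_fraud >= threshold else "ALLOW"
```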
Soft declines vs hard declines:
- Hard decline: transaction blocked outright
- Soft decline with step-up: “We noticed unusual activity. Please verify via OTP” → user verifies → proceeds. Preserves revenue, reduces false positives at slight friction cost.
Analyst decisions are the training signal. If an analyst marks a REVIEW transaction as NOT_FRAUD, the model learns from this.
Important: Labels are delayed and noisy.
- Chargebacks (confirmed fraud) arrive weeks after the transaction
- Analyst decisions introduce human bias
- Not all fraud is disputed (some users don’t notice small fraudulent charges)
Training data pipeline:
- Transaction events → feature store
- Labels from analyst decisions + chargebacks (delayed labels)
- Model training job (weekly or on-demand)
- Champion/challenger testing before production deployment
- Why not just use rules? Rules are interpretable and fast but brittle. Fraudsters learn the rules and adapt. ML generalizes to new fraud patterns. Ideal: rules for known patterns + ML for novel patterns.
- How do you detect account takeover vs payment fraud? ATO: behavioral signals at login (unusual device, IP, typing cadence). Payment fraud: signals at transaction time. Two models, two pipelines, shared feature infrastructure.
- Velocity limits as a fraud signal: 10 failed card attempts in 5 minutes is a carding attack. This is a rule, not ML. Rules handle these obvious cases; ML handles the subtle ones.
- Graph analysis: Fraudsters often reuse devices, IP addresses, email patterns across multiple accounts. Querying a graph of account-device-IP relationships reveals rings. Graph DB (Neo4j) or graph compute (Spark GraphX) for batch; in-memory graph for real-time.
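Ring detection over shared devices/IPs reduces to finding connected components in the account-attribute graph. A minimal union-find sketch (the edge data is illustrative):

```python
def fraud_rings(edges):
    """Group accounts into rings via shared attributes (device, IP, email).
    edges: (account, attribute) pairs; accounts sharing any attribute,
    directly or transitively, land in the same component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Namespace accounts and attributes so ids can't collide.
    for account, attr in edges:
        union(("acct", account), ("attr", attr))

    rings = {}
    for account, _ in edges:
        rings.setdefault(find(("acct", account)), set()).add(account)
    return [r for r in rings.values() if len(r) > 1]  # singletons aren't rings
```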