System Design: Fraud Detection System
Fraud detection sits at the intersection of real-time systems, machine learning serving, and operational decision-making. The design challenge: you need fast decisions (before the transaction completes) and high accuracy (false positives cost customers, false negatives cost money). Here’s how to design for both.
Functional requirements:
- Score every transaction for fraud risk before authorization (synchronous, < 200ms)
- Async deeper analysis for flagged transactions (minutes to hours)
- Rule-based engine: “block if transaction > $5000 AND new device AND new country”
- ML scoring: multi-feature risk probability score
- Case management: analysts review flagged cases, mark fraud/not fraud
- Feedback loop: analyst decisions feed back into model training
- Account takeover detection (ATO): suspicious login, device fingerprint, velocity
Non-functional requirements:
- Pre-auth decision in < 200ms p99 (synchronous path)
- High availability — if the fraud service is down, do we fail open (allow transactions) or fail closed?
- False positive rate < 0.5% (at most 1 in 200 legitimate transactions blocked)
- False negative rate acceptable up to ~0.1% (fraud loss budget)
If the fraud service is unreachable, does the transaction proceed?
- Fail open (allow): Revenue-first. Users aren’t blocked during outages. Fraud loss increases during downtime.
- Fail closed (block): Safety-first. No fraud during outages. Revenue loss during downtime.
Most e-commerce systems fail open for availability. Accept that a fraud service outage increases fraud loss, rather than stopping all revenue. Use circuit breakers to detect degradation quickly and alert.
Exception: High-risk categories (crypto purchases, large transfers) may fail closed — the fraud cost exceeds the revenue cost of blocking.
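The fail-open behavior pairs naturally with the circuit breaker mentioned above. A minimal sketch — class name, error threshold, and cooldown are illustrative, not a specific library's API:

```python
import time

class FraudServiceBreaker:
    """Fail-open circuit breaker: after too many consecutive errors,
    skip the fraud call entirely and allow transactions for a cooldown."""

    def __init__(self, error_threshold=5, cooldown_s=30.0):
        self.error_threshold = error_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_errors = 0
        self.opened_at = None  # None = circuit closed (normal operation)

    def call(self, score_fn, txn):
        # While the circuit is open, fail open: allow without scoring.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return "ALLOW"          # fail open during the outage
            self.opened_at = None       # cooldown over: try the service again
        try:
            decision = score_fn(txn)
            self.consecutive_errors = 0
            return decision
        except Exception:
            self.consecutive_errors += 1
            if self.consecutive_errors >= self.error_threshold:
                self.opened_at = time.monotonic()  # trip the breaker; alert here
            return "ALLOW"              # an individual failure also fails open
```

A high-risk category would return "BLOCK" instead of "ALLOW" on both failure paths — the same breaker, failing closed.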
Two tiers:
Tier 1 (synchronous, < 200ms): Fast rules + lightweight ML model. Must complete before the payment processor call. Purpose: block obvious fraud immediately.
Tier 2 (async, seconds to minutes): Deep ML analysis, network graph analysis, behavioral analysis. Runs after the transaction is authorized. Purpose: flag transactions for review, trigger holds, initiate disputes for post-settlement fraud.
The split is deliberate: a deep GBM model scoring 300 features in 200ms is possible but expensive. Keep the sync path lightweight; move complexity to the async path.
Transaction Event (from payment service)
│
├─── Sync path (pre-auth, < 200ms):
│ │
│ ┌────▼────────────────────────────────────┐
│ │ Fraud Scoring Service │
│ │ 1. Feature extraction (from cache) │
│ │ 2. Rule engine evaluation │
│ │ 3. Lightweight ML model inference │
│ │ 4. Return: ALLOW / REVIEW / BLOCK │
│ └────────────────────────────────────────┘
│ │
│ Feature Cache (Redis):
│ - User transaction velocity (last 1h, 24h, 7d)
│ - Device history
│ - IP risk score (GeoIP + VPN detection)
│ - Merchant category risk
│
└─── Async path (post-auth):
│
Kafka topic: transaction-events
│
┌────▼────────────────────────────────────┐
│ Deep Analysis Worker │
│ 1. Full ML model (300+ features) │
│ 2. Graph analysis (account network) │
│ 3. Behavioral profiling │
│ 4. Device fingerprint correlation │
└────┬────────────────────────────────────┘
│
┌────▼────────────────┐
│ Case Management │ ← Analyst reviews
│ (flagged txns) │ FRAUD / NOT_FRAUD decision
└────┬────────────────┘
│
Feedback → Model Training Pipeline
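The Tier 1 decision flow in the diagram — rules first, then the lightweight model — can be sketched as follows; the rule matches the example from the requirements, and the 0.9 / 0.6 score thresholds are illustrative:

```python
def score_sync(txn, features, model_score_fn):
    """Tier 1 pre-auth decision: hard rules first, then lightweight ML.
    Returns ALLOW / REVIEW / BLOCK. Thresholds are illustrative."""
    # 1. Deterministic rules catch obvious fraud before the model runs.
    if (txn["amount"] > 5000
            and features["new_device"]
            and features["new_country"]):
        return "BLOCK"
    # 2. Lightweight model on the cached feature vector.
    p_fraud = model_score_fn(features)   # probability in [0, 1]
    if p_fraud >= 0.9:
        return "BLOCK"
    if p_fraud >= 0.6:
        return "REVIEW"  # authorize, but queue for Tier 2 / analyst review
    return "ALLOW"
```

REVIEW transactions proceed through authorization but land on the async path's case-management queue.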
The quality of features determines model accuracy. Critical features:
Velocity features (require fast computation):
- Transaction count in last 1/4/24 hours per user
- Dollar amount in last 1/4/24 hours per user
- Transaction count in last hour per card
- Failed attempts in last hour
Device and location features:
- Is this device seen before for this account?
- Is this IP address associated with a VPN/proxy/Tor?
- Geo-distance from last transaction (impossible travel: two transactions 5000 miles apart in 30 minutes)
- New device + new country combination (very high risk signal)
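The impossible-travel check reduces to great-circle distance over elapsed time. A minimal sketch; the ~600 mph ceiling (roughly commercial-flight speed) is an assumed cutoff:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    r = 3958.8  # Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def impossible_travel(prev, curr, max_mph=600):
    """Flag if the implied speed between consecutive transactions exceeds
    the ceiling. prev/curr are (lat, lon, unix_ts_seconds) tuples."""
    dist = haversine_miles(prev[0], prev[1], curr[0], curr[1])
    hours = max((curr[2] - prev[2]) / 3600.0, 1e-9)
    return dist / hours > max_mph
```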
Behavioral features:
- Transaction at unusual hour for this user (1am when user typically transacts 9am-5pm)
- Merchant category deviation from user’s history
- Amount deviation (user typically spends $20–100, this is $2000)
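Amount deviation can be expressed as a z-score against the user's spending history; a minimal sketch (using the standard library's `statistics` module — the downstream cutoff for "anomalous" is a modeling choice, not fixed here):

```python
import statistics

def amount_deviation(history, amount):
    """How many standard deviations this amount sits above the user's
    historical mean. A user who typically spends $20-100 and suddenly
    spends $2000 produces a very large score."""
    if len(history) < 2:
        return 0.0  # not enough history to judge
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    if std == 0:
        return 0.0 if amount == mean else float("inf")
    return (amount - mean) / std
```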
Network features:
- Is this email address / card / device linked to known fraud accounts?
- Count of accounts sharing this device
- Count of accounts sharing this IP in last 24 hours
Feature computation challenge: Velocity features (count in last hour) can’t be pre-computed cheaply. Use Redis ZADD + ZCOUNT on sorted sets with timestamps to compute rolling windows in O(log N), trimming expired entries with ZREMRANGEBYSCORE so the sets don’t grow unbounded:
ZADD user:123:tx_timestamps <timestamp_ms> <tx_id>
ZCOUNT user:123:tx_timestamps <1h_ago_ms> +inf
ZREMRANGEBYSCORE user:123:tx_timestamps -inf <7d_ago_ms>
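The same rolling-window computation can be sketched in pure Python — an in-memory stand-in for the Redis sorted set, using binary search for the O(log N) range lookup:

```python
import bisect

class VelocityWindow:
    """In-memory rolling-window counter mirroring the Redis sorted-set
    pattern: timestamps kept sorted, counted via binary search."""

    def __init__(self):
        self.timestamps = []  # sorted ascending, one entry per transaction

    def record(self, ts_ms):
        bisect.insort(self.timestamps, ts_ms)            # ~ ZADD

    def count_since(self, cutoff_ms):
        # ~ ZCOUNT key <cutoff_ms> +inf: entries with ts >= cutoff
        return len(self.timestamps) - bisect.bisect_left(self.timestamps, cutoff_ms)

    def trim_before(self, cutoff_ms):
        # ~ ZREMRANGEBYSCORE key -inf <cutoff_ms>: bound memory growth
        self.timestamps = self.timestamps[bisect.bisect_left(self.timestamps, cutoff_ms):]
```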
Model type: Gradient boosting (XGBoost, LightGBM) is the industry standard for tabular fraud features. It typically performs better than deep learning on structured data with this feature profile.
Sync path model: Smaller, faster model. 20–50 features, inference < 20ms. Acceptable AUC 0.85.
Async path model: Full model. 300+ features including graph features. Inference 100ms–1s. Higher AUC.
Serving infrastructure:
- Model stored in object storage (S3), versioned
- Served via TensorFlow Serving, Triton Inference Server, or BentoML
- Models hot-swapped without service restarts
- A/B testing: route X% of traffic to new model, compare fraud rates before full cutover
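The traffic split should be deterministic per user, not random per request, so each user always hits the same model and per-model fraud rates are comparable. A sketch (function name and the 5% default are illustrative):

```python
import hashlib

def model_variant(user_id: str, challenger_pct: int = 5) -> str:
    """Deterministically route a user to champion or challenger.
    Hashing the user id keeps the assignment stable across requests."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < challenger_pct else "champion"
```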
Model drift monitoring: Fraud patterns change (fraudsters adapt). Monitor:
- Feature distribution shift (PSI — Population Stability Index)
- Model score distribution shift
- False positive/negative rate over time
Retrain on a regular cadence + feedback from analyst labels.
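PSI compares the binned distribution of a feature at serving time against its training-time distribution: PSI = Σᵢ (aᵢ − eᵢ) · ln(aᵢ / eᵢ) over bin proportions. A minimal sketch with equal-width bins (a production feature store would typically use quantile bins):

```python
import math

def psi(expected, actual, n_bins=10):
    """Population Stability Index between two samples of one feature.
    Bin edges come from the expected (training-time) sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drifted."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / n_bins for i in range(1, n_bins)]

    def proportions(sample):
        counts = [0] * n_bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1  # bin index via edges
        # Small epsilon avoids log(0) / division by zero for empty bins.
        return [(c + 1e-6) / (len(sample) + 1e-6 * n_bins) for c in counts]

    e_props, a_props = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_props, a_props))
```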
Blocking a legitimate transaction is expensive:
- Customer experience damage (frustrated user, potential churn)
- Support cost (customer calls to dispute the block)
- Revenue loss
The false positive / false negative trade-off:
- Tighter threshold → fewer false negatives (catch more fraud) → more false positives (block legit users)
- Looser threshold → fewer false positives → more false negatives (miss more fraud)
Optimizing the threshold: Set different thresholds per risk segment. High-risk merchants (crypto, gift cards) → tighter. Low-risk merchants → looser. Premium customers with long history → much looser (their fraud rate is lower and the business cost of blocking them is higher).
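Per-segment thresholds amount to a lookup before the comparison; a minimal sketch with illustrative numbers (lower threshold = blocks more):

```python
# Illustrative block thresholds on the model's fraud probability.
BLOCK_THRESHOLDS = {
    "high_risk_merchant": 0.70,   # crypto, gift cards: block earlier
    "default": 0.90,
    "trusted_customer": 0.97,     # long history: much looser
}

def decide(segment: str, p_fraud: float) -> str:
    threshold = BLOCK_THRESHOLDS.get(segment, BLOCK_THRESHOLDS["default"])
    return "BLOCK" if p_fraud >= threshold else "ALLOW"
```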
Soft declines vs hard declines:
- Hard decline: transaction blocked outright
- Soft decline with step-up: “We noticed unusual activity. Please verify via OTP” → user verifies → proceeds. Preserves revenue, reduces false positives at slight friction cost.
Analyst decisions are the training signal. If an analyst marks a REVIEW transaction as NOT_FRAUD, the model learns from this.
Important: Labels are delayed and noisy.
- Chargebacks (confirmed fraud) arrive weeks after the transaction
- Analyst decisions introduce human bias
- Not all fraud is disputed (some users don’t notice small fraudulent charges)
Training data pipeline:
- Transaction events → feature store
- Labels from analyst decisions + chargebacks (delayed labels)
- Model training job (weekly or on-demand)
- Champion/challenger testing before production deployment
- Why not just use rules? Rules are interpretable and fast but brittle. Fraudsters learn the rules and adapt. ML generalizes to new fraud patterns. Ideal: rules for known patterns + ML for novel patterns.
- How do you detect account takeover vs payment fraud? ATO: behavioral signals at login (unusual device, IP, typing cadence). Payment fraud: signals at transaction time. Two models, two pipelines, shared feature infrastructure.
- Velocity limits as a fraud signal: 10 failed card attempts in 5 minutes is a carding attack. This is a rule, not ML. Rules handle these obvious cases; ML handles the subtle ones.
- Graph analysis: Fraudsters often reuse devices, IP addresses, email patterns across multiple accounts. Querying a graph of account-device-IP relationships reveals rings. Graph DB (Neo4j) or graph compute (Spark GraphX) for batch; in-memory graph for real-time.
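Ring detection over shared devices/IPs reduces to finding connected components in the account-attribute graph. A minimal union-find sketch (the edge data is illustrative):

```python
def fraud_rings(edges):
    """Group accounts into rings via shared attributes (device, IP, email).
    edges: (account, attribute) pairs; accounts sharing any attribute,
    directly or transitively, land in the same component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Namespace accounts and attributes so ids can't collide.
    for account, attr in edges:
        union(("acct", account), ("attr", attr))

    rings = {}
    for account, _ in edges:
        rings.setdefault(find(("acct", account)), set()).add(account)
    return [r for r in rings.values() if len(r) > 1]  # singletons aren't rings
```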