System Design: Payment Gateway
Designing a payment gateway is one of the highest-stakes system design questions. Every architectural decision has a financial consequence: double charges, lost transactions, or fraud exposure. The interviewer is testing your understanding of consistency, idempotency, and the realities of financial systems.
Functional:
- Accept payment instrument (card, bank account, wallet) and charge it
- Support 3DS (3D Secure) authentication flow for card payments
- Async settlement — funds eventually transferred to merchant
- Refunds and partial refunds
- Transaction status API — client can poll for async results
- Webhook notifications on payment state changes
- Idempotent payment requests — retrying doesn’t double-charge
Non-functional:
- No double charges under any failure scenario
- Transaction durability — no lost payments once accepted
- p99 latency < 3s for synchronous payment initiation
- High availability for the charge endpoint (payment failure = revenue loss)
- PCI DSS compliance — cardholder data must be handled securely
- Audit trail for all transaction state changes
The defining challenge of payment systems: a client submits a payment, the server processes it, but the response is lost in transit. The client retries. Does the payment happen twice?
Solution: Client-generated idempotency keys
The client generates a unique idempotency key (UUID) for each payment intent. The server stores (idempotency_key, result). On retry with the same key, the server returns the stored result without re-processing.
Client sends: POST /payments
{ "amount": 100, "currency": "USD", "idempotency_key": "uuid-1234" }
Server:
1. Check idempotency store: key "uuid-1234" exists? → return stored result
2. Not found → process payment → store result with key → return result
Client retry (same idempotency key):
→ Server returns stored result, no re-processing
Implementation details:
- Idempotency key has a TTL (24 hours is typical)
- Store the key atomically with the result: INSERT INTO idempotency_keys ... ON CONFLICT DO NOTHING
- The idempotency check and payment creation must happen in the same transaction
- If the payment is still in-flight (processing), return 202 Accepted with status: PROCESSING
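The check-and-create step can be sketched as follows. This is a minimal illustration, not a production design: it uses an in-memory SQLite database in place of Postgres, and the table layout and the names `open_db` and `create_payment` are assumptions made for the example. Amounts are assumed to be integers in minor units (cents). The `ON CONFLICT ... DO NOTHING` clause needs SQLite 3.24 or newer.

```python
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical tables for the sketch."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, payment_id TEXT NOT NULL);
        CREATE TABLE payments (id TEXT PRIMARY KEY, amount INTEGER NOT NULL,
                               currency TEXT NOT NULL, status TEXT NOT NULL);
    """)
    return conn


def create_payment(conn, key, amount, currency):
    """Return (payment_id, created). A retry with the same key returns the stored id."""
    payment_id = str(uuid.uuid4())
    with conn:  # one transaction: both inserts commit together or not at all
        cur = conn.execute(
            "INSERT INTO idempotency_keys (key, payment_id) VALUES (?, ?) "
            "ON CONFLICT (key) DO NOTHING",
            (key, payment_id),
        )
        if cur.rowcount == 0:  # key already claimed by an earlier request
            row = conn.execute(
                "SELECT payment_id FROM idempotency_keys WHERE key = ?", (key,)
            ).fetchone()
            return row[0], False
        conn.execute(
            "INSERT INTO payments (id, amount, currency, status) "
            "VALUES (?, ?, ?, 'INITIATED')",
            (payment_id, amount, currency),
        )
    return payment_id, True
```

Calling `create_payment(conn, "uuid-1234", 10000, "USD")` twice yields the same payment id and leaves a single row in `payments`. Key expiry (the TTL) is left out of the sketch.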
Between your system and the payment processor (Stripe, Braintree, Adyen), you face the same problem: your call to the processor might fail after the processor charged the card.
Solution: Use the processor’s own idempotency keys. Stripe, for example, accepts an Idempotency-Key header — retry the same API call with the same key and Stripe returns the original result.
If the processor doesn’t support idempotency keys (older processors), you must query the processor for the transaction state before retrying a charge.
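The query-before-retry fallback might look like the sketch below. `FlakyProcessor` is a hypothetical stand-in for a legacy processor, not a real SDK: it charges the card, loses the first response, and exposes a `lookup` call keyed by a merchant reference. All names here are assumptions for illustration.

```python
class FlakyProcessor:
    """Hypothetical legacy processor: no idempotency keys, lookup by merchant reference."""

    def __init__(self):
        self.charges = {}  # merchant_ref -> total amount charged
        self.drop_next_response = True

    def charge(self, merchant_ref, amount):
        self.charges[merchant_ref] = self.charges.get(merchant_ref, 0) + amount
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response lost after the card was charged")
        return "charged"

    def lookup(self, merchant_ref):
        return "charged" if merchant_ref in self.charges else None


def charge_once(processor, merchant_ref, amount, max_attempts=3):
    """Charge at most once by querying the processor before every attempt."""
    for _ in range(max_attempts):
        # Ask the processor what it already knows before sending another charge.
        if processor.lookup(merchant_ref) == "charged":
            return "charged"
        try:
            return processor.charge(merchant_ref, amount)
        except TimeoutError:
            continue  # outcome unknown: loop back and query before retrying
    return "unknown"
```

With the lost response simulated above, `charge_once` reports `"charged"` on the second pass while the card is only charged once.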
Payments are never just “succeeded” or “failed” — they go through states:
INITIATED → AUTHORIZING → AUTHORIZED → CAPTURING → CAPTURED → SETTLED
                 ↓                         ↓
            AUTH_FAILED             CAPTURE_FAILED
                 ↓
              FAILED
CAPTURED → REFUNDING → REFUNDED (full or partial)
Each state transition is an event, stored immutably. Never update a payment record in place — append state transition events (event sourcing is natural here).
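A small sketch of an append-only state machine, assuming an in-memory list as the event store. The `ALLOWED` table covers only the transitions that are unambiguous in the diagram; where the terminal FAILED state attaches is left out. `Payment` and `transition` are hypothetical names for the example.

```python
from dataclasses import dataclass, field

# Legal transitions; anything not listed here is rejected.
ALLOWED = {
    "INITIATED": {"AUTHORIZING"},
    "AUTHORIZING": {"AUTHORIZED", "AUTH_FAILED"},
    "AUTHORIZED": {"CAPTURING"},
    "CAPTURING": {"CAPTURED", "CAPTURE_FAILED"},
    "CAPTURED": {"SETTLED", "REFUNDING"},
    "REFUNDING": {"REFUNDED"},
}


@dataclass
class Payment:
    payment_id: str
    events: list = field(default_factory=list)  # append-only transition log

    @property
    def state(self):
        # Current state is derived from the log, never stored and updated in place.
        return self.events[-1][1] if self.events else "INITIATED"

    def transition(self, new_state):
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.events.append((self.state, new_state))
```

Because the log is the source of truth, the full history of a payment can be replayed for audits, and an illegal jump such as CAPTURED to AUTHORIZED raises instead of silently overwriting state.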
The accounting invariant: for every debit, there must be an equal credit. No money appears or disappears.
Double-entry ledger:
-- Every transaction creates two ledger entries, written in one database transaction
BEGIN;
INSERT INTO ledger_entries (account_id, amount, type, tx_id) VALUES
  ('merchant_escrow_account', 100.00, 'CREDIT', 'tx-123'),  -- merchant receives
  ('customer_account', -100.00, 'DEBIT', 'tx-123');  -- customer pays
COMMIT;
-- Amounts are signed, so an account balance is SUM(amount) and each tx_id sums to zero
-- If these two rows aren't both committed, the books don't balance
Never update a ledger entry. Every adjustment is a new entry (reversal = negative entry).
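The invariant can be checked with a short sketch. It assumes signed integer amounts in cents and an in-memory SQLite table; `open_ledger`, `post`, and `balance` are hypothetical helper names, and the schema is simplified for illustration.

```python
import sqlite3


def open_ledger() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ledger_entries (account_id TEXT NOT NULL, "
        "amount_cents INTEGER NOT NULL, tx_id TEXT NOT NULL)"
    )
    return conn


def post(conn, tx_id, debit_account, credit_account, amount_cents):
    """Append a balanced pair of rows; both commit or neither does."""
    with conn:
        conn.executemany(
            "INSERT INTO ledger_entries (account_id, amount_cents, tx_id) "
            "VALUES (?, ?, ?)",
            [
                (credit_account, amount_cents, tx_id),   # money arrives here
                (debit_account, -amount_cents, tx_id),   # money leaves here
            ],
        )


def balance(conn, account_id):
    """An account balance is the sum of its signed entries."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM ledger_entries "
        "WHERE account_id = ?",
        (account_id,),
    ).fetchone()
    return row[0]
```

A charge is `post(conn, "tx-123", "customer_account", "merchant_escrow_account", 10000)`. A partial refund is a new pair of rows with the accounts swapped, so no existing row is ever touched and the sum over the whole table stays at zero.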
Client
  │
  ├─ POST /payments (synchronous initiation)
  │
  ▼
API Gateway (auth, rate limiting, TLS termination)
  │
  ▼
Payment Service (core)
  ├── Idempotency check → Idempotency DB (Redis or Postgres with TTL)
  ├── Validate → fraud scoring (sync, pre-auth)
  ├── Write INITIATED record → Payments DB (Postgres)
  ├── 3DS required? → Return requires_action with redirect URL
  └── Call Payment Processor (Stripe/Adyen)
        │ success → record AUTHORIZED in DB
        │ failure → record FAILED in DB
        └── Async: publish PaymentAuthorized/Failed event → Kafka
Settlement Service (async)
  Listens to events → triggers capture (AUTHORIZED → CAPTURED) → SETTLED
Notification Service
  Listens to events → sends webhooks to merchants
Ledger Service
  Listens to CAPTURED/SETTLED events → records double-entry ledger rows
Handling raw card numbers (PAN) puts your entire system in PCI scope — expensive, complex compliance.
Best practice: Use a payment processor’s tokenization. The client sends card data directly to Stripe’s JS library or Adyen’s hosted fields. The processor returns a token. Your backend only ever sees the token — never the card number.
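This sharply reduces your PCI scope, typically to SAQ A (fully hosted fields, iframes, or redirects) or SAQ A-EP (your page still influences how card data reaches the processor). SAQ A is the lightest self-assessment; SAQ A-EP carries more requirements but is still far smaller than the full SAQ D.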
Never store: full card numbers (PAN), CVV/CVC (PCI DSS prohibits retaining it after authorization, even encrypted), or magnetic stripe data.
3DS is an additional authentication step where the cardholder proves identity to their bank.
1. Your frontend initiates payment with card token
2. Your backend calls processor → processor returns requires_action with 3DS URL
3. Your frontend redirects user to 3DS page (bank's page)
4. User authenticates (OTP, biometric)
5. Bank redirects back to your return URL with a result token
6. Your backend calls processor to complete the payment with the 3DS token
7. Payment succeeds (liability shifted to bank for fraud)
Why 3DS matters architecturally: The payment flow is asynchronous — you initiate, wait for user action, then complete. Your system must store the pending payment state and pick up where it left off when the user returns.
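The pause-and-resume shape of this flow can be sketched as below. The in-memory `PENDING` dict stands in for the Payments DB, and the `REQUIRES_ACTION` status, the return token, and both function names are assumptions for illustration rather than any processor's actual API.

```python
import secrets

PENDING = {}  # return_token -> payment record; stands in for the Payments DB


def start_payment(payment_id, needs_3ds):
    """Initiate a payment; park it if the cardholder must authenticate first."""
    if not needs_3ds:
        return {"status": "AUTHORIZED"}
    token = secrets.token_urlsafe(16)
    # Persist the pending payment so a later request can resume it.
    PENDING[token] = {"payment_id": payment_id, "status": "REQUIRES_ACTION"}
    return {"status": "requires_action", "return_token": token}


def complete_payment(return_token, three_ds_passed):
    """Resume the parked payment when the user comes back from the bank page."""
    record = PENDING.pop(return_token, None)  # single use: a replay finds nothing
    if record is None:
        raise KeyError("unknown or already used return token")
    record["status"] = "AUTHORIZED" if three_ds_passed else "AUTH_FAILED"
    return record
```

The two functions would normally run in separate HTTP requests, possibly on different servers, which is why the pending record has to live in shared storage rather than in process memory.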
Network timeout calling the processor:
- Retry with the same idempotency key
- If processor confirms the charge succeeded: record as AUTHORIZED
- If processor confirms the charge failed: record as FAILED
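- If the processor can’t be reached after N retries: leave the payment in PENDING (the outcome is unknown, so it is neither AUTHORIZED nor FAILED) and retry asynchronously via a background job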
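The timeout handling above can be sketched as a bounded retry loop. `call_processor` is a hypothetical callable that returns `"succeeded"` or `"declined"` or raises `TimeoutError`; the function name and the backoff parameters are assumptions for the example.

```python
import time


def authorize_with_retries(call_processor, idempotency_key,
                           max_attempts=3, base_delay=0.0):
    """Map processor outcomes to payment states, retrying only on timeouts."""
    for attempt in range(max_attempts):
        try:
            outcome = call_processor(idempotency_key)  # same key on every attempt
        except TimeoutError:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
            continue
        return "AUTHORIZED" if outcome == "succeeded" else "FAILED"
    return "PENDING"  # unresolved: hand off to an async retry job
```

Only a confirmed answer from the processor moves the payment to AUTHORIZED or FAILED. A run of timeouts never records FAILED, because the card may in fact have been charged.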
Double webhook from processor:
- Process the first; the second must be idempotent
- Store processor_event_id with a uniqueness constraint, so duplicate events fail the insert and are discarded
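A minimal sketch of webhook deduplication, assuming an in-memory SQLite table whose primary key is the processor's event id. `open_events_db`, `handle_webhook`, and the `apply` callback are hypothetical names for illustration.

```python
import sqlite3


def open_events_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE processor_events "
        "(processor_event_id TEXT PRIMARY KEY, payload TEXT)"
    )
    return conn


def handle_webhook(conn, processor_event_id, payload, apply):
    """Return True if the event was processed, False if it was a duplicate."""
    try:
        with conn:  # insert and side effects share one transaction
            conn.execute(
                "INSERT INTO processor_events VALUES (?, ?)",
                (processor_event_id, payload),
            )
            apply(payload)  # runs only for the first delivery
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: the unique key rejected the insert
    return True
```

If `apply` raises, the insert is rolled back with it, so a later redelivery of the same event gets another chance to be processed.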
Database write fails after processor charges:
- The processor charged the card, but your DB write failed
- Recovery: reconciliation job compares your DB records with processor records on a schedule
- Any processor transaction with no corresponding DB record is flagged for investigation
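The comparison at the heart of a reconciliation job can be sketched with plain set operations. The sketch assumes both sides have already been loaded into dictionaries keyed by the processor's transaction id, with amounts in minor units; `reconcile` and the result keys are hypothetical names.

```python
def reconcile(db_records, processor_records):
    """Both arguments map processor transaction id -> amount in minor units."""
    db_ids = set(db_records)
    processor_ids = set(processor_records)
    return {
        # Charged at the processor but unknown to us: the dangerous case.
        "missing_in_db": sorted(processor_ids - db_ids),
        # Recorded by us but unknown to the processor.
        "missing_at_processor": sorted(db_ids - processor_ids),
        # Present on both sides but the amounts disagree.
        "amount_mismatch": sorted(
            tx for tx in db_ids & processor_ids
            if db_records[tx] != processor_records[tx]
        ),
    }
```

For example, `reconcile({"tx-1": 100}, {"tx-1": 100, "tx-2": 250})` flags `tx-2` under `missing_in_db` for investigation.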
Partial capture failure:
- Authorization succeeded, capture failed
- Don’t leave an uncaptured auth open indefinitely (authorizations typically expire after about 7 days, though the window varies by card network and processor)
- Retry capture; if repeatedly fails, void the authorization and notify merchant
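The retry-then-void policy can be sketched as follows. `capture`, `void`, and `notify_merchant` are hypothetical callables injected by the caller, and the `VOIDED` result is an assumed label that does not appear in the state diagram earlier in this article.

```python
def capture_or_void(capture, void, notify_merchant, max_attempts=3):
    """Try to capture a few times; release the hold if capture keeps failing."""
    for _ in range(max_attempts):
        if capture():  # capture() returns True on success
            return "CAPTURED"
    # Give the customer's funds back rather than waiting for the auth to expire.
    void()
    notify_merchant("capture failed; authorization voided")
    return "VOIDED"
```

In a real system each attempt would be spaced out and driven by a job queue rather than a tight loop, and the capture call would carry its own idempotency key.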
- Why not use an RDBMS for everything? Payments need ACID guarantees, so Postgres is the right choice for payment and ledger data. Redis suits idempotency keys (speed, TTL). A columnar store serves analytics and reporting.
- How do you handle refunds? Refunds are new ledger entries (reverse the original). Partial refunds: partial negative entries. Refund processing is similar to charge — idempotent, state-machined.
- How do you reconcile at end of day? Settlement files from processors compared against your ledger. Any discrepancy triggers investigation. This is standard operations in fintech.
- What if Stripe is down? Failover to a secondary processor (Adyen, Braintree) is possible but complex — different APIs, different tokenization. Most teams accept Stripe downtime rather than build multi-processor fallback.
- Fraud detection placement: Pre-auth scoring (fast, rules-based: block obviously fraudulent requests before sending to processor). Post-auth scoring (ML-based, async: flag for review, trigger dispute if fraud confirmed after settlement).