Overview
When you click the Buy button on Amazon or any other e-commerce website, a complex payment processing system springs into action. This case study explores how money moves through a payment system, from the initial purchase to final settlement.
Payment Flow Architecture
Let’s trace how a payment moves through the system step by step:
1. Payment Event Generation
User Action
When a user clicks the “Buy” button, a payment event is generated and sent to the payment service.
The payment service persists the event before acting on it, in order to:
- Ensure no payment request is lost
- Enable recovery from failures
- Maintain audit trail
- Support dispute resolution
2. Payment Order Processing
Order Decomposition
A single payment event may contain multiple payment orders. For example, you might buy products from several sellers in one checkout. The payment service breaks down the event into individual payment orders, one per seller.
Each payment order is then executed through an external payment service provider (PSP), for example:
- Stripe
- PayPal
- Adyen
- Square
- Braintree
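The order decomposition step described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the `LineItem` and `PaymentOrder` types, their field names, and the `split_into_orders` helper are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LineItem:  # hypothetical shape of one purchased item
    seller_id: str
    amount: Decimal


@dataclass(frozen=True)
class PaymentOrder:  # hypothetical shape of one per-seller order
    checkout_id: str
    seller_id: str
    amount: Decimal


def split_into_orders(checkout_id: str, items: list[LineItem]) -> list[PaymentOrder]:
    """Group a checkout's line items by seller and emit one order per seller."""
    totals: dict[str, Decimal] = defaultdict(Decimal)  # Decimal() starts at 0
    for item in items:
        totals[item.seller_id] += item.amount
    # Sort by seller ID so the output order is deterministic.
    return [PaymentOrder(checkout_id, seller, total) for seller, total in sorted(totals.items())]


orders = split_into_orders(
    "checkout-1",
    [
        LineItem("seller-a", Decimal("10.00")),
        LineItem("seller-b", Decimal("5.50")),
        LineItem("seller-a", Decimal("2.25")),
    ],
)
```

Amounts use `Decimal` rather than floats so that sums of currency values stay exact.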
3. Balance Updates
Wallet Update
After successful payment execution:
- Payment service calls the wallet service
- Wallet service updates the seller’s balance
- Updated balance is persisted in the wallet database
4. Settlement Process
Key Components Explained
Payment Service
Role: Orchestrates the entire payment flow
Responsibilities:
- Receive and validate payment events
- Break down events into individual orders
- Coordinate with payment executor, wallet, and ledger
- Handle payment state management
- Manage retries and error handling
Design considerations:
- Idempotency: Same payment request should not be processed twice
- State machine: Track payment through various states (pending, processing, completed, failed)
- Error handling: Gracefully handle failures and timeouts
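The state machine mentioned above might look like the sketch below. The four states come from the text; the transition table is an assumption (for instance, that `completed` and `failed` are terminal), and `transition` and `InvalidTransition` are hypothetical names.

```python
from enum import Enum


class PaymentState(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transition table: terminal states allow no further moves.
ALLOWED = {
    PaymentState.PENDING: {PaymentState.PROCESSING, PaymentState.FAILED},
    PaymentState.PROCESSING: {PaymentState.COMPLETED, PaymentState.FAILED},
    PaymentState.COMPLETED: set(),
    PaymentState.FAILED: set(),
}


class InvalidTransition(Exception):
    """Raised when a payment is asked to move to a state it cannot reach."""


def transition(current: PaymentState, target: PaymentState) -> PaymentState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

Rejecting illegal moves explicitly means a duplicate or late message cannot, for example, push a completed payment back into processing.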
Payment Executor
Role: Execute payments through external PSPs
Responsibilities:
- Store payment orders before processing
- Integrate with multiple PSPs (Stripe, PayPal, etc.)
- Handle PSP-specific protocols and formats
- Manage payment method details securely
- Retry failed payments with exponential backoff
Security requirements:
- Never store raw credit card numbers (use tokens)
- PCI DSS compliance
- Secure communication with PSPs (TLS)
- Encrypt sensitive data at rest
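Retrying with exponential backoff, listed among the executor's responsibilities, could be sketched like this. The `TransientPSPError` type, the default delays, and the use of full jitter are assumptions for illustration; in a real integration each retry would also carry the same idempotency key so the PSP can detect duplicates.

```python
import random
import time


class TransientPSPError(Exception):
    """Hypothetical error type for failures that are safe to retry."""


def call_with_backoff(operation, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientPSPError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Upper bound doubles each attempt: 0.5 s, 1 s, 2 s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time up to the bound to avoid retry bursts.
            sleep(random.uniform(0, delay))
```

The `sleep` parameter is injectable only so the function can be exercised in tests without real waiting.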
Wallet Service
Role: Manage account balances for all parties
Responsibilities:
- Track balances for sellers and buyers
- Process debits and credits atomically
- Prevent negative balances
- Support multiple currencies
- Handle holds and reservations
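Two of these responsibilities, atomic debits and credits and the ban on negative balances, can be illustrated with a database transaction plus a CHECK constraint. The sketch below uses Python's built-in `sqlite3` as a stand-in so it runs on its own; the table layout and the `transfer` helper are assumptions, not the system's actual schema.

```python
import sqlite3


def open_wallet_db():
    """Create an in-memory wallet table whose balances can never go below zero."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wallet ("
        " account_id TEXT PRIMARY KEY,"
        " balance_cents INTEGER NOT NULL CHECK (balance_cents >= 0))"
    )
    return conn


def transfer(conn, from_account, to_account, amount_cents):
    """Move funds atomically; return False and change nothing if it would overdraw."""
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE wallet SET balance_cents = balance_cents - ? WHERE account_id = ?",
                (amount_cents, from_account),
            )
            conn.execute(
                "UPDATE wallet SET balance_cents = balance_cents + ? WHERE account_id = ?",
                (amount_cents, to_account),
            )
        return True
    except sqlite3.IntegrityError:  # the CHECK constraint rejected a negative balance
        return False


conn = open_wallet_db()
with conn:
    conn.executemany("INSERT INTO wallet VALUES (?, ?)", [("buyer", 1000), ("seller", 0)])
```

Storing amounts as integer cents avoids floating-point rounding; letting the database enforce the constraint means no code path can bypass it.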
Ledger Service
Role: Maintain immutable record of all financial transactions
Characteristics:
- Append-only: Never update or delete entries
- Double-entry bookkeeping: Every transaction has equal debits and credits
- Audit trail: Complete history of all money movements
- Reconciliation: Match internal records with bank statements
Wallet vs. ledger:
- Wallet = current state (“How much money do I have now?”)
- Ledger = complete history (“How did I get this balance?”)
- Separation enables independent scaling and optimization
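The append-only and double-entry properties above can be sketched with a small in-memory ledger. The `Entry` and `Ledger` classes are hypothetical, and the sign convention in `balance` (credits minus debits) is an assumption chosen for the example.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Entry:
    account: str
    debit: Decimal = Decimal("0")
    credit: Decimal = Decimal("0")


class Ledger:
    def __init__(self):
        self._transactions = []  # append-only: entries are never updated or deleted

    def post(self, txn_id, entries):
        """Record a transaction only if its debits and credits are equal."""
        debits = sum((e.debit for e in entries), Decimal("0"))
        credits = sum((e.credit for e in entries), Decimal("0"))
        if debits != credits:
            raise ValueError(f"unbalanced transaction {txn_id}: {debits} != {credits}")
        self._transactions.append((txn_id, tuple(entries)))

    def balance(self, account):
        """Derive an account's balance from history (assumed: credits minus debits)."""
        total = Decimal("0")
        for _, entries in self._transactions:
            for entry in entries:
                if entry.account == account:
                    total += entry.credit - entry.debit
        return total


ledger = Ledger()
ledger.post("txn-1", [Entry("buyer", debit=Decimal("25.00")), Entry("seller", credit=Decimal("25.00"))])
```

Because every balance can be recomputed from the entries, the ledger can be used to check the wallet's current-state view.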
10 Principles for Resilient Payment Systems
Based on Shopify’s experience processing billions in payments:
1. Lower Timeouts
Problem: Default timeout of 60 seconds is too long
Solution:
- Read timeout: 5 seconds
- Write timeout: 1 second
2. Install Circuit Breakers
Problem: Cascading failures when downstream services fail
Solution: Use circuit breakers to stop calling failing services
Example: Shopify’s Semian library protects:
- Net::HTTP
- MySQL
- Redis
- gRPC services
Circuit breaker states:
- Closed: Normal operation
- Open: Stop calling service (return cached/default response)
- Half-open: Test if service recovered
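The three states above can be sketched in a few lines. This is a generic illustration, not Semian's implementation or API: the class name, thresholds, and the injectable `clock` are all assumptions, and the half-open state here lets any caller through rather than a single trial request.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a service whose circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # enough time has passed to test the service again
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("downstream call skipped")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures while closed, opens the circuit.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would typically catch `CircuitOpenError` and return a cached or default response, as the Open state describes.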
3. Capacity Management
Formula (Little’s Law): throughput = queue size / average processing time
Example:
- 50 requests in queue
- 100ms average processing time
- Throughput = 50 / 0.1 s = 500 requests/second
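Assuming the relationship behind these figures is Little's Law, the same arithmetic can be turned around for capacity planning: given a target throughput, how many requests must be in flight at once? Both helper names below are hypothetical.

```python
def throughput_per_second(queue_size: int, avg_processing_ms: float) -> float:
    """Requests completed per second when queue_size requests are in flight."""
    return queue_size * 1000 / avg_processing_ms


def required_queue_size(target_rps: float, avg_processing_ms: float) -> float:
    """In-flight requests needed to sustain target_rps at the given latency."""
    return target_rps * avg_processing_ms / 1000


example_throughput = throughput_per_second(50, 100)
example_queue = required_queue_size(500, 100)
```

With the numbers from the example, 50 in-flight requests at 100 ms each sustain 500 requests per second, and sustaining 500 requests per second at 100 ms requires 50 in flight.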
4. Monitoring & Alerting
Four Golden Signals (from Google SRE):
- Latency: How long requests take
- Traffic: How many requests you’re getting
- Errors: Rate of failed requests
- Saturation: How full your service is
5. Structured Logging
Requirements:
- Centralized logging system
- Structured format (JSON)
- Easily searchable
- Correlated by request ID
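A minimal sketch of JSON logs correlated by request ID, using only Python's standard `logging` module. The field names in the payload and the `JsonFormatter` class are assumptions; shipping the output to a centralized system is out of scope here.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # request_id is attached by the caller through the `extra` argument.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

request_id = str(uuid.uuid4())  # generated once per incoming request
logger.info("payment order created", extra={"request_id": request_id})
```

Passing the same `request_id` on every log call made while handling a request is what makes the entries searchable as one trace.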
6. Use Idempotency Keys
Problem: Network failures can cause duplicate payments
Solution: Client provides a unique idempotency key (for example, a ULID) with each request
Why ULID over UUID?
- Lexicographically sortable
- More compact
- Contains timestamp
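The server side of the idempotency-key approach can be sketched as follows. Several things here are assumptions: the `IdempotentProcessor` class and its `charge` callable are hypothetical, an in-memory dict stands in for a shared store, and the key is a UUID only because Python's standard library has no ULID generator.

```python
import threading
import uuid


class IdempotentProcessor:
    def __init__(self, charge):
        self._charge = charge  # hypothetical callable that performs the real charge
        self._results = {}     # stand-in for a shared, durable key store
        self._lock = threading.Lock()

    def process(self, idempotency_key, request):
        """Run the charge once per key; replay the stored result on repeats."""
        # Holding the lock during the charge keeps the sketch simple but serializes calls.
        with self._lock:
            if idempotency_key in self._results:
                return self._results[idempotency_key]
            result = self._charge(request)
            self._results[idempotency_key] = result
            return result


charges = []

def fake_charge(request):
    charges.append(request)
    return {"status": "charged", "amount": request["amount"]}

processor = IdempotentProcessor(fake_charge)
key = str(uuid.uuid4())  # the client generates the key once and reuses it on retry
first = processor.process(key, {"amount": 100})
second = processor.process(key, {"amount": 100})  # simulated retry after a timeout
```

The retry returns the stored result instead of charging again, which is the property that prevents duplicate payments.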
7. Reconciliation
Purpose: Ensure internal records match bank/PSP statements
Process:
- Receive settlement file from PSP/bank
- Compare with internal ledger
- Identify discrepancies
- Investigate and resolve breaks
- Store results in database
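The comparison and discrepancy steps of this process can be sketched as a simple diff between two mappings of transaction ID to amount. The input shape, the `reconcile` helper, and the three break categories are assumptions for illustration; real settlement files vary by PSP.

```python
from decimal import Decimal


def reconcile(ledger, settlement):
    """Compare internal ledger amounts with a settlement file and list the breaks."""
    breaks = {"missing_in_settlement": [], "missing_in_ledger": [], "amount_mismatch": []}
    for txn_id, amount in ledger.items():
        if txn_id not in settlement:
            breaks["missing_in_settlement"].append(txn_id)
        elif settlement[txn_id] != amount:
            # Keep both amounts so an investigator can see the difference.
            breaks["amount_mismatch"].append((txn_id, amount, settlement[txn_id]))
    for txn_id in settlement:
        if txn_id not in ledger:
            breaks["missing_in_ledger"].append(txn_id)
    return breaks


# Made-up sample data for illustration only.
internal = {"t1": Decimal("10.00"), "t2": Decimal("20.00"), "t3": Decimal("5.00")}
from_psp = {"t1": Decimal("10.00"), "t2": Decimal("19.50"), "t4": Decimal("7.00")}
report = reconcile(internal, from_psp)
```

An empty report means the two sides agree; anything else is a break to investigate and resolve.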
8. Load Testing
Strategy: Regularly simulate high-volume scenarios
Shopify’s Approach:
- Simulate flash sales (Black Friday, Cyber Monday)
- Test with 10x normal traffic
- Measure latency at different percentiles (p50, p95, p99)
- Identify bottlenecks before they hit production
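Measuring latency at several percentiles, as listed above, takes only a few lines with Python's `statistics` module. The helper name and the synthetic samples are assumptions made for the example; a load-testing tool would normally report these figures itself.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50, p95 and p99 of the given latency samples (milliseconds)."""
    # n=100 yields 99 cut points; cuts[i] is the (i + 1)th percentile.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic samples for illustration only: mostly fast requests with a slow tail.
samples = [20] * 90 + [200] * 9 + [1500]
result = latency_percentiles(samples)
```

The median of these samples stays low while the higher percentiles expose the slow tail, which is why averages alone are not enough when load testing.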
9. Incident Management
Three Key Roles:
- Incident Manager on Call (IMOC): Coordinates response
- Support Response Manager (SRM): Handles customer communication
- Service Owners: Fix the technical issue
Incident response flow:
- Detect → Alert → Assemble team → Diagnose → Fix → Communicate → Document
10. Incident Retrospectives
Three Questions:
- What exactly happened? (Timeline of events)
- What incorrect assumptions did we hold? (Root cause)
- What can we do to prevent this? (Action items)
Design Tradeoffs
Consistency vs. Availability
Challenge: In a distributed payment system, network partitions can occur
Decision: Choose consistency over availability
Rationale:
- Better to fail a payment than process it twice
- Financial accuracy is non-negotiable
- Temporary unavailability is acceptable; incorrect balances are not
Implementation:
- Use distributed transactions (2PC, Saga pattern)
- Strong consistency for wallet updates
- Eventual consistency acceptable for analytics/reporting
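The Saga pattern named above replaces one distributed transaction with a sequence of local steps, each paired with a compensating action that undoes it. A minimal sketch, with `run_saga` and the step functions as hypothetical names:

```python
class SagaFailed(Exception):
    """Raised after a step fails and earlier steps have been compensated."""


def run_saga(steps):
    """Run (action, compensation) pairs in order; on failure, undo completed steps in reverse."""
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception as exc:
            for undo in reversed(completed):
                undo()
            raise SagaFailed(str(exc)) from exc
        completed.append(compensation)


log = []

def charge_card():
    log.append("charge")

def refund_card():
    log.append("refund")

def credit_wallet():
    raise RuntimeError("wallet unavailable")  # simulated failure for the example

def debit_wallet():
    log.append("debit wallet")

try:
    run_saga([(charge_card, refund_card), (credit_wallet, debit_wallet)])
except SagaFailed:
    pass  # log is now ["charge", "refund"]: the charge was compensated
```

In this example the wallet credit fails after the card was charged, so the saga issues the refund. A production saga would also persist its progress so that compensation survives a crash, which this sketch does not attempt.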
Sync vs. Async Processing
Synchronous (for critical path):
- Payment validation
- Wallet debits/credits
- Ledger recording
Asynchronous (off the critical path):
- Email notifications
- Analytics updates
- Fraud scoring (post-authorization)
- Webhooks to merchants
Build vs. Buy (PSP Integration)
Don’t build:
- Credit card processing (PCI DSS complexity)
- Fraud detection (requires massive data)
- Bank integrations (regulatory requirements)
Do build:
- Payment orchestration layer
- Wallet and ledger systems
- Business logic and rules
- Reconciliation systems
Security Considerations
PCI DSS Compliance
Never store raw credit card numbers. Use tokenization provided by PSPs.
Encryption
Encrypt all sensitive data at rest and in transit (TLS 1.2+).
Fraud Detection
Integrate with fraud detection services. Monitor for suspicious patterns.
Rate Limiting
Prevent brute force attacks on payment endpoints.
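One common way to rate-limit an endpoint is a token bucket, sketched below as an illustration rather than as the technique this system uses. The class name, parameters, and injectable `clock` are assumptions; a multi-instance deployment would keep the counters in a shared store instead of process memory.

```python
import time


class TokenBucket:
    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)  # start full so short bursts are allowed
        self._last = clock()

    def allow(self):
        """Return True and consume a token if the request may proceed."""
        now = self._clock()
        # Add tokens for the time elapsed since the last call, up to capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

Typically one bucket is kept per client or per card fingerprint, and a request that gets `False` is rejected with a "too many requests" response.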
Audit Logging
Log all payment actions with user, timestamp, and outcome.
Access Control
Strict RBAC for payment system access. Separate read/write permissions.
Handling Failures
Double Payment Prevention
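Problem: Network timeout after the PSP charges the card but before the response is received
Solution: Retry with the same idempotency key, so that a charge that already succeeded is not applied a second time
Wallet Update Failure
Problem: Payment succeeds at the PSP but the wallet update fails
Solution: Use distributed transactions (Saga pattern), in which a failed step triggers compensating actions for the steps already completed
Key Technologies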
Database
PostgreSQL with ACID transactions for wallet and ledger
Message Queue
Kafka for async processing (notifications, webhooks)
Cache
Redis for idempotency keys and rate limiting
Monitoring
Prometheus + Grafana for metrics, PagerDuty for alerts
Summary
Building a resilient payment system requires:
Strong Consistency
Use ACID transactions for financial operations. Never compromise on data accuracy.
Comprehensive Logging
Maintain immutable ledger and detailed audit logs for compliance and debugging.
Payment systems are among the most critical systems in software engineering. Prioritize correctness and reliability over performance. It’s better to process payments slowly than to process them incorrectly.