Charity Donation App
Reading guide
This article is written for backend engineers preparing for system design interviews or building real-world payment-heavy systems.
By the end, you should be able to:
- Explain why a charity donation platform needs an intent-based payment model on top of Stripe PaymentIntents
- Sketch the end-to-end flow from “Click Donate” to eventual stats update
- Reason about idempotency, webhooks, reconciliation, and event-driven analytics under high load (10M+ donations in 3 days)
1. Overview
Summary
A high-volume charity donation platform using Stripe PaymentIntent and an internal intent-based state machine, with RabbitMQ for events and a reconciliation worker to guarantee no duplicate charges, no lost payments, and eventual consistency under load.
Scale assumptions
- Total volume: ~10M donations
- Duration: 3-day event
- Peak load: hundreds of QPS
- Design assumptions: horizontal scaling of the Donation API, Stripe handling payment throughput, queue-based decoupling for stats and notifications
Design goals
| Goal | Definition | Mechanism |
|---|---|---|
| No duplicate charges | At most one successful charge per donation intent. | Enforced via Stripe idempotency keys and an internal state machine. |
| No lost payments | Every intent reaches a terminal state (SUCCEEDED or FAILED). | Webhooks plus a reconciliation worker resolve PENDING/UNKNOWN against Stripe. |
| Eventual consistency | The system converges to a consistent state under failures or delayed webhooks. | State machine, webhooks, and reconciliation worker. |
| Minimal PCI exposure | Card data never touches our servers. | Client-side tokenization (Stripe Elements); minimal PCI scope. |
| Near real-time charity statistics | Per-charity totals updated and exposed at high QPS. | Event-driven updates via RabbitMQ and Redis cache. |

Functional requirements
Core business flows supported by the system.
| Requirement | Description | Notes |
|---|---|---|
| Create donation intent | User selects charity, amount, and payment method; system creates an internal intent and a Stripe PaymentIntent. | Returns intent ID and client secret to the client. |
| Confirm payment | User completes payment on the client (Stripe Elements). | Stripe handles tokenization, 3DS, and authorization; client receives success, failure, or pending. |
| Webhook handling | System receives Stripe webhooks (payment_intent.succeeded / payment_intent.payment_failed). | Updates intent status; publishes events for stats and notifications. |
| Charity stats | Per-charity totals (amount, count) updated eventually and exposed via API. | Read path cached (e.g. Redis) for high QPS. |
| Receipt / notification | User receives a receipt (e.g. email) after successful payment. | Sending is asynchronous and idempotent. |
| Reconciliation | System periodically resolves intents stuck in PENDING/UNKNOWN by querying Stripe. | Publishes reconciled events so stats and side effects stay consistent. |
| Multi-charity | Multiple charities supported; each donation tied to a charity. | Stats isolated per charity. |

Non-functional requirements
System guarantees under expected and peak load.
| Guarantee | Definition | Mechanism |
|---|---|---|
| No duplicate charges | At most one successful charge per donation intent. | Stripe idempotency keys and internal state machine. |
| No lost payments | Every intent eventually reaches SUCCEEDED or FAILED. | Webhooks plus reconciliation worker cover missed or delayed webhooks. |
| Scalability | System handles ~10M donations over 3 days. | Donation API and consumers scale horizontally; payment throughput delegated to Stripe. |
| Latency | Intent creation and redirect/confirmation stay within acceptable latency. | Heavy work (stats, email) off the hot path. |
| PCI / security | Card data never touches our servers. | Client-side tokenization (Stripe Elements); minimal PCI scope. |
| Observability | Monitoring and alerts for success rate, failure rate, UNKNOWN rate, webhook/queue lag, reconciliation backlog. | |
| Availability | Payment and intent creation remain available under expected load. | Dependencies (DB, Stripe, queue) have clear failure modes and mitigations. |
2. System Architecture
Core components and their responsibilities in the donation workflow.
| Component | Type | Responsibility | Notes |
|---|---|---|---|
| Client (Web / iOS / Android) | Client | Collects donation details and payment method; confirms payment via Stripe SDK. | No card data on our servers. |
| Donation API Service | Core Service | Creates and tracks donation intents; issues Stripe PaymentIntents; handles webhooks; publishes events. | Central orchestration; horizontal scaling. |
| Stripe | External | Payment provider: PaymentIntents, tokenization, webhooks, idempotency. | |
| Relational Database | Storage | Primary source of truth for donation_intents, webhook_events, charity_stats. | Single source of truth; scales with Donation API. |
| RabbitMQ | Async | Message broker / event bus for donation.succeeded and downstream consumers. | Decouples payment path from stats and notifications. |
| Notification Consumer | Async | Consumes events; sends receipts (e.g. email) with idempotency. | |
| Reconciliation Worker | Async | Scans PENDING/UNKNOWN intents; queries Stripe to finalize state. | Publishes reconciled events so stats and side effects stay consistent. |
| Redis | Storage | Stats cache for high-QPS charity totals. | Read path for per-charity totals. |
[Figure: high-level architecture diagram]
3. Payment Model Overview (Stripe)
Capabilities provided by Stripe
- PaymentIntents API: Create and confirm payment intents; single primitive for charge lifecycle.
- Client-side tokenization (Stripe Elements): Collect card details securely without touching our servers.
- Webhooks: Real-time events for payment_intent.succeeded and payment_intent.payment_failed.
- Idempotency Keys: At-most-once charge per key; we pass a key per donation intent.
- Transaction lookup APIs: Query Stripe to resolve PENDING/UNKNOWN and reconcile state.
How this system uses them
Stripe
- Stripe Elements: Minimizes PCI exposure; card data never touches our servers.
- Stripe PaymentIntent: External payment execution primitive; the server creates it via the API and the client confirms it.
Our system
- Internal donation_intents table: Business-level intent state anchor; state machine lives here.
High-level idea
Money Flow
Stripe PaymentIntent is the source of truth.
Business Flow
Internal donation_intents table is the state machine.
Webhooks + a reconciliation worker bridge the two worlds and guarantee eventual consistency.
4. End-to-End Payment Flow
4.1 Sequence Diagram
Create Intent
- Client creates donation intent
- Server creates Stripe PaymentIntent
Payment Confirmation
- Client confirms payment
Webhook & Publish
- Stripe sends webhook
- System updates state + publishes event
Reconcile
- Reconciliation worker resolves pending/unknown states
System guarantees
- No duplicate charges: Idempotency keys and webhook deduplication ensure at-most-once charge per intent.
- No lost payments: Webhooks plus a reconciliation worker cover failures and long-running payments.
- Eventual consistency: Stripe and internal state stay in sync via webhooks and reconciliation.
4.2 Detailed Flow (Stripe-based)
Phases
Intent Creation
Type: SYNC
Flow: Client → API → DB → Stripe → Client
Main Flow
- Client sends charityId, amount_cents, currency, email, and an optional request_id for idempotency.
- Server validates input; uses Redis SETNX to lock request_id (5–10s) to prevent duplicate intent creation.
- Insert row into donation_intents with status = CREATED; obtain intentId.
- Create Stripe PaymentIntent with amount, currency, metadata: { intentId }.
- Return intentId and client_secret to the client.
Guarantee: At-most-once intent creation
- request_id → same intentId + client_secret (idempotent response).
- Redis SETNX lock prevents duplicate inserts for the same request_id.
API / Webhook involved: POST /v1/donations/intent
DB writes: donation_intents (intent_id, status=CREATED, charity_id, amount_cents, stripe_payment_intent_id)
Failure & retry
- Client retries with same request_id; server returns existing intent.
- Stripe PaymentIntent create is retried with backoff, reusing the same per-intent idempotency key.
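The idempotent-response rule above can be illustrated with a minimal Java sketch. It makes several assumptions that are not part of the original design: an in-memory `ConcurrentHashMap` stands in for both the Redis lock and the `donation_intents` row, the class and field names (`IntentCreationSketch`, `CreatedIntent`) are hypothetical, and the client secret is a placeholder string rather than a value returned by Stripe.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: an in-memory map stands in for the Redis lock and the
// donation_intents table; no real Stripe call is made.
final class IntentCreationSketch {

    // Response returned to the client; names are illustrative only.
    static final class CreatedIntent {
        final String intentId;
        final String clientSecret;
        final String charityId;
        final long amountCents;

        CreatedIntent(String intentId, String clientSecret, String charityId, long amountCents) {
            this.intentId = intentId;
            this.clientSecret = clientSecret;
            this.charityId = charityId;
            this.amountCents = amountCents;
        }
    }

    private final Map<String, CreatedIntent> byRequestId = new ConcurrentHashMap<>();

    // Returns the existing intent for a repeated request_id, otherwise creates one.
    CreatedIntent createIntent(String requestId, String charityId, long amountCents) {
        if (amountCents <= 0) {
            throw new IllegalArgumentException("amount_cents must be positive");
        }
        // computeIfAbsent is atomic per key, so concurrent retries with the
        // same request_id observe a single creation.
        return byRequestId.computeIfAbsent(requestId, id -> {
            String intentId = UUID.randomUUID().toString();
            // Placeholder for the client_secret that Stripe would return.
            String clientSecret = "secret_for_" + intentId;
            return new CreatedIntent(intentId, clientSecret, charityId, amountCents);
        });
    }
}
```

Calling `createIntent` twice with the same `request_id` returns the same `intentId` and `clientSecret`, which is the behaviour a retrying client relies on. In a real deployment the map would be replaced by the database row plus the short-lived Redis lock described above.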
Payment Confirmation
Type: SYNC
Flow: Client → Stripe (no server in path)
Main Flow
- User enters card details in Stripe-hosted fields (PCI-safe); client calls stripe.confirmCardPayment(client_secret).
- Stripe handles tokenization, 3DS (SCA), authorization and capture.
- Client receives immediate success/failure or "processing" (finalized later via webhook).
- No server call in this step; Donation Service does not see card data.
Guarantee: No duplicate confirmations
- Stripe deduplicates by PaymentIntent; same confirm returns same result.
- Client can send optional request_id for idempotent UI retries.
API / Webhook involved: Stripe.js confirmCardPayment (client-side)
DB writes: None in this phase (state updated in Webhook Processing).
Failure & retry
- Client retries confirmCardPayment; Stripe returns same result.
- For "processing", wait for webhook; no server retry in this phase.
Webhook Processing
Type: ASYNC
Flow: Stripe → API → DB + RabbitMQ
Main Flow
- Stripe sends POST to /v1/webhooks/stripe: payment_intent.succeeded, payment_intent.payment_failed, payment_intent.processing.
- Deduplicate by Stripe event_id (store in webhook_events or Redis); Redis SETNX lock per payment_intent (5–10s).
- Read intentId from PaymentIntent metadata; update donation_intents to SUCCEEDED or FAILED.
- Publish domain event to RabbitMQ (e.g. donation.succeeded) for stats and email consumer.
Guarantee: At-most-once webhook processing
- event_id stored; at-most-once processing per event.
- Redis lock per payment_intent (5–10s) prevents duplicate handler runs.
API / Webhook involved: POST /v1/webhooks/stripe (payment_intent.succeeded / payment_intent.payment_failed / payment_intent.processing)
DB writes: webhook_events (event_id); donation_intents (status); outbox if used.
Failure & retry
- Stripe retries webhook with exponential backoff.
- Return 200 after processing so Stripe does not redeliver the event.
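The deduplication step in this phase can be sketched in Java as follows. This is an illustration under stated assumptions, not the production handler: an in-memory set stands in for the unique `stripe_event_id` constraint on `webhook_events`, a map stands in for `donation_intents.status`, the class name `WebhookDedupSketch` is hypothetical, and signature verification and event publishing are omitted.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of webhook deduplication plus a guarded status update.
final class WebhookDedupSketch {

    // Stands in for the unique stripe_event_id constraint on webhook_events.
    private final Set<String> seenEventIds = ConcurrentHashMap.newKeySet();
    // Stands in for donation_intents.status keyed by intent_id.
    private final Map<String, String> intentStatus = new ConcurrentHashMap<>();

    void registerIntent(String intentId) {
        intentStatus.put(intentId, "PAYMENT_PENDING");
    }

    String statusOf(String intentId) {
        return intentStatus.get(intentId);
    }

    // Returns true only if this delivery changed the intent's state.
    boolean handle(String eventId, String eventType, String intentId) {
        if (!seenEventIds.add(eventId)) {
            return false; // duplicate delivery of an event already recorded
        }
        String target;
        if ("payment_intent.succeeded".equals(eventType)) {
            target = "SUCCEEDED";
        } else if ("payment_intent.payment_failed".equals(eventType)) {
            target = "FAILED";
        } else {
            return false; // e.g. payment_intent.processing: no terminal change
        }
        // Conditional replace: only a non-terminal intent can be finalized.
        return intentStatus.replace(intentId, "PAYMENT_PENDING", target);
    }
}
```

A second delivery of the same event ID is ignored, and a later conflicting event cannot overwrite a state that is already final. A domain event such as donation.succeeded would be published only when `handle` returns true.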
Reconciliation / Convergence
Type: RECONCILE
Flow: Worker → Stripe API → DB + RabbitMQ
Main Flow
- Reconciliation worker periodically scans donation_intents with status PENDING or UNKNOWN.
- Queries the Stripe API for each PaymentIntent and maps its current status to an internal final state (succeeded, failed, expired).
- Updates donation_intents to SUCCEEDED / FAILED / EXPIRED; publishes reconciled events to RabbitMQ.
- Downstream consumers (stats, email) process events idempotently; DB and Redis converge.
Guarantee: Read-only convergence, no duplicate charges
- The worker only reads from Stripe and never creates or confirms a payment; it updates the internal row only if the state is still PENDING/UNKNOWN.
- Consumers use event_id or intent_id for dedupe; no duplicate charges from reconciliation.
API / Webhook involved: Stripe API: retrieve PaymentIntent (server-side)
DB writes: donation_intents (status); charity_stats; Redis cache (per-charity totals).
Failure & retry
- Worker retries on next run; Stripe API retries with backoff.
- Reconciliation is read-only against Stripe; no duplicate charges.
Guarantee mapping
Which phases enforce each guarantee
- No duplicate charges: Intent Creation (request_id + Redis lock), Webhook Processing (event_id + Redis lock per payment_intent).
- No lost payments: Webhook Processing (Stripe retries); Reconciliation (worker resolves PENDING/UNKNOWN and long-running processing).
- Eventual consistency: Webhook Processing (updates DB + publish); Reconciliation (worker converges state); consumers (idempotent stats + Redis).
5. Payment State Machine
Visual flow
PENDING → PROCESSING → SUCCEEDED / FAILED
State definitions
- PENDING: Intent created; payment not yet attempted. Created when client requests an intent with charityId, amount, and optional request_id. Redis SETNX lock for idempotency.
- PROCESSING: Payment in progress; waiting for Stripe and webhook. Intent stays here until Stripe confirms success, failure, or timeout. Webhook or reconciliation will transition to final state.
- SUCCEEDED: Payment succeeded. Final state. Stats and side effects (e.g. notifications) are driven by events. No further transitions allowed.
- FAILED: Payment failed or expired. Final state. failure_reason is recorded. No further charge attempts.
Transition table
| Initial state | Event | Target state | Action |
|---|---|---|---|
| PENDING | Client confirms payment | PROCESSING | Call Stripe; create/update PaymentIntent |
| PROCESSING | Webhook: payment_intent.succeeded | SUCCEEDED | Update intent; publish donation.succeeded |
| PROCESSING | Webhook: payment_intent.payment_failed | FAILED | Update intent; set failure_reason |
| PROCESSING | Reconciliation (timeout) | FAILED | Resolve via Stripe API |
Logic highlights
Transition guard
Only when the current state is PROCESSING (or UNKNOWN for reconciliation) do we accept an update to a final state (SUCCEEDED or FAILED). Use WHERE status IN ('PROCESSING', 'UNKNOWN') and an atomic update so only one transition wins.
Idempotency
The same request_id returns the same intent_id and client_secret. A Redis SETNX lock prevents duplicate intent creation. Once in SUCCEEDED or FAILED, no further charge is possible.
Concurrency handling
Webhook and reconciliation may both try to update the same intent. Use atomic updates and publish events after DB commit for eventual consistency.
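The transition guard can be sketched in Java with a compare-and-set, which plays the same role as the conditional `WHERE status IN (...)` update in SQL. This is a minimal in-memory illustration: the class name `IntentStateMachineSketch` is hypothetical, and the PROCESSING → UNKNOWN transition is an assumption about how an intent enters UNKNOWN (for example after a Stripe API timeout).

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical in-memory model of one intent's state; compareAndSet mirrors a
// conditional UPDATE where only one concurrent writer can win.
final class IntentStateMachineSketch {

    enum Status { PENDING, PROCESSING, UNKNOWN, SUCCEEDED, FAILED }

    private final AtomicReference<Status> status = new AtomicReference<>(Status.PENDING);

    Status current() {
        return status.get();
    }

    // PENDING -> PROCESSING when the client confirms payment.
    boolean startProcessing() {
        return status.compareAndSet(Status.PENDING, Status.PROCESSING);
    }

    // Assumed transition: PROCESSING -> UNKNOWN, e.g. after a Stripe timeout.
    boolean markUnknown() {
        return status.compareAndSet(Status.PROCESSING, Status.UNKNOWN);
    }

    // Finalizes only from PROCESSING or UNKNOWN; returns false if another
    // writer (webhook or reconciliation) already finalized the intent.
    boolean finalizeAs(Status terminal) {
        if (terminal != Status.SUCCEEDED && terminal != Status.FAILED) {
            throw new IllegalArgumentException("terminal must be SUCCEEDED or FAILED");
        }
        return status.compareAndSet(Status.PROCESSING, terminal)
                || status.compareAndSet(Status.UNKNOWN, terminal);
    }
}
```

If a webhook handler and the reconciliation worker both call `finalizeAs`, exactly one call returns true, and only that caller should publish the follow-up event.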
6. Reconciliation Flow
Reconciliation is responsible for cleaning up long-lived PENDING / UNKNOWN intents when:
- Webhooks are not received
- Stripe API calls time out
- PAYMENT_PENDING has lasted longer than a safe threshold
Pending/Unknown intents → Reconciliation Worker → Query Stripe API → Stripe response → Update donation_intents → Publish reconciled event → Final: SUCCEEDED / FAILED / EXPIRED
When Stripe returns processing, the worker only updates next_reconcile_at and retry_count (no event published); see §6.2.
6.1 Reconciliation Sequence (Stripe + RabbitMQ)
6.2 When Stripe Returns processing (Long-Running Payment)
Principle
When Stripe returns `processing`, do **not** mark the intent as failed. Use **exponential backoff** to reschedule checks and a **final TTL** (e.g. 24h); only then mark as **EXPIRED** if still unresolved.
If the Reconciliation Worker queries Stripe and the PaymentIntent is still processing (or in API terms, not yet succeeded or failed), the payment is in a long-running state. Examples: async payment methods (wire transfer, Sofort, SEPA), bank fraud checks, or 3DS opened but not completed. The charge may complete minutes or hours later.
Standard handling: exponential backoff + final TTL
Step 1 — Preserve non-terminal state
Keep local status as PAYMENT_PENDING (or UNKNOWN). Only move to SUCCEEDED, FAILED, or EXPIRED when appropriate.
Step 2 — Exponential backoff reschedule
Set a next check time with increasing intervals (e.g. 5 min → 15 min → 30 min → 1 h → 4 h). Store this in next_reconcile_at on donation_intents. The worker only picks intents where next_reconcile_at <= NOW().
Step 3 — Retry counter
Increment retry_count each time you re-query and still get processing. Use it to compute the next interval and to cap retries or alert if abnormally high.
Step 4 — Final TTL expiration
Define a maximum wait (e.g. 24 hours). If Stripe still returns processing after that, mark the intent as EXPIRED. In practice Stripe usually resolves within 24h; for large amounts you may notify the user or support.
Example logic (Java):
```java
import java.time.Duration;
import java.time.Instant;

public void reconcileProcessingIntent(DonationIntent intent, PaymentIntent stripeIntent) {
    // Check if the intent is still processing
    // See Stripe docs: https://stripe.com/docs/payments/payment-intents/lifecycle
    if ("processing".equals(stripeIntent.getStatus())) {
        System.out.println("Processing intent " + intent.getId());

        // TODO: Calculate the next retry time with exponential backoff
        // e.g. 5m, 15m, 1h, 4h...
        Instant nextCheckTime = calculateNextRetry(intent.getRetryCount());

        // Check if we've exceeded the maximum TTL (e.g. 24 hours)
        boolean isExpired = intent.getCreatedAt()
                .plus(Duration.ofHours(MAX_TTL_HOURS))
                .isBefore(Instant.now());

        if (isExpired) {
            // TODO: Mark as EXPIRED or verify one last time
            donationIntentRepository.updateStatus(intent.getId(), IntentStatus.EXPIRED);
        } else {
            // Update the intent with the new retry time and status
            donationIntentRepository.update(intent.getId(), UpdateIntent.builder()
                    .nextReconcileAt(nextCheckTime)
                    .retryCount(intent.getRetryCount() + 1)
                    .lastStripeStatus("processing")
                    .build());
        }
    }
}
```
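The example above leaves `calculateNextRetry` as a TODO. One possible shape for it, assuming the stepped schedule from Step 2 (5 min → 15 min → 30 min → 1 h → 4 h) and the 24-hour TTL from Step 4, is sketched below. The class name `ReconcileBackoffSketch` and the extra `now` parameter are assumptions made so the logic can be tested with a fixed clock; they are not part of the original code.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for the reconciliation backoff and TTL checks.
final class ReconcileBackoffSketch {

    // Stepped schedule taken from Step 2; the last entry is reused once
    // retryCount runs past the end of the table.
    private static final Duration[] SCHEDULE = {
        Duration.ofMinutes(5),
        Duration.ofMinutes(15),
        Duration.ofMinutes(30),
        Duration.ofHours(1),
        Duration.ofHours(4)
    };

    // Maximum time an intent may stay unresolved (Step 4).
    static final Duration MAX_TTL = Duration.ofHours(24);

    // Delay before the next check, given how many checks already happened.
    static Duration delayFor(int retryCount) {
        int index = Math.min(Math.max(retryCount, 0), SCHEDULE.length - 1);
        return SCHEDULE[index];
    }

    // Value to store in next_reconcile_at.
    static Instant calculateNextRetry(int retryCount, Instant now) {
        return now.plus(delayFor(retryCount));
    }

    // True once the intent is older than the final TTL.
    static boolean isExpired(Instant createdAt, Instant now) {
        return createdAt.plus(MAX_TTL).isBefore(now);
    }
}
```

Passing the clock in as a parameter keeps the schedule deterministic in tests; the worker would call these helpers with `Instant.now()`.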
Why processing happens
- Async payment methods (wire, Sofort, SEPA, etc.) can take 1–3 business days.
- Bank delays (extra fraud or compliance checks).
- 3DS pending (user opened verification but did not complete or close).
Index for the worker
Use a composite index on (status, next_reconcile_at) so each run can efficiently select intents that are due for a check:
```sql
SELECT intent_id, stripe_payment_intent_id
FROM donation_intents
WHERE status IN ('PAYMENT_PENDING', 'UNKNOWN')
  AND next_reconcile_at <= NOW();
```
This avoids full table scans and keeps reconciliation bounded to intents that are due for a check. The schema in §8.1 includes retry_count, next_reconcile_at, and last_stripe_status for this flow.
7. Event-Driven Stats Update
After a donation succeeds, downstream updates should be event-driven (stats, notifications, audit):
Consumer responsibilities
- Deduplicate events (e.g. by intentId + SUCCEEDED).
- Retry on transient failures with backoff.
- Maintain eventual consistency between DB, cache, and side effects.
7.1 Summary Table (charity_stats): Real-Time vs Consistency
Principle
Use a dedicated summary table for per-charity totals; populate it with a strategy that balances real-time needs and consistency (e.g. async consumer + atomic updates + periodic calibration).
Why a separate table?
Although you could compute totals with SELECT SUM(amount_cents) FROM donation_intents WHERE charity_id = 1 AND status = 'SUCCEEDED', once donation volume reaches millions of rows (this design targets ~10M), running this aggregate on every read becomes slow and can overload the database. A pre-aggregated charity_stats table (see §8.3) gives fast reads for dashboards and APIs.
Schema (reference): charity_id (PK), total_amount_cents, donation_count, updated_at.
How the table is populated — three common approaches:
| Approach | Description | Pros | Cons |
|---|---|---|---|
| A. In-transaction sync update | In the same DB transaction that writes the donation/transaction, run UPDATE charity_stats SET total_amount_cents = total_amount_cents + :amount WHERE charity_id = :id. | Strong consistency; DB totals are always correct. | Row lock contention. A hot charity with many donations per second serializes on that single row; throughput drops. |
| B. Async message-queue driven (recommended) | Webhook only persists the transaction and publishes an event; a Stats Worker (RabbitMQ consumer) runs UPDATE charity_stats .... | Peak smoothing. Bursts of donations are processed in the background; the webhook path stays fast. | Slight delay (e.g. hundreds of ms), acceptable for dashboards. |
| C. Redis increment + periodic flush | All real-time increments go to Redis (e.g. HINCRBY); a worker every N minutes flushes Redis deltas into charity_stats. | Very low DB write load; good for extreme peaks (e.g. thousands of donations/sec). | More moving parts; still need a flush worker and calibration. |
Best practice: incremental atomic update
Always update the summary table with atomic in-place addition, not read-then-write.
Wrong (race-prone):
```sql
-- Read then write: another request can overwrite your update
SELECT total_amount_cents FROM charity_stats WHERE charity_id = 1;
-- application: new_total = old + 50
UPDATE charity_stats SET total_amount_cents = 150 WHERE charity_id = 1;
```
Correct (atomic):
```sql
UPDATE charity_stats
SET total_amount_cents = total_amount_cents + :amount_cents,
    donation_count = donation_count + 1,
    updated_at = NOW()
WHERE charity_id = :charity_id;
```
Eventually consistent safety net (reconciliation as calibrator)
No matter which approach you use, the summary table can drift from the transaction table (e.g. due to bugs or partial failures). A Reconciliation Worker should act as a calibrator:
- Periodically (e.g. daily) run: SELECT charity_id, SUM(amount_cents) AS total FROM donation_intents WHERE status = 'SUCCEEDED' GROUP BY charity_id.
- Compare with charity_stats and UPDATE charity_stats with the computed totals to correct drift.
This keeps the summary table eventually consistent with the source of truth (the transaction/intent table).
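The comparison step of the calibrator can be sketched in Java as a pure function over two maps: the totals recomputed from `donation_intents` and the totals currently stored in `charity_stats`. The class name `StatsCalibrationSketch` and the method `drift` are hypothetical, and loading the two maps from the database is left out.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical calibration helper: reports per-charity drift between the
// recomputed totals and the stored summary totals.
final class StatsCalibrationSketch {

    // Returns charity_id -> (recomputed - stored) for every charity whose
    // totals differ; charities that match are omitted.
    static Map<String, Long> drift(Map<String, Long> recomputed, Map<String, Long> stored) {
        Set<String> charityIds = new HashSet<>(recomputed.keySet());
        charityIds.addAll(stored.keySet());

        Map<String, Long> result = new TreeMap<>();
        for (String charityId : charityIds) {
            long delta = recomputed.getOrDefault(charityId, 0L)
                    - stored.getOrDefault(charityId, 0L);
            if (delta != 0) {
                result.put(charityId, delta);
            }
        }
        return result;
    }
}
```

A non-empty result is both the correction to apply and a useful alerting signal, since persistent drift usually points to a bug in the consumer path.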
Recommendation (summary)
- Structure: Keep a charity_stats table (§8.3).
- Population: Have the RabbitMQ consumer (Stats Worker) that already handles donation events asynchronously update charity_stats after updating Redis (approach B).
- Safety: Use atomic SET total_amount_cents = total_amount_cents + :inc (and same for donation_count).
- Calibration: Run a daily (or periodic) full SUM from the transaction/intent table and correct charity_stats to handle any residual drift.
7.2 Where to Update Redis: In the Consumer, Not the Webhook
Principle
Redis totals are updated in the message consumer, not in the Webhook. This gives a single writer for stats and eventual consistency without overloading the Webhook path.
Why not update Redis in the Webhook?
- Distributed transaction risk: If the Webhook updates the DB and then Redis, and Redis fails (e.g. network blip) after you have already returned 200 OK to Stripe, the Redis total is short by that donation until a reconciliation or calibration run. You cannot roll back the 200.
- Latency and throughput: Stripe holds the HTTP connection until the Webhook responds. Any extra work (Redis, push, etc.) in the Webhook slows the response and limits how many Webhooks you can handle per second.
- Heavy logic: Even with locks, doing stats and push in the Webhook makes the hot path complex and increases the chance of timeouts and retries.
Standard flow: Webhook as producer, consumer as single writer
| Step | Webhook (producer) | Stats consumer |
|---|---|---|
| 1 | Verify signature; persist intent status SUCCEEDED in DB. | — |
| 2 | Publish one message to RabbitMQ (e.g. charity_id, amount_cents, intent_id). | — |
| 3 | Return 200 OK to Stripe immediately. | — |
| 4 | — | Update Redis (e.g. HINCRBY charity:total:{charity_id} amount_cents). |
| 5 | — | Optionally trigger push (WebSocket / batch notification). |
| 6 | — | ACK the message only after Redis (and DB if applicable) update succeeds. |
If the consumer does not ACK (e.g. Redis is down), the broker will redeliver the message. No 200 has been sent for that work, so there is no “already committed” response to Stripe.
Failure handling in the consumer
- Redis down: Do not ACK. The queue will retry (e.g. after 5 seconds). After N failed attempts, send the message to a dead-letter queue (DLQ) and alert; a compensation job or operator can replay or fix once Redis is healthy.
- Duplicate delivery: If the consumer updates Redis but crashes before ACK, the same message can be processed again. Use idempotency: e.g. in Redis, SETNX processed:intent:{intent_id} with a TTL (e.g. 24 hours). If SETNX fails, treat the message as already processed: skip the increment and ACK to avoid double-counting.
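The duplicate-delivery rule can be sketched in Java as follows. This is a simplified illustration: an in-memory set stands in for the Redis `processed:intent:{intent_id}` keys (without the TTL), a map stands in for the per-charity Redis totals, and the class name `StatsConsumerSketch` is hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical idempotent stats consumer; in-memory structures stand in for Redis.
final class StatsConsumerSketch {

    // Stands in for the SETNX processed:intent:{intent_id} marker keys.
    private final Set<String> processedIntents = ConcurrentHashMap.newKeySet();
    // Stands in for the per-charity totals kept in Redis.
    private final Map<String, Long> totals = new ConcurrentHashMap<>();

    // Returns true if the donation was counted, false if it was a redelivery.
    // In both cases the caller can ACK the message.
    boolean onDonationSucceeded(String intentId, String charityId, long amountCents) {
        if (!processedIntents.add(intentId)) {
            return false; // already processed: skip the increment
        }
        totals.merge(charityId, amountCents, Long::sum);
        return true;
    }

    long totalFor(String charityId) {
        return totals.getOrDefault(charityId, 0L);
    }
}
```

Redelivering the same message leaves the total unchanged, so broker retries are safe.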
Optional: batch aggregation in the consumer
To reduce Redis and push load during bursts:
- Batch consume: Pull multiple messages at once (e.g. 50).
- Aggregate in memory: Sum amount_cents per charity_id for that batch.
- One Redis update per charity: e.g. HINCRBY charity:total:{charity_id} {sum} once per charity in the batch.
- One push per charity: Send a single WebSocket or notification per charity for the batch.
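The in-memory aggregation step can be sketched in Java as below. The `DonationEvent` type and the `BatchAggregatorSketch` class are hypothetical; the sketch covers only the summing, not the batch fetch from RabbitMQ or the Redis writes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical batch aggregator: collapses a batch of events into one
// increment per charity.
final class BatchAggregatorSketch {

    // Minimal event shape assumed for illustration.
    static final class DonationEvent {
        final String charityId;
        final long amountCents;

        DonationEvent(String charityId, long amountCents) {
            this.charityId = charityId;
            this.amountCents = amountCents;
        }
    }

    // Sums amount_cents per charity_id, preserving first-seen order.
    static Map<String, Long> sumByCharity(List<DonationEvent> batch) {
        Map<String, Long> sums = new LinkedHashMap<>();
        for (DonationEvent event : batch) {
            sums.merge(event.charityId, event.amountCents, Long::sum);
        }
        return sums;
    }
}
```

Each entry of the returned map corresponds to one increment and one push for that charity, so a batch of 50 messages for a single hot charity becomes a single Redis write.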
Summary
Keep the Webhook thin: verify, persist, publish, return 200. All stats (Redis and DB summary table) and optional push run in the Stats Consumer, with retries, DLQ, and idempotency so the system stays consistent and scalable.
8. Data Model
The data model keeps intent state, webhook deduplication, and read-optimized aggregates separated:
8.1 donation_intents
Table: donation_intents
One row per user donation intent, used as the durable business anchor for Stripe PaymentIntents.

| Column | Type | Key / Null | Description |
|---|---|---|---|
| intent_id | UUID | PK | Internal business identifier for the donation intent (also used in URLs and logs). |
| stripe_payment_intent_id | VARCHAR(255) | FK | Foreign reference to the Stripe PaymentIntent (`pi_...`). |
| | | | Donor email address, used for receipts and communication. |
| charity_id | VARCHAR(255) | FK | Identifier of the target charity receiving this donation. |
| amount_cents | INTEGER | | Donation amount in the smallest currency unit (e.g. cents). |
| status | VARCHAR(32) | | High-level state (CREATED, PAYMENT_PENDING, SUCCEEDED, FAILED, UNKNOWN, EXPIRED). |
| failure_reason | VARCHAR(255) | NULL | Optional machine-readable failure reason when the payment does not succeed. |
| retry_count | INTEGER | | Number of reconciliation attempts for this intent (used when Stripe returns processing). |
| next_reconcile_at | TIMESTAMP | NULL | When the reconciliation worker should re-check this intent (exponential backoff). |
| last_stripe_status | VARCHAR(255) | NULL | Last status returned from Stripe (e.g. processing) for debugging and backoff logic. |
| created_at | TIMESTAMP | | When the intent row was first created. |
| updated_at | TIMESTAMP | | Last time the intent row was updated. |
8.2 webhook_events
Table: webhook_events
Stores Stripe webhook deliveries for idempotent processing and auditability.

| Column | Type | Key / Null | Description |
|---|---|---|---|
| stripe_event_id | VARCHAR(255) | PK | Unique Stripe event id, used to deduplicate webhook deliveries. |
| intent_id | UUID | FK | Foreign key back to `donation_intents.intent_id`. |
| received_at | TIMESTAMP | | When this webhook was first received by the system. |
8.3 charity_stats
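Table: charity_stats
Aggregated donation statistics per charity for fast reads.

| Column | Type | Key / Null | Description |
|---|---|---|---|
| charity_id | VARCHAR(255) | PK | Identifier of the charity (matches the primary key in the charities table). |
| total_amount_cents | BIGINT | | Total donated amount for this charity in cents (64-bit, because a popular charity's running total can exceed the 32-bit INTEGER range at this scale). |
| donation_count | INTEGER | | Number of successful donations recorded for this charity. |
| updated_at | TIMESTAMP | | Last time the aggregated stats row was updated. |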
9. Idempotency Strategy
9.1 Confirm Idempotency
- Gateway-level: Stripe PaymentIntent prevents duplicate charges at the provider boundary.
- Business-level: internal donation_intents transitions ensure each intent is finalized at most once.
9.2 Webhook Idempotency
- Storage guard: webhook_events enforces unique stripe_event_id, so the same webhook cannot be applied twice.
9.3 Consumer Idempotency
- Message guard: consumers deduplicate by business key (e.g. intentId + SUCCEEDED).
- Replay safety: event replay is supported without double-counting stats.
10. Alternatives Considered
Direct Charge Without Internal Intent
Rejected. Why:
- Hard to reconcile against Stripe without an internal durable anchor
- No single place to reason about business-level state
Synchronous Stats Update
Rejected. Why:
- Slows down the confirmation endpoint
- Couples the hot payment path to aggregation and reporting
No Reconciliation Worker
Rejected. Why:
- Webhooks are not guaranteed to be delivered
- Timeouts and network partitions are inevitable; you need an explicit repair loop
11. Risks & Mitigations
Duplicate Charges
Risk: Users might be charged twice in edge cases.
Mitigation: rely on Stripe PaymentIntent idempotency + enforce a single successful transition per intent.
Lost Webhooks
Risk: Missing webhooks leave intents stuck in PAYMENT_PENDING / UNKNOWN.
Mitigation: Reconciliation Worker periodically re-queries Stripe and finalizes intent state.
Stats Inconsistency
Risk: charity_stats and Redis totals can diverge.
Mitigation: consumer-side deduplication + replay-safe repair path.
12. Monitoring & Metrics
Payment Metrics
- Success rate
- Failure rate
- UNKNOWN rate
- Stripe latency p95 / p99
System Metrics
- RabbitMQ queue depth / consumer lag
- Reconcile backlog
- DB lock wait time
Alerts
- UNKNOWN ratio > 2%
- Spike in Webhook processing errors
- Reconciliation backlog growing over time
13. SLOs
Use explicit SLOs so failure handling and scaling decisions stay measurable:
- Payment success availability: >= 99.9% successful intent processing (excluding external issuer declines).
- Webhook acknowledgment latency: p95 < 2s, p99 < 5s.
- UNKNOWN ratio: < 2% over rolling 15 minutes.
- Reconciliation completion window: intents in UNKNOWN / PAYMENT_PENDING are revisited within configured backoff + TTL policy.
- Stats freshness: dashboard totals converge within an agreed lag window (e.g. <= 60s under normal load).
14. Scaling Strategy
- API tier: horizontally scale Donation API instances.
- Payment execution tier: delegate charge-path scaling to Stripe.
- Queue tier: shard/route RabbitMQ by intentId (routing key or hash).
- Consumer tier: auto-scale workers from queue depth + processing lag.
- Read tier: front high-QPS stats endpoints with Redis.
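Routing by `intentId` in the queue tier can be sketched in Java as a stable hash onto a fixed number of shards. The class name `QueueRoutingSketch`, the shard count, and the routing-key format `donation.succeeded.<shard>` are all assumptions for illustration; the design above does not prescribe them.

```java
// Hypothetical routing helper: maps an intentId to a stable shard so that all
// events for one intent are routed to the same queue.
final class QueueRoutingSketch {

    // floorMod keeps the result in [0, shardCount) even when hashCode is negative.
    static int shardFor(String intentId, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        return Math.floorMod(intentId.hashCode(), shardCount);
    }

    // Assumed routing-key format; adapt to the exchange bindings actually in use.
    static String routingKey(String intentId, int shardCount) {
        return "donation.succeeded." + shardFor(intentId, shardCount);
    }
}
```

Because the mapping depends only on the `intentId`, events for one intent are handled by one consumer in order, while different intents spread across shards. Changing the shard count remaps most keys, so it is best done while queues are drained.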
15. Summary
This design:
- Uses Stripe PaymentIntent in the intended, safe way to avoid duplicate charges.
- Separates business intents from payment execution; internal state machine is the anchor.
- Treats Stripe as the authoritative source of payment state, with webhooks as the primary signal and reconciliation as the fallback.
- Uses a Reconciliation Worker to handle UNKNOWN states and lost webhooks.
- Decouples stats and side effects (emails, notifications) via events.
- Maintains strong end-to-end idempotency guarantees.
16. Interview-Oriented Discussion Questions
If you are using this document to prepare for system design interviews, here are some follow-up questions to challenge your understanding:
- Hot path vs. cold path
  - Where is the true hot path in this system?
  - If you had to cut 50% of the complexity, what could you remove from the cold path without violating the SLOs?
- Failure modes and trade-offs
  - What happens if Stripe Webhooks are delayed by 30 minutes during peak traffic?
  - How would you surface “eventual success” vs “final failure” back to the user and to internal operations?
- Backpressure and rate limiting
  - How would you protect the Donation API from sudden traffic spikes (e.g., a celebrity tweet) without losing valid donations?
  - Where would you place rate limiting and circuit breakers (client, API gateway, Donation Service, Stripe)?
- Multi-tenant and per-charity isolation
  - How would you isolate noisy or misbehaving charities so that they do not affect others?
  - Would you shard donation_intents / charity_stats by charity, region, or something else?
- Schema and evolution
  - If you later add recurring donations or refunds, how would you extend the current data model and state machine?
  - Which parts of the design are most fragile under such changes?
- Cost and observability
  - Which components are likely to dominate your cloud bill (Stripe fees, DB, RabbitMQ, Redis, compute)?
  - What metrics and dashboards would you build first to catch regressions in payment success rate?
Try to answer these questions using the flows, state machine, and data model in this article. In a real interview, you can treat this system as a “pattern” and adapt it to any payment-heavy or intent-based workflow.