Charity Donation App

Reading guide: This article is written for backend engineers preparing for system design interviews or building real-world, payment-heavy systems.
By the end, you should be able to:

  • Explain why a charity donation platform needs an intent-based payment model on top of Stripe PaymentIntents
  • Sketch the end-to-end flow from “Click Donate” to eventual stats update
  • Reason about idempotency, webhooks, reconciliation, and event-driven analytics under high load (10M+ donations in 3 days)

1. Overview

Summary

A high-volume charity donation platform built on Stripe PaymentIntents and an internal intent-based state machine. RabbitMQ carries domain events, and a reconciliation worker backs up webhooks, so the system avoids duplicate charges, loses no payments, and converges to a consistent state under load.

Scale assumptions
Total volume: ~10M donations
Duration: 3-day event
Peak load: hundreds of QPS
Design assumptions: horizontal scaling of the Donation API, Stripe handling payment throughput, queue-based decoupling for stats and notifications
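
As a quick sanity check on the figures above, the average rate implied by the total volume and duration can be worked out directly. This is plain arithmetic on the stated assumptions, not a measured number:

```latex
\[
\text{average rate}
  = \frac{10^{7}\ \text{donations}}{3 \times 86{,}400\ \text{s}}
  = \frac{10^{7}}{259{,}200\ \text{s}}
  \approx 38.6\ \text{donations/s}
\]
```

A peak of a few hundred QPS is therefore several times the mean, which is the gap the queue-based decoupling is meant to absorb.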

Design goals

No duplicate charges
At most one successful charge per donation intent.
Enforced via Stripe idempotency keys and an internal state machine.
No lost payments
Every intent reaches a terminal state (SUCCEEDED or FAILED).
Webhooks plus a reconciliation worker resolve PENDING/UNKNOWN against Stripe.
Eventual consistency
The system converges to a consistent state under failures or delayed webhooks.
State machine, webhooks, and reconciliation worker.
Minimal PCI exposure
Card data never touches our servers.
Client-side tokenization (Stripe Elements); minimal PCI scope.
Near real-time charity statistics
Per-charity totals updated and exposed at high QPS.
Event-driven updates via RabbitMQ and Redis cache.

Functional requirements

Core business flows supported by the system.

Create donation intent
User selects charity, amount, and payment method; system creates an internal intent and a Stripe PaymentIntent.
Returns intent ID and client secret to the client.
Confirm payment
User completes payment on the client (Stripe Elements).
Stripe handles tokenization, 3DS, and authorization; client receives success, failure, or pending.
Webhook handling
System receives Stripe webhooks (payment_intent.succeeded / payment_intent.payment_failed).
Updates intent status; publishes events for stats and notifications.
Charity stats
Per-charity totals (amount, count) updated eventually and exposed via API.
Read path cached (e.g. Redis) for high QPS.
Receipt / notification
User receives a receipt (e.g. email) after successful payment.
Sending is asynchronous and idempotent.
Reconciliation
System periodically resolves intents stuck in PENDING/UNKNOWN by querying Stripe.
Publishes reconciled events so stats and side effects stay consistent.
Multi-charity
Multiple charities supported; each donation tied to a charity.
Stats isolated per charity.

Non-functional requirements

System guarantees under expected and peak load.

No duplicate charges
At most one successful charge per donation intent.
Stripe idempotency keys and internal state machine.
No lost payments
Every intent eventually reaches SUCCEEDED or FAILED.
Webhooks plus reconciliation worker cover missed or delayed webhooks.
Scalability
System handles ~10M donations over 3 days.
Donation API and consumers scale horizontally; payment throughput delegated to Stripe.
Latency
Intent creation and redirect/confirmation stay within acceptable latency.
Heavy work (stats, email) off the hot path.
PCI / security
Card data never touches our servers.
Client-side tokenization (Stripe Elements); minimal PCI scope.
Observability
Monitoring and alerts for success rate, failure rate, UNKNOWN rate, webhook/queue lag, reconciliation backlog.
Availability
Payment and intent creation remain available under expected load.
Dependencies (DB, Stripe, queue) have clear failure modes and mitigations.

2. System Architecture

Core components and their responsibilities in the donation workflow.

Client (Web / iOS / Android) [Client]
Collects donation details and payment method; confirms payment via Stripe SDK.
No card data on our servers.
Donation API Service [Core Service]
Creates and tracks donation intents; issues Stripe PaymentIntents; handles webhooks; publishes events.
Central orchestration; horizontal scaling.
Stripe [External]
Payment provider: PaymentIntents, tokenization, webhooks, idempotency.
Relational Database [Storage]
Primary source of truth for donation_intents, webhook_events, charity_stats.
Single source of truth; scales with Donation API.
RabbitMQ [Async]
Message broker / event bus for donation.succeeded and downstream consumers.
Decouples payment path from stats and notifications.
Notification Consumer [Async]
Consumes events; sends receipts (e.g. email) with idempotency.
Reconciliation Worker [Async]
Scans PENDING/UNKNOWN intents; queries Stripe to finalize state.
Publishes reconciled events so stats and side effects stay consistent.
Redis [Storage]
Stats cache for high-QPS charity totals.
Read path for per-charity totals.

Figure: high-level architecture of the Charity Donation App.

3. Payment Model Overview (Stripe)

Capabilities provided by Stripe

PaymentIntents API

Create and confirm payment intents; single primitive for charge lifecycle.

Client-side tokenization (Stripe Elements)

Collect card details securely without touching our servers.

Webhooks

Real-time events for payment_intent.succeeded and payment_intent.payment_failed.

Idempotency Keys

At-most-once charge per key; we pass a key per donation intent.

Transaction lookup APIs

Query Stripe to resolve PENDING/UNKNOWN and reconcile state.

How this system uses them

Stripe
Stripe Elements

Minimizes PCI exposure; card data never touches our servers.

Stripe PaymentIntent

External payment execution primitive; we create and confirm via API.

Our system
Internal donation_intents table

Business-level intent state anchor; state machine lives here.

High-level idea

Money Flow

Stripe PaymentIntent is the source of truth.

Business Flow

Internal donation_intents table is the state machine.

Webhooks + a reconciliation worker bridge the two worlds and guarantee eventual consistency.


4. End-to-End Payment Flow

4.1 Sequence Diagram

Create Intent
  1. Client creates donation intent
  2. Server creates Stripe PaymentIntent
Payment Confirmation
  1. Client confirms payment
Webhook & Publish
  1. Stripe sends webhook
  2. System updates state + publishes event
Reconcile
  1. Reconciliation worker resolves pending/unknown states
System guarantees
  • No duplicate charges

    Idempotency keys and webhook deduplication ensure at-most-once charge per intent.

  • No lost payments

    Webhooks plus a reconciliation worker cover failures and long-running payments.

  • Eventual consistency

    Stripe and internal state stay in sync via webhooks and reconciliation.

4.2 Detailed Flow (Stripe-based)

Intent Creation

SYNC

Flow: Client → API → DB → Stripe → Client

Main Flow
  • Client sends charityId, amount_cents, currency, email (optional request_id for idempotency).
  • Server validates input; uses Redis SETNX to lock request_id (5–10s) to prevent duplicate intent creation.
  • Insert row into donation_intents with status = CREATED; obtain intentId.
  • Create Stripe PaymentIntent with amount, currency, metadata: { intentId }.
  • Return intentId and client_secret to the client.
Guarantee: At-most-once intent creation
  • request_id → same intentId + client_secret (idempotent response).
  • Redis SETNX lock prevents duplicate inserts for the same request_id.
API / Webhook involved
POST /v1/donations/intent
DB writes
donation_intents (intent_id, status=CREATED, charity_id, amount_cents, stripe_payment_intent_id)
Failure & retry
  • Client retries with same request_id; server returns existing intent.
  • Stripe PaymentIntent create is retried with backoff, reusing the same idempotency key so a retry cannot create a second PaymentIntent.
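
The idempotent-response behaviour described above (same request_id, same intentId and client_secret) can be sketched in a few lines. This is a minimal in-memory illustration, not the service's actual code: `IntentCreator`, `IntentResponse`, and the placeholder secret are hypothetical names, and a `ConcurrentHashMap` stands in for the database plus Redis lock.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical response type: what the API returns to the client.
final class IntentResponse {
    final String intentId;
    final String clientSecret;

    IntentResponse(String intentId, String clientSecret) {
        this.intentId = intentId;
        this.clientSecret = clientSecret;
    }
}

// In-memory stand-in for the request_id lookup. A real service would rely on
// the database and the Redis lock described above instead of a local map.
class IntentCreator {
    private final Map<String, IntentResponse> byRequestId = new ConcurrentHashMap<>();

    IntentResponse createIntent(String requestId, String charityId, long amountCents) {
        // computeIfAbsent runs the factory at most once per key, so retries
        // with the same request_id get the response stored by the first call.
        return byRequestId.computeIfAbsent(requestId, id -> {
            // A real implementation would insert the donation_intents row
            // (charityId, amountCents, status = CREATED) and call Stripe here.
            String intentId = UUID.randomUUID().toString();
            String clientSecret = "placeholder_secret_" + intentId; // not a real Stripe secret
            return new IntentResponse(intentId, clientSecret);
        });
    }
}
```

Calling `createIntent("req-1", ...)` twice returns the same pair both times, while a different request_id yields a new intent.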

Payment Confirmation

SYNC

Flow: Client → Stripe (no server in path)

Main Flow
  • User enters card details in Stripe-hosted fields (PCI-safe); client calls stripe.confirmCardPayment(client_secret).
  • Stripe handles tokenization, 3DS (SCA), authorization and capture.
  • Client receives immediate success/failure or "processing" (finalized later via webhook).
  • No server call in this step; Donation Service does not see card data.
Guarantee: No duplicate confirmations
  • A PaymentIntent can succeed at most once, so a repeated confirm cannot create a second charge.
  • Client can send optional request_id for idempotent UI retries.
API / Webhook involved
Stripe.js confirmCardPayment (client-side)
DB writes
None in this phase (state updated in Webhook Processing).
Failure & retry
  • Client can retry confirmCardPayment safely; a PaymentIntent that has already succeeded is not charged again.
  • For "processing", wait for webhook; no server retry in this phase.

Webhook Processing

ASYNC

Flow: Stripe → API → DB + RabbitMQ

Main Flow
  • Stripe sends POST to /v1/webhooks/stripe: payment_intent.succeeded, payment_intent.payment_failed, payment_intent.processing.
  • Deduplicate by Stripe event_id (store in webhook_events or Redis); Redis SETNX lock per payment_intent (5–10s).
  • Read intentId from PaymentIntent metadata; update donation_intents to SUCCEEDED or FAILED.
  • Publish domain event to RabbitMQ (e.g. donation.succeeded) for stats and email consumer.
Guarantee: At-most-once webhook processing
  • event_id stored; at-most-once processing per event.
  • Redis lock per payment_intent (5–10s) prevents duplicate handler runs.
API / Webhook involved
POST /v1/webhooks/stripe
payment_intent.succeeded / payment_intent.payment_failed / payment_intent.processing
DB writes
webhook_events (event_id); donation_intents (status); outbox if used.
Failure & retry
  • Stripe retries webhook with exponential backoff.
  • Return 200 after processing to avoid duplicate delivery.

Reconciliation / Convergence

RECONCILE

Flow: Worker → Stripe API → DB + RabbitMQ

Main Flow
  • Reconciliation worker periodically scans donation_intents with status PENDING or UNKNOWN.
  • Queries Stripe API for each PaymentIntent to get final state (succeeded, failed, expired).
  • Updates donation_intents to SUCCEEDED / FAILED / EXPIRED; publishes reconciled events to RabbitMQ.
  • Downstream consumers (stats, email) process events idempotently; DB and Redis converge.
Guarantee: Read-only convergence (toward Stripe), no duplicate charges
  • Worker updates the local row only if internal state is still PENDING/UNKNOWN, and it only reads from Stripe; it never creates or confirms a charge.
  • Consumers use event_id or intent_id for dedupe; no duplicate charges from reconciliation.
API / Webhook involved
Stripe API: retrieve PaymentIntent (server-side)
DB writes
donation_intents (status); charity_stats; Redis cache (per-charity totals).
Failure & retry
  • Worker retries on next run; Stripe API retries with backoff.
  • Reconciliation only reads from Stripe, so worker retries cannot create duplicate charges.
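
The worker's "query Stripe, then finalize" step needs some mapping from the PaymentIntent status string to an internal terminal state. The sketch below is one conservative way to do it and rests on assumptions: `StripeStatusMapper` is a hypothetical helper, mapping `canceled` to FAILED is a choice made for this example, and every other status (including `processing`) is treated as unresolved and left for a later run or the TTL.

```java
import java.util.Optional;

// Internal states as listed in the donation_intents schema.
enum IntentStatus { CREATED, PAYMENT_PENDING, UNKNOWN, SUCCEEDED, FAILED, EXPIRED }

class StripeStatusMapper {
    // Returns the terminal internal status for a Stripe PaymentIntent status,
    // or empty when the payment is still in flight and should be re-checked.
    static Optional<IntentStatus> toTerminal(String stripeStatus) {
        switch (stripeStatus) {
            case "succeeded":
                return Optional.of(IntentStatus.SUCCEEDED);
            case "canceled":
                // Assumption of this sketch: a canceled PaymentIntent is a failed donation.
                return Optional.of(IntentStatus.FAILED);
            default:
                // "processing" and any other status: no local change yet.
                return Optional.empty();
        }
    }
}
```

Returning an empty Optional for anything unrecognized keeps the worker from finalizing an intent on a status it does not understand; EXPIRED is deliberately absent here because it is set by the worker's own TTL rather than by a Stripe status.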
Guarantee mapping: which phases enforce each guarantee
  • No duplicate charges
    Intent Creation (request_id + Redis lock), Webhook Processing (event_id + Redis lock per payment_intent).
  • No lost payments
    Webhook Processing (Stripe retries); Reconciliation (worker resolves PENDING/UNKNOWN and long-running processing).
  • Eventual consistency
    Webhook Processing (updates DB + publish); Reconciliation (worker converges state); consumers (idempotent stats + Redis).

5. Payment State Machine

Visual flow

PENDING → PROCESSING → SUCCEEDED / FAILED

State definitions

PENDING

Intent created; payment not yet attempted.

Created when client requests an intent with charityId, amount, and optional request_id. Redis SETNX lock for idempotency.

PROCESSING

Payment in progress; waiting for Stripe and webhook.

Intent stays here until Stripe confirms success, failure, or timeout. Webhook or reconciliation will transition to final state.

SUCCEEDED

Payment succeeded. Final state.

Stats and side effects (e.g. notifications) are driven by events. No further transitions allowed.

FAILED

Payment failed or expired. Final state.

failure_reason is recorded. No further charge attempts.

Transition table

Initial state | Event | Target state | Action
PENDING | Client confirms payment | PROCESSING | Call Stripe; create/update PaymentIntent
PROCESSING | Webhook: payment_intent.succeeded | SUCCEEDED | Update intent; publish donation.succeeded
PROCESSING | Webhook: payment_intent.payment_failed | FAILED | Update intent; set failure_reason
PROCESSING | Reconciliation (timeout) | FAILED | Resolve via Stripe API

LOGIC HIGHLIGHTING

Transition guard

Only when current state is PROCESSING (or UNKNOWN for reconciliation) do we accept a webhook update to SUCCEEDED. Use WHERE status IN ('PROCESSING', 'UNKNOWN') and atomic update so only one transition wins.

Idempotency

Same request_id returns the same intent_id and client_secret. Redis SETNX lock prevents duplicate intent creation. Once in SUCCEEDED or FAILED, no further charge is possible.

Concurrency handling

Webhook and reconciliation may both try to update the same intent. Use atomic updates and publish events after DB commit for eventual consistency.
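
The transition guard and concurrency rule above amount to a compare-and-set on the status column. The following sketch emulates that with an in-memory map; `IntentStateStore` and `finalizeIntent` are hypothetical names, and the SQL in the comment is the statement the map operation stands in for.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class IntentStateStore {
    private final Map<String, String> statusByIntent = new ConcurrentHashMap<>();

    void put(String intentId, String status) {
        statusByIntent.put(intentId, status);
    }

    String get(String intentId) {
        return statusByIntent.get(intentId);
    }

    // Stands in for:
    //   UPDATE donation_intents SET status = :target
    //   WHERE intent_id = :id AND status IN ('PROCESSING', 'UNKNOWN')
    // Returns true only for the caller whose update actually changed the row.
    boolean finalizeIntent(String intentId, String target) {
        // replace(key, expected, new) is atomic: it succeeds only if the
        // current value still equals the expected one.
        return statusByIntent.replace(intentId, "PROCESSING", target)
                || statusByIntent.replace(intentId, "UNKNOWN", target);
    }
}
```

If the webhook handler and the reconciliation worker race on the same intent, exactly one call returns true. Publishing the domain event only when the call returns true, and only after the commit, keeps downstream consumers from seeing two finalizations.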


6. Reconciliation Flow

Reconciliation is responsible for cleaning up long-lived PENDING / UNKNOWN intents when:

  • Webhooks are not received
  • Stripe API calls time out
  • PAYMENT_PENDING has lasted longer than a safe threshold
Pending/Unknown intents → Reconciliation Worker → Query Stripe API → Stripe response → Update donation_intents → Publish reconciled event
Final: SUCCEEDED / FAILED / EXPIRED

When Stripe returns processing, the worker only updates next_reconcile_at and retry_count (no event published); see §6.2.

6.1 Reconciliation Sequence (Stripe + RabbitMQ)

6.2 When Stripe Returns processing (Long-Running Payment)

Principle
When Stripe returns `processing`, do **not** mark the intent as failed. Use **exponential backoff** to reschedule checks and a **final TTL** (e.g. 24h); only then mark as **EXPIRED** if still unresolved.

If the Reconciliation Worker queries Stripe and the PaymentIntent is still processing (or in API terms, not yet succeeded or failed), the payment is in a long-running state. Examples: async payment methods (wire transfer, Sofort, SEPA), bank fraud checks, or 3DS opened but not completed. The charge may complete minutes or hours later.


Standard handling: exponential backoff + final TTL

Step 1 — Preserve non-terminal state

Keep local status as PAYMENT_PENDING (or UNKNOWN). Only move to SUCCEEDED, FAILED, or EXPIRED when appropriate.

Step 2 — Exponential backoff reschedule

Set a next check time with increasing intervals (e.g. 5 min → 15 min → 30 min → 1 h → 4 h). Store this in next_reconcile_at on donation_intents. The worker only picks intents where next_reconcile_at <= NOW().

Step 3 — Retry counter

Increment retry_count each time you re-query and still get processing. Use it to compute the next interval and to cap retries or alert if abnormally high.

Step 4 — Final TTL expiration

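Define a maximum wait (e.g. 24 hours). If Stripe still returns processing after that, mark the intent as EXPIRED. In practice Stripe usually resolves within 24h, but if you accept async payment methods that can take 1–3 business days, the TTL must be longer than their settlement time or legitimate payments will be expired while still in flight. For large amounts you may notify the user or support.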


Example logic (Java):

Java
import java.time.Duration;
import java.time.Instant;

// Assumes DonationIntent, IntentStatus, UpdateIntent, donationIntentRepository,
// calculateNextRetry and MAX_TTL_HOURS are defined elsewhere in the service.
public void reconcileProcessingIntent(DonationIntent intent, PaymentIntent stripeIntent) {
    // Only the long-running case is handled here; other statuses are resolved elsewhere.
    // See Stripe docs: https://stripe.com/docs/payments/payment-intents/lifecycle
    if (!"processing".equals(stripeIntent.getStatus())) {
        return;
    }

    // Final TTL: has the intent been open longer than the maximum wait (e.g. 24 hours)?
    boolean isExpired = intent.getCreatedAt()
            .plus(Duration.ofHours(MAX_TTL_HOURS))
            .isBefore(Instant.now());

    if (isExpired) {
        // Stop waiting. A final Stripe lookup before this write is a reasonable extra check.
        donationIntentRepository.updateStatus(intent.getId(), IntentStatus.EXPIRED);
        return;
    }

    // Still within the TTL: schedule the next check with backoff
    // (e.g. 5 min, 15 min, 30 min, 1 h, 4 h) and bump the retry counter.
    Instant nextCheckTime = calculateNextRetry(intent.getRetryCount());
    donationIntentRepository.update(intent.getId(), UpdateIntent.builder()
            .nextReconcileAt(nextCheckTime)
            .retryCount(intent.getRetryCount() + 1)
            .lastStripeStatus("processing")
            .build());
}
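
The example above calls calculateNextRetry without defining it. One possible shape is sketched below, using the interval table from Step 2 (5 min, 15 min, 30 min, 1 h, 4 h) rather than a strict doubling formula. `ReconcileBackoff` is a hypothetical class, and the extra `now` parameter is an assumption added so the function can be tested without a real clock.

```java
import java.time.Duration;
import java.time.Instant;

class ReconcileBackoff {
    // Interval table from Step 2; retries beyond the table stay at the last value.
    private static final Duration[] SCHEDULE = {
        Duration.ofMinutes(5),
        Duration.ofMinutes(15),
        Duration.ofMinutes(30),
        Duration.ofHours(1),
        Duration.ofHours(4)
    };

    // Delay to wait after the given number of previous reconciliation attempts.
    static Duration delayFor(int retryCount) {
        int index = Math.min(Math.max(retryCount, 0), SCHEDULE.length - 1);
        return SCHEDULE[index];
    }

    // Value to store in next_reconcile_at.
    static Instant calculateNextRetry(int retryCount, Instant now) {
        return now.plus(delayFor(retryCount));
    }
}
```

Capping at the last interval means an intent that is still processing is re-checked every 4 hours until the final TTL from Step 4 takes over.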

Why processing happens

  • Async payment methods (wire, Sofort, SEPA, etc.) can take 1–3 business days.
  • Bank delays (extra fraud or compliance checks).
  • 3DS pending (user opened verification but did not complete or close).

Index for the worker

Use a composite index on (status, next_reconcile_at) so each run can efficiently select intents that are due for a check:

SQL
SELECT intent_id, stripe_payment_intent_id, retry_count FROM donation_intents
WHERE status IN ('PAYMENT_PENDING', 'UNKNOWN') AND next_reconcile_at <= NOW();

This avoids full table scans and keeps reconciliation bounded to intents that are due for a check. The schema in §8.1 includes retry_count, next_reconcile_at, and last_stripe_status for this flow.


7. Event-Driven Stats Update

After a donation succeeds, downstream updates should be event-driven (stats, notifications, audit):

Consumer responsibilities

  • Deduplicate events (e.g. by intentId + SUCCEEDED).
  • Retry on transient failures with backoff.
  • Maintain eventual consistency between DB, cache, and side effects.

7.1 Summary Table (charity_stats): Real-Time vs Consistency

Principle
Use a dedicated summary table for per-charity totals; populate it with a strategy that balances real-time needs and consistency (e.g. async consumer + atomic updates + periodic calibration).

Why a separate table?

Although you could compute totals with SELECT SUM(amount_cents) FROM donation_intents WHERE charity_id = 1 AND status = 'SUCCEEDED', once donation volume reaches tens or hundreds of thousands of rows, this query becomes slow and can overload the database. A pre-aggregated charity_stats table (see §8.3) gives fast reads for dashboards and APIs.

Schema (reference): charity_id (PK), total_amount_cents, donation_count, updated_at.

How the table is populated — three common approaches:

A. In-transaction sync update
  • Description: In the same DB transaction that writes the donation/transaction, run UPDATE charity_stats SET total_amount_cents = total_amount_cents + :amount WHERE charity_id = :id.
  • Pros: Strong consistency; DB totals are always correct.
  • Cons: Row lock contention. A hot charity with many donations per second serializes on that single row; throughput drops.
B. Async message-queue driven (recommended)
  • Description: Webhook only persists the transaction and publishes an event; a Stats Worker (RabbitMQ consumer) runs UPDATE charity_stats ....
  • Pros: Peak smoothing. Bursts of donations are processed in the background; the webhook path stays fast.
  • Cons: Slight delay (e.g. hundreds of ms), acceptable for dashboards.
C. Redis increment + periodic flush
  • Description: All real-time increments go to Redis (e.g. HINCRBY); a worker every N minutes flushes Redis deltas into charity_stats.
  • Pros: Very low DB write load; good for extreme peaks (e.g. thousands of donations/sec).
  • Cons: More moving parts; still need a flush worker and calibration.

Best practice: incremental atomic update

Always update the summary table with atomic in-place addition, not read-then-write.

Wrong (race-prone):

SQL
-- Read then write: another request can overwrite your update
SELECT total_amount_cents FROM charity_stats WHERE charity_id = 1;
-- application: new_total = old + 50
UPDATE charity_stats SET total_amount_cents = 150 WHERE charity_id = 1;

Correct (atomic):

SQL
UPDATE charity_stats
SET total_amount_cents = total_amount_cents + :amount_cents,
donation_count = donation_count + 1,
updated_at = NOW()
WHERE charity_id = :charity_id;

Eventually consistent safety net (reconciliation as calibrator)

No matter which approach you use, the summary table can drift from the transaction table (e.g. due to bugs or partial failures). A Reconciliation Worker should act as a calibrator:

  • Periodically (e.g. daily) run:
    SELECT charity_id, SUM(amount_cents) AS total, COUNT(*) AS donation_count FROM donation_intents WHERE status = 'SUCCEEDED' GROUP BY charity_id.
  • Compare with charity_stats and UPDATE charity_stats with the computed totals and counts to correct drift.

This keeps the summary table eventually consistent with the source of truth (the transaction/intent table).
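
The calibration pass can be pictured as "recompute, compare, correct". The sketch below does this over in-memory collections; `Donation`, `StatsCalibrator`, and both method names are hypothetical, and a real job would run the GROUP BY query in the database rather than load rows into memory.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal projection of a donation_intents row for this example.
final class Donation {
    final String charityId;
    final long amountCents;
    final String status;

    Donation(String charityId, long amountCents, String status) {
        this.charityId = charityId;
        this.amountCents = amountCents;
        this.status = status;
    }
}

class StatsCalibrator {
    // Equivalent of: SELECT charity_id, SUM(amount_cents) ... WHERE status = 'SUCCEEDED' GROUP BY charity_id
    static Map<String, Long> recomputeTotals(List<Donation> intents) {
        Map<String, Long> totals = new HashMap<>();
        for (Donation d : intents) {
            if ("SUCCEEDED".equals(d.status)) {
                totals.merge(d.charityId, d.amountCents, Long::sum);
            }
        }
        return totals;
    }

    // Returns the corrected total for every charity whose stored value has drifted.
    static Map<String, Long> findDrift(Map<String, Long> stored, Map<String, Long> recomputed) {
        Map<String, Long> corrections = new HashMap<>();
        for (Map.Entry<String, Long> e : recomputed.entrySet()) {
            if (!e.getValue().equals(stored.get(e.getKey()))) {
                corrections.put(e.getKey(), e.getValue());
            }
        }
        // Charities with a stored total but no succeeded donations should be reset to 0.
        for (Map.Entry<String, Long> e : stored.entrySet()) {
            if (!recomputed.containsKey(e.getKey()) && e.getValue() != 0L) {
                corrections.put(e.getKey(), 0L);
            }
        }
        return corrections;
    }
}
```

Returning only the drifted rows keeps the corrective UPDATE small, and the size of that map is itself a useful metric: a non-empty result on most runs suggests a bug upstream rather than occasional noise.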

Recommendation (summary)

  • Structure: Keep a charity_stats table (§8.3).
  • Population: Have the RabbitMQ consumer (Stats Worker) that already handles donation events asynchronously update charity_stats after updating Redis (approach B).
  • Safety: Use atomic SET total_amount_cents = total_amount_cents + :inc (and same for donation_count).
  • Calibration: Run a daily (or periodic) full SUM from the transaction/intent table and correct charity_stats to handle any residual drift.

7.2 Where to Update Redis: In the Consumer, Not the Webhook

Principle
Redis totals are updated in the message consumer, not in the Webhook. This gives a single writer for stats and eventual consistency without overloading the Webhook path.

Why not update Redis in the Webhook?

  • Distributed transaction risk: If the Webhook updates the DB and then Redis, and Redis fails (e.g. network blip) after you have already returned 200 OK to Stripe, the Redis total is short by that donation until a reconciliation or calibration run. You cannot roll back the 200.
  • Latency and throughput: Stripe holds the HTTP connection until the Webhook responds. Any extra work (Redis, push, etc.) in the Webhook slows the response and limits how many Webhooks you can handle per second.
  • Heavy logic: Even with locks, doing stats and push in the Webhook makes the hot path complex and increases the chance of timeouts and retries.

Standard flow: Webhook as producer, consumer as single writer

Step | Webhook (producer) | Stats consumer
1 | Verify signature; persist intent status SUCCEEDED in DB. | -
2 | Publish one message to RabbitMQ (e.g. charity_id, amount_cents, intent_id). | -
3 | Return 200 OK to Stripe immediately. | -
4 | - | Update Redis (e.g. HINCRBY charity:total:{charity_id} amount_cents {amount}).
5 | - | Optionally trigger push (WebSocket / batch notification).
6 | - | ACK the message only after Redis (and DB if applicable) update succeeds.

If the consumer does not ACK (e.g. Redis is down), the broker will redeliver the message. The 200 sent to Stripe acknowledged only the durable DB write and the publish, so a failed stats update is retried from the queue without involving Stripe again.
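
Step 1 of the flow above starts with "verify signature". Stripe signs each webhook with HMAC-SHA256 over the string `<timestamp>.<raw body>` and sends the result in the Stripe-Signature header as `t=...,v1=...`. The sketch below shows that check using only the JDK; `StripeSignature` is a hypothetical class, the tolerance window is an assumption, and in production the official stripe-java helper (`Webhook.constructEvent`) is the safer choice.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class StripeSignature {
    // HMAC-SHA256 over "<timestamp>.<rawBody>", returned as lowercase hex.
    static String sign(String secret, long timestamp, String rawBody) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal((timestamp + "." + rawBody).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }

    // Returns true if the header carries a valid v1 signature for rawBody and
    // the timestamp is within toleranceSeconds of nowEpochSeconds.
    static boolean verify(String header, String rawBody, String secret,
                          long nowEpochSeconds, long toleranceSeconds) {
        long timestamp = -1;
        String v1 = null;
        for (String part : header.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length != 2) {
                continue;
            }
            if (kv[0].equals("t")) {
                try {
                    timestamp = Long.parseLong(kv[1]);
                } catch (NumberFormatException e) {
                    return false;
                }
            } else if (kv[0].equals("v1")) {
                v1 = kv[1]; // this sketch keeps only the last v1 entry
            }
        }
        if (timestamp < 0 || v1 == null) {
            return false;
        }
        if (Math.abs(nowEpochSeconds - timestamp) > toleranceSeconds) {
            return false; // too old or too far in the future: possible replay
        }
        byte[] expected = sign(secret, timestamp, rawBody).getBytes(StandardCharsets.UTF_8);
        // MessageDigest.isEqual compares in constant time.
        return MessageDigest.isEqual(expected, v1.getBytes(StandardCharsets.UTF_8));
    }
}
```

The check must run on the raw request body exactly as received; re-serializing parsed JSON changes the bytes and the signature no longer matches.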

Failure handling in the consumer

  • Redis down: Do not ACK. The queue will retry (e.g. after 5 seconds). After N failed attempts, send the message to a dead-letter queue (DLQ) and alert; a compensation job or operator can replay or fix once Redis is healthy.
  • Duplicate delivery: If the consumer updates Redis but crashes before ACK, the same message can be processed again. Use idempotency: e.g. in Redis, SETNX processed:intent:{intent_id} with a TTL (e.g. 24 hours). If SETNX fails, treat the message as already processed—skip the increment and ACK to avoid double-counting.
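
The duplicate-delivery rule can be sketched as "mark first, then increment". In the example below an in-memory set stands in for `SETNX processed:intent:{intent_id}` and a map stands in for the Redis totals; `StatsConsumer` and its methods are hypothetical names, and TTL handling is left out.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class StatsConsumer {
    // Stand-in for the Redis keys processed:intent:{intent_id}.
    private final Set<String> processed = ConcurrentHashMap.newKeySet();
    // Stand-in for the per-charity totals kept in Redis.
    private final Map<String, Long> totals = new ConcurrentHashMap<>();

    // Returns true if the donation was counted, false if it was a duplicate.
    // In both cases the caller can ACK the message.
    boolean handle(String intentId, String charityId, long amountCents) {
        // Set.add returns false when the element is already present, which
        // plays the role of a failed SETNX.
        if (!processed.add(intentId)) {
            return false;
        }
        totals.merge(charityId, amountCents, Long::sum);
        return true;
    }

    long total(String charityId) {
        return totals.getOrDefault(charityId, 0L);
    }
}
```

One caveat of this ordering: if the process dies after the marker is set but before the increment, that donation is skipped on redelivery. The periodic calibration described in §7.1 is what closes that gap.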

Optional: batch aggregation in the consumer

To reduce Redis and push load during bursts:

  • Batch consume: Pull multiple messages at once (e.g. 50).
  • Aggregate in memory: Sum amount_cents per charity_id for that batch.
  • One Redis update per charity: e.g. HINCRBY charity:total:{charity_id} amount_cents {sum} once per charity in the batch.
  • One push per charity: Send a single WebSocket or notification per charity for the batch.
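
The in-memory aggregation step is a simple group-and-sum. A minimal sketch, assuming a hypothetical `DonationMessage` type carrying the fields published by the webhook:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical message shape: the fields the webhook publishes.
final class DonationMessage {
    final String intentId;
    final String charityId;
    final long amountCents;

    DonationMessage(String intentId, String charityId, long amountCents) {
        this.intentId = intentId;
        this.charityId = charityId;
        this.amountCents = amountCents;
    }
}

class BatchAggregator {
    // Sums amount_cents per charity_id so each charity needs one Redis update per batch.
    static Map<String, Long> sumByCharity(List<DonationMessage> batch) {
        Map<String, Long> sums = new TreeMap<>();
        for (DonationMessage m : batch) {
            sums.merge(m.charityId, m.amountCents, Long::sum);
        }
        return sums;
    }
}
```

Deduplication still has to happen per message before aggregation, and the whole batch should be ACKed only after every per-charity update has succeeded.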

Summary

Keep the Webhook thin: verify, persist, publish, return 200. All stats (Redis and DB summary table) and optional push run in the Stats Consumer, with retries, DLQ, and idempotency so the system stays consistent and scalable.


8. Data Model

The data model keeps intent state, webhook deduplication, and read-optimized aggregates separated:

8.1 donation_intents

TABLE
donation_intents

One row per user donation intent, used as the durable business anchor for Stripe PaymentIntents.

intent_id (UUID, PK)
Internal business identifier for the donation intent (also used in URLs and logs).
stripe_payment_intent_id (VARCHAR(255), FK)
Foreign reference to the Stripe PaymentIntent (`pi_...`).
email (VARCHAR(255))
Donor email address, used for receipts and communication.
charity_id (VARCHAR(255), FK)
Identifier of the target charity receiving this donation.
amount_cents (INTEGER)
Donation amount in the smallest currency unit (e.g. cents).
status (VARCHAR(32))
High-level state (CREATED, PAYMENT_PENDING, SUCCEEDED, FAILED, UNKNOWN, EXPIRED).
failure_reason (VARCHAR(255), NULL)
Optional machine-readable failure reason when the payment does not succeed.
retry_count (INTEGER)
Number of reconciliation attempts for this intent (used when Stripe returns processing).
next_reconcile_at (TIMESTAMP, NULL)
When the reconciliation worker should re-check this intent (exponential backoff).
last_stripe_status (VARCHAR(255), NULL)
Last status returned from Stripe (e.g. processing) for debugging and backoff logic.
created_at (TIMESTAMP)
When the intent row was first created.
updated_at (TIMESTAMP)
Last time the intent row was updated.
Indexes: (status, next_reconcile_at), (status, updated_at), (charity_id, status)

8.2 webhook_events

TABLE
webhook_events

Stores Stripe webhook deliveries for idempotent processing and auditability.

stripe_event_id (VARCHAR(255), PK)
Unique Stripe event id, used to deduplicate webhook deliveries.
intent_id (UUID, FK)
Foreign key back to `donation_intents.intent_id`.
received_at (TIMESTAMP)
When this webhook was first received by the system.
Indexes: (stripe_event_id), (intent_id)

8.3 charity_stats

TABLE
charity_stats

Aggregated donation statistics per charity for fast reads.

charity_id (VARCHAR(255), PK)
Identifier of the charity (matches the primary key in the charities table).
total_amount_cents (BIGINT)
Total donated amount for this charity in cents. Use a 64-bit type here: a 32-bit INTEGER tops out at about 2.1 billion cents (roughly $21M), which a popular charity can exceed at this scale.
donation_count (INTEGER)
Number of successful donations recorded for this charity.
updated_at (TIMESTAMP)
Last time the aggregated stats row was updated.
Indexes: (charity_id)

9. Idempotency Strategy

9.1 Confirm Idempotency

  • Gateway-level: Stripe PaymentIntent prevents duplicate charges at the provider boundary.
  • Business-level: internal donation_intents transitions ensure each intent is finalized at most once.

9.2 Webhook Idempotency

  • Storage guard: webhook_events enforces unique stripe_event_id, so the same webhook cannot be applied twice.

9.3 Consumer Idempotency

  • Message guard: consumers deduplicate by business key (e.g. intentId + SUCCEEDED).
  • Replay safety: event replay is supported without double-counting stats.

10. Alternatives Considered

Direct Charge Without Internal Intent

Rejected
Why:
  • Hard to reconcile against Stripe without an internal durable anchor
  • No single place to reason about business-level state

Synchronous Stats Update

Rejected
Why:
  • Slows down the confirmation endpoint
  • Couples the hot payment path to aggregation and reporting

No Reconciliation Worker

Rejected
Why:
  • Webhooks are not guaranteed to be delivered
  • Timeouts and network partitions are inevitable; you need an explicit repair loop

11. Risks & Mitigations

Duplicate Charges

Risk: Users might be charged twice in edge cases.
Mitigation: rely on Stripe PaymentIntent idempotency + enforce a single successful transition per intent.

Lost Webhooks

Risk: Missing webhooks leave intents stuck in PAYMENT_PENDING / UNKNOWN.
Mitigation: Reconciliation Worker periodically re-queries Stripe and finalizes intent state.

Stats Inconsistency

Risk: charity_stats and Redis totals can diverge.
Mitigation: consumer-side deduplication + replay-safe repair path.


12. Monitoring & Metrics

Payment Metrics

  • Success rate
  • Failure rate
  • UNKNOWN rate
  • Stripe latency p95 / p99

System Metrics

  • RabbitMQ queue depth / consumer lag
  • Reconcile backlog
  • DB lock wait time

Alerts

  • UNKNOWN ratio > 2%
  • Spike in Webhook processing errors
  • Reconciliation backlog growing over time

13. SLOs

Use explicit SLOs so failure handling and scaling decisions stay measurable:

  • Payment success availability: >= 99.9% successful intent processing (excluding external issuer declines).
  • Webhook acknowledgment latency: p95 < 2s, p99 < 5s.
  • UNKNOWN ratio: < 2% over rolling 15 minutes.
  • Reconciliation completion window: intents in UNKNOWN/PAYMENT_PENDING are revisited within configured backoff + TTL policy.
  • Stats freshness: dashboard totals converge within an agreed lag window (e.g. <= 60s under normal load).

14. Scaling Strategy

  • API tier: horizontally scale Donation API instances.
  • Payment execution tier: delegate charge-path scaling to Stripe.
  • Queue tier: shard/route RabbitMQ by intentId (routing key or hash).
  • Consumer tier: auto-scale workers from queue depth + processing lag.
  • Read tier: front high-QPS stats endpoints with Redis.
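
For the queue tier, "route by intentId" can be as simple as a stable hash modulo the number of queues. A minimal sketch, assuming a fixed shard count and a hypothetical `QueueRouter` helper; how the shard index maps to a RabbitMQ routing key is left to the deployment.

```java
class QueueRouter {
    // Maps an intentId to a shard index in [0, shardCount).
    // String.hashCode is specified by the language, so the result is stable
    // across JVMs; floorMod keeps it non-negative for negative hash codes.
    static int shardFor(String intentId, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        return Math.floorMod(intentId.hashCode(), shardCount);
    }
}
```

Because all events for one intent land on the same shard, they are consumed in order by a single consumer. The trade-off is that changing the shard count remaps most intents, so it is best fixed before the event starts.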

15. Summary

This design:

  • Uses Stripe PaymentIntent in the intended, safe way to avoid duplicate charges.
  • Separates business intents from payment execution; internal state machine is the anchor.
  • Treats Stripe as the authoritative source of payment state, with webhooks as the primary signal and reconciliation as the fallback.
  • Uses a Reconciliation Worker to handle UNKNOWN states and lost webhooks.
  • Decouples stats and side effects (emails, notifications) via events.
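  • Maintains idempotency at every boundary: request_id on intent creation, Stripe idempotency keys on charges, stripe_event_id on webhooks, and intent_id in consumers.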

16. Interview-Oriented Discussion Questions

If you are using this document to prepare for system design interviews, here are some follow-up questions to challenge your understanding:

  1. Hot path vs. cold path

    • Where is the true hot path in this system?
    • If you had to cut 50% of the complexity, what could you remove from the cold path without violating the SLOs?
  2. Failure modes and trade-offs

    • What happens if Stripe Webhooks are delayed by 30 minutes during peak traffic?
    • How would you surface “eventual success” vs “final failure” back to the user and to internal operations?
  3. Backpressure and rate limiting

    • How would you protect the Donation API from sudden traffic spikes (e.g., a celebrity tweet) without losing valid donations?
    • Where would you place rate limiting and circuit breakers (client, API gateway, Donation Service, Stripe)?
  4. Multi-tenant and per-charity isolation

    • How would you isolate noisy or misbehaving charities so that they do not affect others?
    • Would you shard donation_intents / charity_stats by charity, region, or something else?
  5. Schema and evolution

    • If you later add recurring donations or refunds, how would you extend the current data model and state machine?
    • Which parts of the design are most fragile under such changes?
  6. Cost and observability

    • Which components are likely to dominate your cloud bill (Stripe fees, DB, RabbitMQ, Redis, compute)?
    • What metrics and dashboards would you build first to catch regressions in payment success rate?

Try to answer these questions using the flows, state machine, and data model in this article. In a real interview, you can treat this system as a “pattern” and adapt it to any payment-heavy or intent-based workflow.