Notification System Basics
An Experienced Engineer’s Walkthrough for Backend Engineers
When you add “notify the user” to your product — order paid, someone commented, alert triggered — the first thing many people build is: right after the order is paid (or the comment is saved), call an email API or push API and send the notification. That works in a demo. In production, that choice often leads to slow requests, failed orders when the notification provider is down, and a mess of retries and rate limits mixed into your core business logic. This article is written as if I’m sitting next to you: we’ll go through why to decouple notifications from the main flow, how an event-driven notification pipeline works step by step, how routing and channels fit in, and how to handle failures and idempotency so the system is robust and understandable. No prior experience with message queues or multi-channel notifications is assumed.
Lesson 1: Why “Send It in the Same Request” Breaks Down
The Naive Approach Most People Try First
In the naive design, when the user clicks “Pay” and the payment succeeds, your code does something like:
- Process payment.
- Update order state to “paid.”
- Call the email service (or SMS, or push) to send “Your order is paid.”
- Return the response to the user.
So the same request that completes the order also triggers the notification. The user (or the client) waits until the notification call in the third step finishes before getting a response. That seems simple. So what goes wrong?
What Actually Happens in Production
- The notification channel is slow or down. The email provider might take 2–3 seconds to accept the message, or the SMS gateway might be timing out. Your order request then takes 2–3 seconds (or fails) even though the payment and DB update already succeeded. From the user’s point of view, “Pay” is slow or fails because of email, which has nothing to do with paying. As an experienced engineer, I never want a secondary action (notification) to block or fail the primary action (order paid).
- Coupling: If the notification call throws (e.g. gateway returns 5xx), you have to decide: do we still return “order paid” to the user? If yes, you must make sure a notification error never rolls back or fails the order transaction. If no, then the payment request can fail because email failed, which is the wrong dependency. Either way, you’re forced into awkward error handling and retries inside the main flow.
- Scaling and retries: Sending email/SMS/push is I/O-bound and often has rate limits (e.g. 100 SMS per second per account). If you mix sending into the main flow, every order request holds a connection to the notification provider. Retrying a failed send usually means retrying from the application — and you don’t want to retry the whole “pay order” request just to retry the email. So you end up with ad-hoc queues or background jobs anyway. Better to design the boundary clearly from the start.
So the core idea is: decouple “something happened” from “user was notified.” The main flow only emits an event (“order paid,” “comment added”); a notification service consumes that event and is responsible for building the message, choosing the channel, sending, and retrying. The main flow never waits for the notification to be delivered.
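To make the boundary concrete, here is a minimal sketch of a main flow that only emits an event. It assumes an in-process `queue.Queue` as a stand-in for a real broker; the names `Event`, `event_queue`, and `pay_order` are hypothetical and exist only for illustration.

```python
import queue
import uuid
from dataclasses import dataclass, field


@dataclass
class Event:
    event_type: str
    user_id: str
    payload: dict
    # A unique id per event lets consumers deduplicate later.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


# Stand-in for a real broker (Kafka, RabbitMQ, SQS); in-process only.
event_queue: "queue.Queue[Event]" = queue.Queue()


def pay_order(order_id: str, user_id: str, amount: float) -> dict:
    # Processing the payment and marking the order paid would happen here
    # (omitted in this sketch).
    # Publish a fact and return; no email/SMS/push call on this path.
    event_queue.put(
        Event("order_paid", user_id, {"order_id": order_id, "amount": amount})
    )
    return {"order_id": order_id, "status": "paid"}
```

Note that `pay_order` returns as soon as the event is enqueued. Whether the email takes 3 seconds or the SMS gateway is down is no longer this function's problem.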
Lesson 1 Takeaway
Notifications should be asynchronous and decoupled from the main business flow. Use a message queue (or event bus): main flow publishes an event and returns; a notification service consumes and sends. That way latency and failures in notification do not affect the main flow, and you can scale and retry notification independently.
Lesson 2: The End-to-End Flow — Event to Delivery
Before we talk about routing or channels in detail, it helps to walk through the pipeline once so you see who does what.
Step by Step: From “Order Paid” to “User Sees Notification”
- Main flow (e.g. order service): When the order is paid, it publishes an event to a message queue. The event might look like: { event_type: "order_paid", user_id: "123", order_id: "ord_456", amount: 99.00, ... }. The main flow does not wait for any consumer. It just publishes and returns. So the user gets “Payment successful” immediately; they don’t wait for email or push.
- Message queue: The event is stored (e.g. in Kafka, RabbitMQ, SQS). It will be delivered to one or more consumers. The queue gives you durability (if the consumer crashes, the message is still there) and backpressure (if the consumer is slow, the queue buffers).
- Notification service (consumer): It consumes the event. Its job is: figure out who to notify (from user_id or payload), what to send (template + variables), and which channel(s) to use (in-app, email, SMS, push). Then it calls the right channel adapters (e.g. SMTP client, Twilio API, FCM/APNs). For each channel, it respects rate limits and retries on failure; after max retries, it can send the message to a dead-letter queue (DLQ) for inspection.
- Channels: Each channel has different APIs and limits. Email: SMTP or an API (SendGrid, SES, etc.). SMS: a provider (Twilio, etc.) with cost and rate limits. Push: FCM (Android) and APNs (iOS). The notification service talks to these via adapters so the rest of the code sees a uniform “send to user” interface.
- Status: The service records whether each send succeeded or failed (e.g. in DB or logs). That helps with debugging (“why didn’t the user get the email?”) and analytics.
So the pipeline is: event → queue → consumer → routing + templates → adapters → channels. The main flow never sees the queue or the channels; the notification service never runs business logic (e.g. it doesn’t know how to “pay an order”), it only knows how to turn events into messages and send them.
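The consumer side of that pipeline can be sketched as follows. This is an illustration under assumptions, not a reference implementation: `ChannelAdapter`, `RecordingAdapter`, and `handle_event` are hypothetical names, the adapter is a fake that records sends instead of calling a provider, and the message text is hardcoded only to keep the sketch short.

```python
from typing import Protocol


class ChannelAdapter(Protocol):
    """Uniform 'send to user' interface that every channel implements."""

    def send(self, user_id: str, message: str) -> None: ...


class RecordingAdapter:
    """Fake adapter: records sends instead of calling SMTP, Twilio, FCM, etc."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def send(self, user_id: str, message: str) -> None:
        self.sent.append((user_id, message))


def handle_event(
    event: dict, adapters: dict[str, ChannelAdapter], channels: list[str]
) -> dict[str, str]:
    # Turn the event into a message; a real service would load a template.
    message = f"Order {event['order_id']} has been paid."
    status: dict[str, str] = {}
    for channel in channels:
        adapters[channel].send(event["user_id"], message)
        status[channel] = "sent"  # recorded per channel for debugging
    return status
```

The consumer only sees the adapter interface, so adding a new channel means writing one more adapter rather than touching the routing code.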
Event to Delivery Pipeline
Main flow
- Publish event (e.g. order_paid, user_id, order_id, payload) to message queue.
- Do not wait for consumers; return immediately.
Message queue
- Durable store (Kafka, RabbitMQ, SQS). Delivers to notification service consumer.
Notification service (consumer)
- Consume event; resolve who to notify, template, and channel(s).
- Call channel adapters (email, SMS, push); respect rate limits and retry.
- Record send status; DLQ after max retries.
Notes
- Main flow never calls notification APIs; decoupling keeps payment (or other primary action) fast and independent.
[Figure: Sequence Diagram]
[Figure: High-Level Architecture]
As a newcomer, a common mistake is to put “which channel” or “what to send” inside the main flow (e.g. “if user prefers email, call email API”). That keeps the coupling: the main flow still depends on notification logic. The right split is: main flow only publishes facts (event type + ids + payload); the notification service owns all decisions about templates, channels, and sending.
Lesson 2 Takeaway
The pipeline is event → queue → consumer → routing → adapters → channels. The main flow only publishes events; the notification service owns templates, routing, and delivery. State and failure handling live in the notification service and queue, not in the main flow.
Lesson 3: Routing — Who Gets What, and Through Which Channel
“Routing” means: for this event and this user, which channel(s) do we use (in-app, email, SMS, push), and what content do we send? A few things matter.
User Preferences
Users often have settings like “in-app only,” “no email,” “SMS only for critical.” So when we build the notification, we load the user’s preferences (from DB or cache) and filter channels: if the user disabled email, we don’t send email for this event. If “SMS for critical only,” we might send SMS only for events marked critical (e.g. password reset, payment failed) and use email or in-app for the rest. As a beginner, it’s easy to forget preferences and send to all channels; then users get duplicate or unwanted notifications and complain.
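Preference filtering might look like the sketch below. The setting values (`"all"`, `"critical_only"`, `"off"`), the channel names, the `CRITICAL_EVENTS` set, and the default of `"all"` for a missing setting are all assumptions made for this example.

```python
# Hypothetical set of events that count as critical.
CRITICAL_EVENTS = {"password_reset", "payment_failed"}

ALL_CHANNELS = ("in_app", "email", "sms", "push")


def allowed_channels(event_type: str, prefs: dict[str, str]) -> list[str]:
    """Return the channels the user's preferences allow for this event."""
    channels = []
    for channel in ALL_CHANNELS:
        # Assumption: a channel with no stored setting is fully enabled.
        setting = prefs.get(channel, "all")
        if setting == "off":
            continue
        if setting == "critical_only" and event_type not in CRITICAL_EVENTS:
            continue
        channels.append(channel)
    return channels
```

For a user with `{"email": "off", "sms": "critical_only"}`, an `order_paid` event goes to in-app and push only, while `payment_failed` also goes out by SMS.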
Channel Capability
The user might not have a push token (e.g. they haven’t opened the app on this device) or might not have a phone number on file. So we should check capability before trying to send: don’t call the SMS API if there’s no phone number; don’t call FCM if there’s no token. Otherwise you get unnecessary failures and wasted cost.
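A capability check can be a small filter that runs before any provider call. In this sketch the `Contact` record and its field names are hypothetical, and in-app is assumed to be always available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contact:
    email: Optional[str] = None
    phone: Optional[str] = None
    push_token: Optional[str] = None


def capable_channels(channels: list[str], contact: Contact) -> list[str]:
    """Drop channels the user cannot receive on (no address or token)."""
    reachable = {
        "in_app": True,  # assumption: in-app needs no extra address
        "email": bool(contact.email),
        "sms": bool(contact.phone),
        "push": bool(contact.push_token),
    }
    # Unknown channel names are treated as unreachable.
    return [c for c in channels if reachable.get(c, False)]
```

Running this filter after the preference filter means the SMS API is never called for a user without a phone number.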
Content and Template
Different channels have different constraints. Email can be long (subject + body, HTML). SMS is short (e.g. 160 characters). Push is usually a short title + body. So we typically have templates per (event_type, channel) and substitute variables (user name, order id, amount, etc.). For example: event order_paid → email template “Hi {{name}}, your order {{order_id}} for {{amount}} has been paid.” The notification service loads the template, fills in the variables from the event payload, and sends. As a newcomer, keep templates in one place (e.g. DB or config) and avoid hardcoding message strings in code; that makes it easier to change copy and support i18n later.
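As an illustration, templates keyed by (event_type, channel) with `{{variable}}` substitution can be sketched in a few lines. The `TEMPLATES` dictionary stands in for a DB or config store, the SMS copy is invented for the example, and a production system would more likely use a real template engine.

```python
import re

# Stand-in for templates stored in a DB or config file.
TEMPLATES = {
    ("order_paid", "email"): (
        "Hi {{name}}, your order {{order_id}} for {{amount}} has been paid."
    ),
    ("order_paid", "sms"): "Order {{order_id}} paid.",
}

_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(event_type: str, channel: str, variables: dict) -> str:
    """Fill a template; raises KeyError on a missing template or variable."""
    template = TEMPLATES[(event_type, channel)]
    return _PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), template)
```

Failing loudly on a missing variable is a deliberate choice in this sketch: a message that reads “Hi {{name}}” should never reach a user.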
Idempotency — Why It Matters
The message queue might deliver the same event more than once (at-least-once delivery). If we don’t deduplicate, the user can get two emails for the same order. So we need idempotency: for a given event (e.g. identified by event_id or order_id + event_type), we send at most once per channel. We can do that by: (1) checking “have we already sent for this event_id?” before sending, or (2) using an idempotency key when calling the channel API (if the provider supports it). As an experienced engineer, I always design for “consumer may run twice”; idempotency is not optional.
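Option (1) can be sketched as a “claim before send” check keyed by (event_id, channel). The `SentLog` class and `send_once` function are hypothetical, and the in-memory set is only a stand-in: with several consumers, the claim would need to be an atomic operation in shared storage (for example, an insert guarded by a unique constraint).

```python
from typing import Callable


class SentLog:
    """In-memory record of (event_id, channel) pairs already handled."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()

    def claim(self, event_id: str, channel: str) -> bool:
        key = (event_id, channel)
        if key in self._seen:
            return False  # duplicate delivery of the same event
        self._seen.add(key)
        return True

    def release(self, event_id: str, channel: str) -> None:
        self._seen.discard((event_id, channel))


def send_once(
    log: SentLog, event_id: str, channel: str, send: Callable[[], None]
) -> bool:
    """Send unless this (event, channel) was already sent. True if sent now."""
    if not log.claim(event_id, channel):
        return False
    try:
        send()
    except Exception:
        # Release the claim so a later retry is not mistaken for a duplicate.
        log.release(event_id, channel)
        raise
    return True
```

If the queue redelivers the same event, the second call to `send_once` returns `False` and the user gets one email, not two.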
Lesson 3 Takeaway
Routing = user preferences + channel capability + content type. Templates = per event type and channel, with variable substitution. Idempotency = deduplicate by event_id (or equivalent) so duplicate consumption does not cause duplicate notifications.
Lesson 4: Failure Handling — Retry, DLQ, and Degradation
Channels fail: the email provider times out, the SMS gateway returns 503, or the push service is rate-limiting you. So the notification service must retry, dead-letter after max retries, and degrade so one channel’s failure doesn’t block others.
Retry with Backoff
For transient failures (timeout, 5xx), we retry a few times with backoff (e.g. 1s, 2s, 4s). That avoids hammering the provider and often succeeds on the next try. We don’t retry forever: after N attempts (e.g. 3), we give up for this message and move it to a dead-letter queue (DLQ). Someone (or a process) can later inspect the DLQ and fix the cause (e.g. wrong API key, invalid phone number) and replay or drop.
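The retry loop described above can be sketched like this. `TransientError` and `send_with_retry` are hypothetical names, the DLQ is a plain list standing in for a real dead-letter queue, and `sleep` is injectable so the function can be exercised without waiting.

```python
import time
from typing import Callable


class TransientError(Exception):
    """A failure worth retrying, e.g. a timeout or a 5xx response."""


def send_with_retry(
    send: Callable[[str], None],
    message: str,
    dlq: list,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send; back off exponentially; dead-letter after max_attempts."""
    for attempt in range(max_attempts):
        try:
            send(message)
            return True
        except TransientError:
            # Wait base_delay, 2*base_delay, ... but not after the last try.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    dlq.append(message)  # give up; leave the message for inspection or replay
    return False
```

With the defaults, a message gets three attempts separated by waits of 1s and 2s. Only `TransientError` is retried here; a permanent failure such as an invalid phone number propagates immediately, since retrying it would not help.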
Rate Limits
SMS and push providers often have rate limits (e.g. 100 per second per account, or per user). So we rate-limit our own sends: e.g. per user per channel “at most 1 SMS per minute” for non-critical, or use a global limiter for the provider. If we exceed the provider’s limit, we get 429 or throttling; handling that with retry-after or a queue per channel is common. As a newcomer, it’s easy to ignore rate limits and then see a lot of 429s and blocked sends; build rate limiting into the notification service from the start.
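One common way to limit our own sends is a token bucket. That is a technique chosen for this sketch, not something the text above prescribes. The `TokenBucket` class is hypothetical, and the clock is injectable so the behavior can be checked deterministically.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `capacity` sends in a burst, refilled at `rate` per second."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should delay or requeue the send
```

A single bucket can guard a provider-wide limit, and a dictionary of buckets keyed by (user_id, channel) can enforce per-user limits. When `try_acquire` returns `False`, the message should be delayed or requeued rather than dropped.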
Degradation
If one channel is down (e.g. SMS gateway is failing), we should still send via other channels (e.g. email, in-app). So the notification service should not fail the whole event when one channel fails; it should try each channel and record success/failure per channel. Optionally, we can have a fallback rule (e.g. “if SMS fails, send email instead for critical events”).
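Per-channel isolation with an optional fallback can be sketched as below. The `send_all` function, the shape of the `fallback` mapping (failed channel to substitute channel), and the status strings are assumptions for illustration.

```python
from typing import Callable, Optional


def send_all(
    channels: list[str],
    senders: dict[str, Callable[[str], None]],
    message: str,
    fallback: Optional[dict[str, str]] = None,
) -> dict[str, str]:
    """Try every channel independently and record a status for each."""
    status: dict[str, str] = {}

    def attempt(channel: str) -> None:
        try:
            senders[channel](message)
            status[channel] = "sent"
        except Exception as exc:
            # One channel failing must not stop the others.
            status[channel] = f"failed: {exc}"

    for channel in channels:
        attempt(channel)

    # Fallback rule: if a channel failed, try its substitute once,
    # unless the substitute was already attempted.
    for failed, substitute in (fallback or {}).items():
        if status.get(failed, "").startswith("failed") and substitute not in status:
            attempt(substitute)
    return status
```

The returned dictionary is the per-channel record: it shows that SMS failed and that email went out in its place, which is what you need when answering “why didn’t the user get the SMS?”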
Summary Table
| Point | Description |
|---|---|
| Async | Events to MQ; do not block main flow |
| Template | Pick by event type and channel; variable substitution |
| Rate limit | SMS and push have cost and frequency limits; limit by user and channel |
| Retry | Retry on channel failure; send to DLQ after max attempts |
| Status | Track delivered / failed for debugging and analytics |
Lesson 4 Takeaway
Retry with backoff for transient failures; DLQ after max retries. Rate limit per channel so you don’t hit provider limits. Degrade so one channel’s failure doesn’t block others; continue with other channels and record status per channel.
Key Rules (Summary)
- Decouple main flow from notifications; use MQ; avoid notification failure affecting main business.
- Rate limit and degrade by channel; when one channel fails, others continue.
- Be idempotent: deduplicate by event_id, or use an idempotent send, so duplicate events do not cause duplicate notifications.
What's Next
See MQ Basics, Idempotency, Rate Limiting. See Retry/DLQ for failure handling.