Rate Limiter Design

An Experienced Engineer’s Walkthrough for Backend Engineers

When you first hear “rate limiter,” it’s easy to think: “So we count requests and reject when we hit a number. How hard can that be?” In principle, that’s right. In practice, the choice of algorithm affects whether you get “double traffic at the boundary,” whether you allow bursts or not, and how fair the system feels. And the placement of the limiter (gateway vs app, per user vs per IP) and implementation details (e.g. atomicity in a distributed setup) often bite newcomers. This article is written as if I’m walking you through it: we’ll see why rate limiting is non-negotiable for public or multi-tenant systems, where to put it, which algorithm fits which case, and what to get right in implementation. No prior experience with rate limiters is assumed.


Lesson 1: Why “No Rate Limit” Is Not an Option

What Happens Without a Limiter

Imagine your API serves many clients. One of them has a bug (e.g. a retry loop with no backoff) or is malicious. That single client can send a huge number of requests per second. Without a rate limiter:

  • Other clients get starved. Your server (or a critical downstream like the DB or a payment provider) has finite capacity. One client can consume most of it; others see timeouts or errors. So you need a way to cap how much one actor can use.
  • A sudden spike can take the system down. Even if no one is malicious, a traffic spike (e.g. a viral link, or a batch job misconfigured) can push load beyond what the system can handle. Rate limiting is one of the main levers to keep load within capacity.
  • You have no way to enforce quotas. If you offer different tiers (e.g. free vs paid), or if you need to cap usage per customer for cost control, you need to measure and limit per key (user, API key, etc.). A rate limiter is the component that enforces “at most X requests per minute for this key.”

So as an experienced engineer, I treat rate limiting as mandatory for any API that is public or shared across many tenants. The question is never “should we have it?” but “where do we put it, what key do we use, and which algorithm do we use?”

What a Rate Limiter Actually Does

A rate limiter answers one question: “Is this request allowed, given the history for this key?” The “key” might be:

  • User ID (if the user is authenticated) — “this user can make at most 100 requests per minute.”
  • IP address — often used when the user is not logged in; be aware that many users can share an IP (e.g. NAT, office) so IP-based limiting can be coarse.
  • API key — for programmatic access; each key gets a limit.
  • Combination — e.g. “per user, but also per IP for unauthenticated traffic.”

If the request is within the limit, the limiter lets it through (and typically updates the count or token state). If over the limit, the limiter rejects the request (e.g. HTTP 429) so that the protected system (your app, DB, or payment provider) never sees it. That gives you:

  • Stability: Load stays within what the system can handle.
  • Fairness: No single client can monopolize capacity.
  • Cost and quota control: You can enforce per-customer or per-endpoint limits.

Lesson 1 Takeaway

Rate limiting is not optional for public or multi-tenant systems. It protects stability and fairness and gives you quota levers. The real design work is: where to limit, what key to use, and which algorithm to use — and then implementing it correctly (e.g. atomically in a distributed setup).


Lesson 2: Where the Limiter Sits — Gateway vs App, and What Key to Use

Before we dive into algorithms, it helps to see where the limiter lives in the request path and what we’re counting.

Request Path — Limiter in Front of the Work

Conceptually, every request that should be limited goes through the limiter first. The limiter decides allow or deny; only allowed requests reach your application (and thus the DB, caches, or external services). So the limiter is in front of the work you want to protect.

As a newcomer, a common mistake is to put the limit check inside the application after some expensive work (e.g. after parsing the body or after a DB call). Then the expensive work still runs for every request; the limiter only saves you from “too many responses,” not from “too much work.” So the limiter should run as early as possible — ideally at the gateway or at the very entry of your app — so that over-limit requests are cheap to reject (e.g. one in-memory or Redis check).

Gateway vs In-App

  • Gateway (or load balancer): The limiter runs at the edge, before requests hit your app servers. It protects all backends and keeps over-limit traffic off your app entirely. This is great for “total QPS” or “per IP” limits. The downside is that the gateway often doesn’t know user ID (unless you pass a token and the gateway can resolve it), so fine-grained “per user” limits are sometimes done in the app.
  • In-app: The limiter runs inside your service. You have full context (user ID, endpoint, etc.), so you can limit “per user,” “per user per endpoint,” or “per API key.” The downside is that the request has already reached your app (and possibly passed through a load balancer), so you’re spending a bit more per request before rejecting. In practice, many systems do both: gateway for coarse global or per-IP limits, app for per-user or per-endpoint limits.

Layered Limits

A common pattern is to layer limits:

  • Global: Total QPS (or RPS) to the service. Prevents the whole system from being overwhelmed.
  • Per user (or per API key): So one user cannot consume the whole capacity.
  • Per endpoint (optional): So one expensive endpoint cannot starve others.

Each layer can use a different limit and even a different algorithm. For example: global fixed window, per user token bucket. As a beginner, start with one layer (e.g. per user) and add layers as you see real traffic and abuse patterns.

Architecture Sketch

  Client → Gateway (global / per-IP limits) → App (per-user / per-endpoint limits) → DB / external services

Over-limit requests are rejected at the earliest layer that trips, so they never reach the work being protected.

Lesson 2 Takeaway

The limiter sits in front of the work you want to protect, so over-limit requests are rejected cheaply. You can run it at the gateway (coarse, protects everything) or in-app (fine-grained by user/endpoint), or both. Choose the key (user_id, IP, API key) and layers (global, per user, per API) based on what you need to protect and what quotas you want to enforce.


Lesson 3: Four Common Algorithms — and the Traps Beginners Hit

“Count requests and reject when over limit” can be implemented in several ways. The difference shows up at window boundaries (e.g. does the limit “reset” and allow a burst?) and under bursts (e.g. do we allow a short burst or force smooth traffic?). Choosing the wrong algorithm can make your limit feel too loose (e.g. 2× traffic at boundaries) or too strict (e.g. no burst when the product expects it).

Fixed Window

Idea: You have a window of fixed length (e.g. 1 second, or 1 minute). You count requests in that window. When the window ends, you reset the counter and start a new window.

Pros: Very simple to implement and explain. One counter per key; increment on request; if count > limit, reject; when the clock crosses the window boundary, reset.

Cons: Boundary burst. Suppose the limit is 100 requests per minute, and the window is “minute 1” and “minute 2.” A client can send 100 requests in the last second of minute 1 and 100 requests in the first second of minute 2. So in 2 seconds they sent 200 requests — twice the intended rate. For strict “at most 100 per minute” semantics, fixed window is wrong at the boundary. For loose limits (e.g. “roughly 1000 per minute”) it’s often acceptable.

When to use: When limits are loose or when a 2× burst at boundaries is acceptable (e.g. internal tools, or high limits).

Sliding Window

Idea: Instead of “this minute” or “this second,” you count requests in a rolling window: “the last 60 seconds” (or last N seconds) from now. So the window moves with time; there’s no sudden “reset” that allows a burst.

Pros: Smoother and more accurate. No 2× burst at boundaries. Better matches the intuition “at most 100 requests in any 60-second period.”

Cons: Slightly more complex. You need to store timestamps of recent requests (or use a structure that represents “count in last N seconds”), and possibly evict old entries. Or you use an approximation (e.g. weighted average of previous window and current window) to avoid storing every timestamp.

When to use: When you need precise limiting and boundary bursts are not acceptable (e.g. public API with strict quotas).
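A sketch of the "store timestamps" variant (the sliding-window log). It is exact but keeps one entry per recent request, which is the memory cost the approximation variant avoids; names are illustrative:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Sliding-window log: keep timestamps of recent requests, evict old ones."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = {}  # key -> deque of request timestamps

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        q = self.log.setdefault(key, deque())
        while q and q[0] <= now - self.window:  # evict entries outside the rolling window
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

Because the window rolls with `now`, the fixed-window boundary trick fails here: requests at t=59 and t=60 fall in the same 60-second window and count together.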

Token Bucket

Idea: You have a bucket that holds tokens. The bucket has a capacity (e.g. 100 tokens). Tokens are refilled at a fixed rate (e.g. 10 per second). Each request consumes one token. If the bucket is empty, the request is rejected. So when the bucket is full, the client can send a burst (e.g. 100 requests at once); then the bucket drains and refills at the steady rate.

Pros: Allows bursts (good for UX: user can do a few quick actions), while still capping the average rate. Very common for API rate limits (e.g. “100 requests per minute” with a burst of 10).

Cons: You must track tokens and last refill time and refill correctly (e.g. new_tokens = min(capacity, current_tokens + (now - last_refill) * rate)). In a distributed setup (e.g. Redis), you need atomic “refill then deduct” (e.g. Lua script); otherwise two requests can both see “1 token left” and both pass.

When to use: When you want to allow short bursts but cap average rate (e.g. API QPS limits, user actions per minute).
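The refill formula from the Cons above can be turned into a small single-process sketch (illustrative names; a distributed version must do the refill-then-deduct atomically, e.g. in a Redis Lua script, as Lesson 4 discusses):

```python
import time

class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at `rate` tokens/second."""
    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)  # start full: permits an initial burst
        self.last_refill = None        # set on first request

    def allow(self, now=None):
        now = time.time() if now is None else now
        if self.last_refill is not None:
            # new_tokens = min(capacity, current_tokens + elapsed * rate)
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A full bucket of capacity N lets N requests through at once (the burst); after that, requests are admitted only as fast as tokens refill, which is the capped average rate.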

Leaky Bucket

Idea: Requests “enter” a bucket; the bucket drains at a fixed rate (like a leak). If requests arrive faster than the drain rate, they queue (or are rejected, depending on variant). So output is smooth: at most N requests per second leave the bucket, no burst.

Pros: Smooth output; good when the downstream (e.g. a payment provider or a queue) cannot handle bursts.

Cons: No burst allowance; if the user sends 10 requests in 1 second and the drain is 1 per second, they may see delays or rejections. Can feel “sluggish” for bursty but legitimate traffic.

When to use: When you need strictly smooth output (e.g. rate-limiting calls to an external API that dislikes bursts, or evening out traffic into a queue).
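A sketch of the rejecting ("meter") variant: the bucket's water level rises by one per request and drains at a fixed rate, and a request that would overflow the bucket is rejected. The queueing variant would instead delay the request until the level drops; names here are illustrative:

```python
class LeakyBucket:
    """Leaky bucket (reject variant): level drains at `rate`/sec; overflow is rejected."""
    def __init__(self, capacity, rate):
        self.capacity = capacity  # how much can be "in flight" before rejecting
        self.rate = rate          # drain rate, requests per second
        self.level = 0.0
        self.last = None          # set on first request

    def allow(self, now):
        if self.last is not None:
            # drain since the last request, never below empty
            self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1 > self.capacity:
            return False  # bucket would overflow
        self.level += 1
        return True
```

With the queueing variant the same drain rate bounds the *output* rate, which is what smooths traffic into a downstream that dislikes bursts.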

Comparison Table

Algorithm        Pros                        Cons                                Best for
Fixed window     Simple                      2× at boundary                      Loose limits
Sliding window   Smooth, no boundary burst   More complex                        Precise limiting
Token bucket     Allows burst                Token bookkeeping, needs atomicity  API, QPS limits
Leaky bucket     Smooth output               No burst                            Flow control, queues

Lesson 3 Takeaway

  • Simple counting, loose limits → fixed window is fine.
  • Strict “at most N in any window” → sliding window.
  • Allow bursts, cap average → token bucket (and implement refill + deduct atomically in distributed case).
  • Strictly smooth output → leaky bucket.

As a newcomer, the most common mistake is using fixed window when you thought you had “100 per minute” but actually get 200 in 2 seconds at the boundary. If your product or SLA cares about that, switch to sliding window or token bucket.


Lesson 4: Implementation — Atomicity, 429, and Monitoring

Why Atomicity Matters (Distributed Limiter)

When the limiter state lives in Redis (or another shared store), multiple app instances (or multiple threads) can handle requests for the same key at the same time. If you do “read count, if < limit then increment, return allow,” two requests can both read “99” and both increment to 100 and both allow — so you get 101 requests instead of 100. So the “check and update” must be atomic. In Redis, that usually means:

  • A Lua script that runs on the server: refill tokens (or update window), deduct one (or add one to count), return allow/deny. The whole script runs atomically.
  • Or a transaction (MULTI/EXEC) if the operations can be expressed that way.

Without atomicity, your limit is wrong under concurrency. As a beginner, this is easy to miss when testing with a single client; under load with many concurrent requests, the limit can be exceeded.
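You can see the guarantee in a single-process stand-in: a lock makes "check count, then increment" one indivisible step, which is exactly what a server-side Lua script gives you in Redis. This is an in-process analogue for illustration, not a distributed implementation:

```python
import threading

class AtomicCounterLimiter:
    """Lock makes check-and-update indivisible, so two concurrent requests
    can never both see "one slot left" and both pass."""
    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self.lock = threading.Lock()

    def allow(self):
        with self.lock:  # in Redis, an EVAL'd Lua script plays this role
            if self.count >= self.limit:
                return False
            self.count += 1
            return True

def hammer(limiter, attempts, allowed):
    for _ in range(attempts):
        if limiter.allow():
            allowed.append(1)

limiter = AtomicCounterLimiter(limit=100)
allowed = []
threads = [threading.Thread(target=hammer, args=(limiter, 50, allowed))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# 400 concurrent attempts; exactly 100 allowed, never 101
```

Replace the lock body with the naive "read, compare, then write back" against a shared store and the invariant breaks under concurrency, which is the 101-requests bug described above.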

What to Return When Over Limit

HTTP 429 Too Many Requests is the standard status. Include Retry-After (in seconds, or as an HTTP-date) so well-behaved clients know when they can retry:

HTTP/1.1 429 Too Many Requests
Retry-After: 60

Optionally, in the response body, you can return a JSON with retry_after_seconds or a message. This improves UX and reduces “client retries immediately and hammers your limiter again.”
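A framework-agnostic sketch of building that response as plain data (the field names in the JSON body are illustrative, not a standard; adapt to Flask, FastAPI, or whatever you use):

```python
import json

def too_many_requests(retry_after_seconds):
    """Build a 429 response as (status, headers, body)."""
    body = json.dumps({
        "error": "rate_limited",
        "retry_after_seconds": retry_after_seconds,
        "message": "Too many requests; retry after the given delay.",
    })
    headers = {
        "Retry-After": str(retry_after_seconds),  # standard header, in seconds
        "Content-Type": "application/json",
    }
    return 429, headers, body
```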

Layered Limiting in Practice

Implement layers in order: e.g. first check global limit, then per user, then per endpoint. If any layer rejects, return 429. This way you protect the system at multiple granularities; tuning (e.g. “global 10k RPS, per user 100 RPS”) depends on your capacity and product needs.
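The ordering logic above can be sketched as a simple loop over (name, check) pairs, coarse to fine; the names and shape are illustrative:

```python
def check_layers(layers):
    """Run limit checks in order; first rejection wins.
    `layers` is a list of (name, allow_fn) pairs, e.g. global, per-user, per-endpoint."""
    for name, allow in layers:
        if not allow():
            return False, name  # report which layer tripped (useful for monitoring)
    return True, None

# example: global limit passes, per-user limit rejects
ok, tripped = check_layers([
    ("global", lambda: True),
    ("per_user", lambda: False),
])
```

One subtlety: a rejection at a fine-grained layer has already consumed a slot in the coarser layers it passed. Many systems accept that small inaccuracy; the alternative is to refund the coarser layers on rejection.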

Monitoring

  • Count how often limits are hit (e.g. 429s per key or per endpoint). If one user or one endpoint dominates, you may need to adjust limits or investigate abuse.
  • Which keys hit the limit (e.g. top user_ids or API keys). Useful for debugging (“why is this user seeing 429?”) and for tuning quotas.
  • Use this data to tune limits and to detect misbehaving clients or bugs (e.g. retry loops).

Lesson 4 Takeaway

Atomicity (e.g. Lua in Redis) is required for correct distributed rate limiting; otherwise concurrent requests can exceed the limit. Return 429 and Retry-After when over limit so clients can back off. Monitor 429 rate and which keys hit the limit so you can tune and debug.


Key Rules (Summary)

  • Atomicity: Use Lua or Redis transactions for distributed rate limiting so “check and update” is atomic.
  • Response: Return 429 and Retry-After when over limit.
  • Monitor: Track how often limits are hit and which keys (users/APIs) are limited; use this to tune limits and debug.

What's Next

For distributed implementations, see Redis Rate Limiting (High Concurrency Toolkit); for centralized rate limiting at the edge, see API Gateway.