How to Rate-Limit an API

Design API rate limiting that protects the system without breaking legitimate use — algorithms, scopes, response headers, and per-tier policies.

Category Technical
Read Time 8 min read
Updated May 2026
Steps Guide

This guide is for developers and technical leads building an API that needs to defend itself against abuse, accidental overuse, and the noisy client that consumes all the capacity. By the end you will know which rate limiting algorithm to pick, how to scope limits to make them useful, what headers to return so well-behaved clients can self-regulate, and how to test the limits before users hit them in production.

Who This Guide Is For

Developers building public or partner-facing APIs, internal APIs that need protection from misbehaving clients, or any service where one consumer can degrade the experience for everyone else. The patterns scale from small private APIs to large public ones; the design decisions are similar at any scale.

Before You Start

You should know what you are protecting against. Three different threats produce three different rate limit designs: defending against malicious abuse (DDoS-adjacent traffic, credential stuffing), defending against accidental overuse (a buggy client in a tight loop), and shaping traffic to match business tiers (free tier gets 100 requests an hour, paid gets 10,000). The technical mechanism is similar; the limits and policies are different.

You should also know your traffic shape: the peak hour, the typical request volume per client, and the worst-case behaviour you have already seen. Designing limits without this data produces limits that are either too tight (cutting off legitimate use) or too loose (failing to defend).

Pick the Right Algorithm

Three rate limiting algorithms cover almost every case. The differences matter.

  • Fixed window: count requests in a fixed time window (per hour, per minute). Simple, but allows bursts at window boundaries: a client can spend a full hour’s quota in the last second of one hour and another full quota in the first second of the next, which is twice the intended limit within two seconds.
  • Sliding window: count requests in a rolling window. More even rate limiting, slightly more expensive to compute.
  • Token bucket: clients have a “bucket” that refills at a fixed rate. They consume tokens per request. Allows for short bursts (up to the bucket size) while limiting sustained rate. Best for APIs where occasional bursts are legitimate.

For most APIs, sliding window or token bucket is the right choice. Fixed window is the easiest to implement but produces predictable abuse patterns where attackers exploit the window boundary. Token bucket is the most user-friendly because legitimate users with occasional spikes are not punished.
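
The boundary problem is easier to see in a small simulation. The sketch below is an in-memory illustration rather than production code; the class names `FixedWindowLimiter` and `SlidingWindowLimiter`, the explicit `now` argument, and the 100-per-minute limit are all assumptions made for the example.

```python
from collections import defaultdict, deque


class FixedWindowLimiter:
    """Counts requests per client in aligned windows of `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (client, window index) -> request count

    def allow(self, client, now):
        bucket = (client, int(now // self.window))  # which aligned window `now` falls in
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True


class SlidingWindowLimiter:
    """Keeps timestamps of accepted requests from the last `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.log = defaultdict(deque)  # client -> timestamps of accepted requests

    def allow(self, client, now):
        stamps = self.log[client]
        while stamps and stamps[0] <= now - self.window:
            stamps.popleft()  # drop requests that have left the rolling window
        if len(stamps) >= self.limit:
            return False
        stamps.append(now)
        return True


def burst_across_boundary(limiter):
    """Send 100 requests at t=59.9s and 100 more at t=60.1s; count how many pass."""
    times = [59.9] * 100 + [60.1] * 100
    return sum(limiter.allow("client-a", t) for t in times)


fixed_accepted = burst_across_boundary(FixedWindowLimiter(limit=100, window=60))
sliding_accepted = burst_across_boundary(SlidingWindowLimiter(limit=100, window=60))
print(fixed_accepted, sliding_accepted)  # 200 100
```

The fixed window lets through double the limit in a fifth of a second because the two bursts land in different windows. The sliding window holds the client to 100. The timestamp log used here costs memory per request; a real deployment would more likely use a counter-based approximation.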

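A token bucket in Redis is compact: store the bucket as a record per client with two fields, the current token count and the last refill timestamp. On each request, compute how much time has passed since the last refill, add tokens at the refill rate up to the bucket size, then deduct one token if at least one is available and return whether the request is allowed. The arithmetic is small. Individual Redis commands are atomic, but this read-compute-write sequence is not atomic on its own, so run it as a single Lua script (or inside a MULTI/EXEC transaction guarded by WATCH) to stop concurrent requests from racing. Done that way, the result is fast and fair.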
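
The refill arithmetic described above can be sketched in a few lines. This is a single-process, in-memory version meant only to show the logic; the names `Bucket` and `TokenBucketLimiter` and the injectable `clock` are assumptions for the example, and a shared store would hold the same two fields per client.

```python
import time
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float       # current token count
    last_refill: float  # time of the last refill, in seconds


class TokenBucketLimiter:
    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity        # bucket size, i.e. the largest allowed burst
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock              # injectable so the example is deterministic
        self.buckets = {}

    def allow(self, client):
        now = self.clock()
        bucket = self.buckets.setdefault(client, Bucket(self.capacity, now))
        # Refill for the time elapsed since the last request, capped at capacity.
        elapsed = now - bucket.last_refill
        bucket.tokens = min(self.capacity, bucket.tokens + elapsed * self.refill_rate)
        bucket.last_refill = now
        if bucket.tokens < 1:
            return False  # nothing to deduct; the request is rejected
        bucket.tokens -= 1
        return True


fake_now = [0.0]
limiter = TokenBucketLimiter(capacity=5, refill_rate=1.0, clock=lambda: fake_now[0])
burst = [limiter.allow("key-1") for _ in range(6)]       # five pass, the sixth is rejected
fake_now[0] = 2.0                                        # two seconds later, two tokens are back
after_wait = [limiter.allow("key-1") for _ in range(3)]  # two pass, the third is rejected
```

The burst of five succeeds immediately, which is the behaviour that makes token bucket forgiving for clients with occasional spikes, while the sustained rate stays at one request per second.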

Scope the Limits Carefully

The unit you rate-limit on decides what kind of behaviour you allow and disallow. Common scopes:

  • Per IP address: the simplest. Works for unauthenticated endpoints and protects against the obvious abuse. Falls down when many legitimate users share an IP (an office, a corporate VPN, a mobile network NAT).
  • Per API key or user: more accurate for authenticated APIs. The client identifies themselves, and the limits apply to that identity regardless of where the requests come from.
  • Per endpoint: limits differ by what the endpoint does. The endpoint that creates new accounts has a much lower limit than the one that returns a cached list. Allows tight protection on expensive operations without restricting normal use.
  • Compound: combinations of the above. “100 requests per minute per API key, 1000 requests per minute per IP, 10 account-creation requests per hour per IP.”

The right scope depends on the endpoint and the risk. A login endpoint should be rate-limited per IP and per username: the per-IP limit defends against one source trying credentials across many accounts (credential stuffing), and the per-username limit defends against password guessing on a single account, even when the attempts come from many IPs. A read-only data endpoint can be rate-limited per API key and not worry about IP at all.

A concrete example. A client API we built had three tiers of limits. Public endpoints: 60 requests per minute per IP. Authenticated endpoints: 600 requests per minute per API key. The expensive search endpoint: 30 requests per minute per API key, plus a global cap of 1000 per minute across all clients to protect the database. The three layers each defend against a different threat, and a well-behaved client never sees any of them.
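
One way to express layered limits like these is as a list of rules, where a request must pass every rule that applies to it. The sketch below uses the numbers from the example; the rule names, the `/search` path, and the assumption that search requests count only against the search rules are illustrative choices, not a description of the original system.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    limit: int     # maximum requests per window
    window: float  # window length in seconds


PUBLIC_PER_IP = Rule("public:ip", 60, 60)
AUTH_PER_KEY = Rule("auth:key", 600, 60)
SEARCH_PER_KEY = Rule("search:key", 30, 60)
SEARCH_GLOBAL = Rule("search:global", 1000, 60)

_log = defaultdict(deque)  # (rule name, scope value) -> accepted timestamps


def rules_for(path: str, api_key: Optional[str], ip: str):
    """Return the (rule, scope value) pairs that apply to this request."""
    if api_key is None:
        return [(PUBLIC_PER_IP, ip)]
    if path == "/search":  # hypothetical path for the expensive endpoint
        return [(SEARCH_PER_KEY, api_key), (SEARCH_GLOBAL, "*")]
    return [(AUTH_PER_KEY, api_key)]


def check(path: str, api_key: Optional[str], ip: str, now: float) -> Optional[str]:
    """Return the name of the first exhausted rule, or None if the request is allowed."""
    applicable = rules_for(path, api_key, ip)
    for rule, scope in applicable:
        stamps = _log[(rule.name, scope)]
        while stamps and stamps[0] <= now - rule.window:
            stamps.popleft()
        if len(stamps) >= rule.limit:
            return rule.name
    for rule, scope in applicable:  # record the request only when every rule passed
        _log[(rule.name, scope)].append(now)
    return None


first_30 = [check("/search", "key-a", "203.0.113.7", now=0.0) for _ in range(30)]
blocked_by = check("/search", "key-a", "203.0.113.7", now=1.0)  # "search:key"
other_key = check("/search", "key-b", "203.0.113.7", now=1.0)   # None: a separate key
```

Returning the name of the rule that rejected the request is useful in logs, because it tells you which layer is doing the work.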

Return the Right Headers

A rate-limited request returns 429 Too Many Requests. That much is standard. The headers around it are what allow good clients to behave well.

The headers that matter:

  • X-RateLimit-Limit: the total quota in the current window
  • X-RateLimit-Remaining: how many requests the client has left in the window
  • X-RateLimit-Reset: when the quota resets (timestamp or seconds remaining)
  • Retry-After: on a 429 response, how long the client should wait before retrying

Returning these on every response, not just 429s, lets the client see their consumption in real time and self-regulate. A client library that respects these headers will throttle itself, rarely hit 429s, and put no unnecessary load on your API. A client that ignores them gets exactly what they deserve.

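The naming is partly historical. Many APIs, GitHub among them, use the X-RateLimit-* family; HTTP header names are case-insensitive, so X-RateLimit-Limit and x-ratelimit-limit are the same header. An IETF draft proposes unprefixed RateLimit-* fields as a standard form. Whichever you pick, document it clearly, including whether the reset value is a timestamp or a number of seconds. The client cannot respect headers they cannot find.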
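
As a rough sketch, the headers can be built by one small function that every response passes through. The function name `rate_limit_headers` and its parameters are hypothetical, and the choice to send the reset as a Unix timestamp is an assumption; seconds remaining would work equally well if documented.

```python
import math


def rate_limit_headers(limit, remaining, reset_at, now, allowed):
    """Build rate limit headers for a response.

    `reset_at` and `now` are Unix timestamps in seconds.
    """
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_at)),  # assumption: Unix timestamp
    }
    if not allowed:
        # Retry-After accepts a whole number of seconds; round up so the
        # client does not retry a moment too early.
        headers["Retry-After"] = str(max(1, math.ceil(reset_at - now)))
    return headers


ok_headers = rate_limit_headers(600, 599, reset_at=1060.0, now=1000.0, allowed=True)
rejected_headers = rate_limit_headers(600, 0, reset_at=1060.0, now=1017.5, allowed=False)
```

The successful response carries the three X-RateLimit headers only; the rejected one adds Retry-After, here 43 seconds.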

Handle Bursts Without Punishing Them

The naive rate limit treats every request as equal weight. The reality is that some operations are expensive (a complex search, an export) and some are cheap (a status check, a cached read). A rate limit that does not distinguish punishes cheap operations for the cost of expensive ones, or fails to protect against the abuse of expensive ones.

The fix is weighted rate limiting. Each endpoint costs a different number of tokens. The status check costs 1; the search costs 10; the export costs 50. A client’s bucket holds 1000 tokens. They can make many cheap requests or a few expensive ones; the limit reflects actual resource consumption.

This requires deliberate cost assignment. Profile the endpoints, see what they actually cost in database time or compute, and weight accordingly. The assignment does not need to be perfect — order-of-magnitude correctness is enough. The point is to make expensive operations cost more, so the rate limit corresponds to actual capacity.
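
The weights above translate into a small cost table. This sketch uses the costs and bucket size from the text; the endpoint paths and the function names are hypothetical, and refill is left out to keep the arithmetic visible.

```python
ENDPOINT_COST = {"/status": 1, "/search": 10, "/export": 50}  # hypothetical paths
DEFAULT_COST = 1
BUCKET_SIZE = 1000


def spend(tokens_left, path):
    """Try to pay for one request; return (allowed, tokens remaining)."""
    cost = ENDPOINT_COST.get(path, DEFAULT_COST)
    if cost > tokens_left:
        return False, tokens_left
    return True, tokens_left - cost


def max_calls(path, tokens=BUCKET_SIZE):
    """How many back-to-back calls to `path` a full bucket can pay for."""
    calls = 0
    while True:
        allowed, tokens = spend(tokens, path)
        if not allowed:
            return calls
        calls += 1


status_calls = max_calls("/status")  # 1000
search_calls = max_calls("/search")  # 100
export_calls = max_calls("/export")  # 20
```

A full bucket pays for a thousand status checks but only twenty exports, so the same limit now tracks what the requests actually cost the system.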

Decide What to Do When the Limit Is Hit

The default behaviour is to return 429 immediately and let the client retry. This is fine for most APIs. Some specific cases benefit from different handling.

If the request is idempotent and the load is brief, queueing for a short period (a few seconds) and serving when capacity returns can produce a better client experience than failing. The client gets a successful response, and the system handles the burst.

If certain endpoints are critical for the client’s operations, you can rate-limit them more loosely or exempt them entirely. A client whose API key is rate-limited might still be allowed to call the authentication endpoint, the rate limit status endpoint, or the support contact endpoint — so they can diagnose what is happening.

If the rate limit is being hit consistently, the response can include a hint about what the client should do: “You have exceeded your tier’s limit. Reduce request frequency or upgrade your tier at /pricing.” This turns a frustrating 429 into a constructive signal.
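
The exemption and the constructive message can live in the same place. The sketch below is one possible shape; the exempt paths, the JSON field names, and the function name `limited_response` are all hypothetical.

```python
import json

# Hypothetical diagnostic endpoints that stay reachable for a rate-limited client.
EXEMPT_PATHS = {"/auth/token", "/rate-limit/status", "/support/contact"}


def limited_response(path, retry_after, tier):
    """Return None for exempt paths, else (status, headers, body) for a 429."""
    if path in EXEMPT_PATHS:
        return None  # let the request through despite the exhausted limit
    body = {
        "error": "rate_limited",
        "message": (
            f"You have exceeded the limit for the {tier} tier. "
            "Reduce request frequency or upgrade your tier at /pricing."
        ),
        "retry_after_seconds": retry_after,
    }
    headers = {"Retry-After": str(retry_after), "Content-Type": "application/json"}
    return 429, headers, json.dumps(body)


exempt = limited_response("/rate-limit/status", retry_after=30, tier="free")
status, headers, body = limited_response("/search", retry_after=30, tier="free")
```

Putting the wait time in both the header and the body is redundant by design: libraries read the header, humans debugging with curl read the body.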

Test the Limits Before Users Do

Rate limit configuration is one of those areas where the consequences of a bug appear only at scale. The fix is to test deliberately.

A real example. A team set a rate limit at “1000 requests per minute per API key”, thinking that was generous. A client’s nightly batch job, which had been running for years before the limit was added, produced 1500 requests in the first ten seconds of the night. The job started failing in production with no warning, and the client’s pipeline broke. A rate limit test against expected client behaviour would have caught this.

The pattern that works: simulate the real client behaviour against your limits before deploying. The batch job, the high-volume integration, the mobile app retry storm — run them against the rate-limited API in staging and see what happens. If anything legitimate gets rate-limited, adjust the limits or the client.
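
Before running anything in staging, a replay on paper can catch the obvious cases. The sketch below replays the batch job from the example against a 1000-per-minute limit; the sliding-window model, the evenly spaced timestamps, and the `replay` helper are assumptions made for illustration.

```python
from collections import deque


def replay(timestamps, limit, window):
    """Replay request timestamps against a sliding-window limit; return the rejected count."""
    accepted = deque()
    rejected = 0
    for t in timestamps:
        while accepted and accepted[0] <= t - window:
            accepted.popleft()  # forget requests older than the window
        if len(accepted) >= limit:
            rejected += 1
        else:
            accepted.append(t)
    return rejected


# The nightly job as described: 1500 requests in the first ten seconds.
batch_job = [i * 10 / 1500 for i in range(1500)]
rejected_burst = replay(batch_job, limit=1000, window=60)

# The same job paced over two minutes instead.
paced_job = [i * 120 / 1500 for i in range(1500)]
rejected_paced = replay(paced_job, limit=1000, window=60)
```

A third of the burst is rejected, while the paced version passes untouched. That gives two possible fixes to discuss with the client: raise the limit, or spread the job out.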

When deploying changes to rate limits, deploy in monitor-only mode first. Log what would have been rate-limited, but do not actually reject. Look at the logs for a few days. Confirm only the traffic you want to limit is being caught. Then turn on enforcement.
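
Monitor-only mode can be as small as a flag around the enforcement decision. This is a minimal sketch; the function name `enforce`, the logger name, and the log message format are hypothetical.

```python
import logging

logger = logging.getLogger("ratelimit")


def enforce(allowed, client, path, monitor_only):
    """Return True if the request should proceed, False if it should get a 429."""
    if allowed:
        return True
    if monitor_only:
        # Record what enforcement would have done, but let the request through.
        logger.warning("would_rate_limit client=%s path=%s", client, path)
        return True
    logger.info("rate_limited client=%s path=%s", client, path)
    return False


dry_run = enforce(allowed=False, client="key-a", path="/search", monitor_only=True)
enforced = enforce(allowed=False, client="key-a", path="/search", monitor_only=False)
```

Keeping the flag per rule rather than global lets you tighten one limit in monitor-only mode while the others stay enforced.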

Common Mistakes

  • Fixed window only. Allows clients to burst at window boundaries; predictable abuse pattern.
  • Rate-limiting on IP for authenticated APIs. Shared IPs (offices, corporate NATs) get unfairly penalised. Use the API key or user ID.
  • No headers returned. Well-behaved clients have no way to self-regulate. They hit 429s by accident.
  • Uniform cost across endpoints. Cheap endpoints pay for the cost of expensive ones; expensive ones are not protected. Use weighted limits.
  • No 429 retry logic on the client side. Any client you ship should respect Retry-After and back off. Otherwise it hammers the API and makes the situation worse.
  • Limits set without traffic data. Either too tight (cuts off legitimate users) or too loose (fails to defend). Base the limits on real traffic shape.
  • Deploying enforcement without monitor-only mode. Surprises legitimate users with broken integrations.
  • Treating rate limit as the only defence. Rate limit defends against volume. It does not defend against well-paced abuse, credential stuffing at low rates, or injection attacks. Rate limit is one layer, not the layer.
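
On the client side, respecting Retry-After takes only a few lines. The sketch below assumes a hypothetical `send` callable that returns a `(status, headers, body)` tuple; it handles Retry-After given as a number of seconds and falls back to exponential backoff with jitter when the header is missing or in date form.

```python
import random
import time


def call_with_retry(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `send()` until it stops returning 429 or attempts run out."""
    body = None
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts; do not sleep before giving up
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            delay = float(retry_after)  # the server said how long to wait
        else:
            # No usable hint: exponential backoff plus jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        sleep(delay)
    return 429, body


# Simulated server: two rejections, then success.
responses = iter([
    (429, {"Retry-After": "2"}, ""),
    (429, {"Retry-After": "3"}, ""),
    (200, {}, "ok"),
])
waits = []
result = call_with_retry(lambda: next(responses), sleep=waits.append)
```

Here the client waits exactly as long as the server asked (2 seconds, then 3) and succeeds on the third attempt.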

What Good Looks Like

A well-designed API rate limit uses sliding window or token bucket, scopes limits to a mix of IP, API key, and endpoint as appropriate, weights expensive endpoints higher, returns X-RateLimit headers on every response, returns clear 429s with Retry-After when limits are hit, has been tested against real client behaviour before going live, and is monitored so unusual rate limit activity surfaces in dashboards. Legitimate clients almost never hit the limits because the headers let them self-regulate; abusive or buggy clients hit them quickly and predictably, protecting the rest of the system.

Next Steps

If the API needs authentication and authorisation as well, How to Secure an API covers that side. If the API exposes data from a multi-tenant database, How to Design a Multi-Tenant Database covers the underlying architecture. For building the broader API platform, see API Development.