API rate limiting
Limit at the gateway or app layer by user, tenant, IP, or API key; pick algorithms and storage (in-process, Redis); standardize 429 responses and Retry-After.
This page gives Agents a complete API rate limiting reference: Redis-based token bucket algorithm, sliding window algorithm, differentiated configuration for different endpoints (login vs query), standard response headers (X-RateLimit-*), and client-side exponential backoff code.
The SKILL should explain the trade-offs of token bucket (allows bursts), leaky bucket (smooth output), and sliding window; in distributed scenarios use Redis Lua scripts for atomicity, avoiding the race condition in INCR + EXPIRE.
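To make the leaky bucket side of that trade-off concrete, here is a minimal in-process sketch of a leaky bucket used as a queue. The class name `LeakyBucketQueue` and its parameters are hypothetical and not part of the original page. Where a token bucket would admit a burst immediately, this variant spaces accepted requests evenly and rejects once too many are waiting.

```typescript
// Hypothetical in-process leaky bucket (queue variant); illustrative only.
class LeakyBucketQueue {
  private nextFreeMs = 0 // earliest time the next request may be released
  private readonly intervalMs: number
  private readonly maxQueue: number

  constructor(ratePerSec: number, maxQueue: number) {
    this.intervalMs = 1000 / ratePerSec // constant spacing between releases
    this.maxQueue = maxQueue            // max requests allowed to wait
  }

  // Returns the delay in ms before the request is released, or null if the queue is full.
  schedule(nowMs: number): number | null {
    const releaseAt = Math.max(nowMs, this.nextFreeMs)
    const ahead = Math.ceil((releaseAt - nowMs) / this.intervalMs) // releases still pending
    if (ahead > this.maxQueue) return null
    this.nextFreeMs = releaseAt + this.intervalMs
    return releaseAt - nowMs
  }
}

// Four requests arrive at the same instant: 2 per second, at most 2 waiting
const queue = new LeakyBucketQueue(2, 2)
const delays = [0, 0, 0, 0].map(t => queue.schedule(t))
// delays: [0, 500, 1000, null]
```

The burst is smoothed into one release every 500 ms, which protects a downstream service with a fixed throughput but adds latency that a token bucket would not.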
Document exponential backoff, jitter, and idempotency keys in client documentation; layer with circuit breaking — rate limits protect resources, breakers isolate failures.
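The split between rate limits and breakers can be illustrated with a small circuit breaker sketch. This is an assumption-laden illustration, not the page's implementation: the class name, the failure threshold, and the cooldown are all hypothetical, and time is passed in explicitly so the behavior is easy to test.

```typescript
// Hypothetical minimal circuit breaker; thresholds and names are illustrative.
type BreakerState = 'closed' | 'open' | 'half-open'

class CircuitBreaker {
  private failures = 0
  private openedAtMs = 0
  private state: BreakerState = 'closed'
  private readonly failureThreshold: number
  private readonly cooldownMs: number

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold
    this.cooldownMs = cooldownMs
  }

  // While open, calls are rejected locally; after the cooldown, trial calls are let through
  canRequest(nowMs: number): boolean {
    if (this.state === 'open' && nowMs - this.openedAtMs >= this.cooldownMs) {
      this.state = 'half-open'
    }
    return this.state !== 'open'
  }

  recordSuccess(): void {
    this.failures = 0
    this.state = 'closed'
  }

  recordFailure(nowMs: number): void {
    this.failures += 1
    // A failed trial call, or too many failures in a row, opens the breaker
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open'
      this.openedAtMs = nowMs
    }
  }

  getState(): BreakerState {
    return this.state
  }
}
```

A rate limiter counts requests per key regardless of outcome, while this breaker reacts only to failures of a dependency, so the two are typically layered rather than substituted for one another.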
Rate-limit decision flow
[ Ingress: derive key user / tenant / ip / api_key ]
                      │
                      ▼
[ Storage: in-process counter / Redis / gateway plugin ]
                      │
                      ▼
[ Algorithm: token bucket (bursty) / leaky bucket / sliding or fixed window ]
                      │
          ┌───────────┴───────────┐
          ▼                       ▼
[ Allow: deduct quota ]   [ Deny: 429 Too Many Requests ]
          │                       │
          │                       ├── Retry-After: seconds or HTTP-date
          │                       ├── RateLimit-* / X-RateLimit-* (per team standard)
          │                       └── body: machine-readable code + human text
          ▼
[ Metrics: reject rate, hot keys, quota remainder sampling ]
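For the fixed-window option in the flow above, the INCR + EXPIRE race mentioned earlier arises when the two commands are sent separately: if the process dies after INCR but before EXPIRE, the counter never expires. A minimal sketch of the atomic alternative follows. `EvalClient` and `fixedWindowAllow` are hypothetical names introduced here; the interface is a stand-in for a client that exposes `eval(script, numKeys, ...args)`, as ioredis does.

```typescript
// Hypothetical minimal client shape, so the sketch does not depend on a specific Redis library
interface EvalClient {
  eval(script: string, numKeys: number, ...args: (string | number)[]): Promise<unknown>
}

// INCR and PEXPIRE run inside one script, so no counter can be left without a TTL
const FIXED_WINDOW_LUA = `
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('PEXPIRE', KEYS[1], ARGV[1])
end
return count
`

async function fixedWindowAllow(
  client: EvalClient,
  key: string,
  limit: number,
  windowMs: number
): Promise<{ allowed: boolean; remaining: number }> {
  const count = Number(await client.eval(FIXED_WINDOW_LUA, 1, key, windowMs))
  return { allowed: count <= limit, remaining: Math.max(0, limit - count) }
}
```

The window starts at the first request for a key and ends when the TTL fires, so a client can still send up to twice the limit across a window boundary; that is the usual reason to prefer the sliding window or token bucket.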
Token bucket and sliding window Redis implementation
Token bucket algorithm (Redis Lua script, atomic operations):
// ratelimit/tokenBucket.ts
import { Redis } from 'ioredis'

// Executed atomically by Redis: refill, check, and deduct cannot interleave with other clients
const TOKEN_BUCKET_LUA = `
local key = KEYS[1]
local capacity = tonumber(ARGV[1]) -- bucket capacity (max burst)
local rate = tonumber(ARGV[2]) -- refill rate per second
local now = tonumber(ARGV[3]) -- current timestamp (ms)
local cost = tonumber(ARGV[4]) -- tokens consumed by this request
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now
-- Refill for the elapsed time; clamp at 0 because clocks on different app servers can disagree
local elapsed = math.max(0, now - last_refill) / 1000
tokens = math.min(capacity, tokens + elapsed * rate)
local ttl = math.ceil(capacity / rate) + 1 -- seconds until an idle bucket is full again
if tokens < cost then
  -- Not enough tokens: store the refilled state and return the wait in ms
  redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
  redis.call('EXPIRE', key, ttl)
  return {0, math.ceil((cost - tokens) / rate * 1000)}
end
tokens = tokens - cost
redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, ttl)
return {1, tokens} -- Redis truncates the fractional part of the reply
`

export async function tokenBucketAllow(
  redis: Redis,
  key: string,
  options: { capacity: number; rate: number; cost?: number }
): Promise<{ allowed: boolean; remaining: number; retryAfterMs: number }> {
  // On allow, the second element is the remaining tokens; on deny, it is the wait in ms
  const [allowed, value] = (await redis.eval(
    TOKEN_BUCKET_LUA, 1, key,
    options.capacity, options.rate, Date.now(), options.cost ?? 1
  )) as [number, number]
  return {
    allowed: allowed === 1,
    remaining: allowed === 1 ? value : 0,
    retryAfterMs: allowed === 0 ? value : 0,
  }
}
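The refill arithmetic in the script above can be checked without Redis. The following is a sketch of an in-memory port for unit tests, not a replacement for the Lua script. The names `BucketState` and `takeToken` are hypothetical, and the clamp on negative elapsed time is an assumption added here to guard against clocks that move backwards.

```typescript
// Hypothetical in-memory port of the token bucket math, for tests only
interface BucketState { tokens: number; lastRefillMs: number }

interface TakeResult {
  allowed: boolean
  remaining: number
  retryAfterMs: number
  state: BucketState
}

function takeToken(
  state: BucketState | undefined,
  capacity: number,
  rate: number, // tokens per second
  nowMs: number,
  cost: number = 1
): TakeResult {
  // A missing bucket starts full, as in the script's "or capacity" default
  const prev = state ?? { tokens: capacity, lastRefillMs: nowMs }
  const elapsedSec = Math.max(0, nowMs - prev.lastRefillMs) / 1000 // assumption: clamp clock skew
  const tokens = Math.min(capacity, prev.tokens + elapsedSec * rate)
  if (tokens < cost) {
    const retryAfterMs = Math.ceil(((cost - tokens) / rate) * 1000)
    return { allowed: false, remaining: 0, retryAfterMs, state: { tokens, lastRefillMs: nowMs } }
  }
  const left = tokens - cost
  return {
    allowed: true,
    remaining: Math.floor(left), // whole tokens only, like the integer reply from Redis
    retryAfterMs: 0,
    state: { tokens: left, lastRefillMs: nowMs },
  }
}

// capacity 5, 1 token per second: five requests pass at t=0, the sixth must wait 1000 ms
let bucket: BucketState | undefined
const outcomes: TakeResult[] = []
for (let i = 0; i < 6; i++) {
  const r = takeToken(bucket, 5, 1, 0)
  bucket = r.state
  outcomes.push(r)
}
```

After 2.5 idle seconds the bucket holds 2.5 tokens, so two more requests pass and a third is told to wait 500 ms.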
Sliding window algorithm (Redis Sorted Set implementation):
// ratelimit/slidingWindow.ts
import { Redis } from 'ioredis'

export async function slidingWindowAllow(
  redis: Redis,
  key: string,
  limit: number, // max requests in window
  windowMs: number // window size (milliseconds)
): Promise<{ allowed: boolean; remaining: number }> {
  const now = Date.now()
  const windowStart = now - windowMs
  // Unique member, so a rejected request can later remove exactly its own entry
  const member = `${now}-${Math.random()}`
  // MULTI/EXEC runs the four commands as one transaction; a plain pipeline is not atomic
  const results = await redis.multi()
    .zremrangebyscore(key, '-inf', windowStart) // remove old records outside the window
    .zcard(key) // count requests already in the window
    .zadd(key, now, member) // add current request (score = timestamp)
    .pexpire(key, windowMs) // set key expiry (auto-cleanup)
    .exec()
  const count = results![1][1] as number
  if (count >= limit) {
    // Over limit: remove our own record. ZPOPMAX could evict a concurrent request's entry instead.
    await redis.zrem(key, member)
    return { allowed: false, remaining: 0 }
  }
  return { allowed: true, remaining: limit - count - 1 }
}
Differentiated rate limit configuration for different endpoints:
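// middleware/rateLimit.ts
import { Request, Response, NextFunction } from 'express'
import { Redis } from 'ioredis'
import { tokenBucketAllow } from '../ratelimit/tokenBucket'

// Assumption: connection settings come from a REDIS_URL environment variable
const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379')

// capacity and rate drive the token bucket; windowMs applies only if slidingWindowAllow is used
const RATE_LIMIT_CONFIGS = {
  // Login: strict rate limiting to prevent brute force (burst of 5, refilling 5 tokens per 15 min)
  login: { capacity: 5, rate: 5 / (15 * 60), windowMs: 15 * 60_000 },
  // Query: relaxed, allows bursts
  query: { capacity: 100, rate: 10, windowMs: 60_000 },
  // Write operations: moderate
  write: { capacity: 30, rate: 5, windowMs: 60_000 },
  // File upload: strict
  upload: { capacity: 10, rate: 1/60, windowMs: 60_000 },
}

export function rateLimit(type: keyof typeof RATE_LIMIT_CONFIGS) {
  return async (req: Request, res: Response, next: NextFunction) => {
    try {
      const config = RATE_LIMIT_CONFIGS[type]
      // Assumption: an earlier auth middleware sets req.user; otherwise fall back to the client IP
      const userId = (req as Request & { user?: { id: string } }).user?.id
      const key = `rl:${type}:${userId ?? req.ip}`
      const result = await tokenBucketAllow(redis, key, config)
      // Standard response headers (on allowed responses retryAfterMs is 0, so Reset is the current time)
      res.set({
        'X-RateLimit-Limit': String(config.capacity),
        'X-RateLimit-Remaining': String(result.remaining),
        'X-RateLimit-Reset': String(Math.ceil((Date.now() + result.retryAfterMs) / 1000)),
      })
      if (!result.allowed) {
        res.set('Retry-After', String(Math.ceil(result.retryAfterMs / 1000)))
        // Set the media type first; res.json() keeps an existing Content-Type
        return res.status(429).type('application/problem+json').json({
          type: 'https://api.example.com/problems/rate-limited',
          title: 'Too Many Requests',
          status: 429,
          retryAfter: Math.ceil(result.retryAfterMs / 1000),
        })
      }
      next()
    } catch (err) {
      // Redis failure: pass to the error handler instead of leaving the request hanging
      next(err)
    }
  }
}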
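The header logic in the middleware above can be pulled into a pure function so it is testable without Express or Redis. This is a sketch under assumptions: `buildRateLimitHeaders` and `LimitResult` are hypothetical names, and because the X-RateLimit-* headers are not standardized, treating Reset on an allowed response as "the time at which the bucket is full again" is one possible convention rather than the page's definition.

```typescript
// Hypothetical helper: derive response headers from a token bucket result
interface LimitResult { allowed: boolean; remaining: number; retryAfterMs: number }

function buildRateLimitHeaders(
  result: LimitResult,
  capacity: number,
  rate: number, // tokens per second
  nowMs: number
): Record<string, string> {
  // Allowed: time until the bucket is full again. Denied: time until the request would pass.
  const resetInMs = result.allowed
    ? Math.ceil(((capacity - result.remaining) / rate) * 1000)
    : result.retryAfterMs
  const headers: Record<string, string> = {
    'X-RateLimit-Limit': String(capacity),
    'X-RateLimit-Remaining': String(Math.max(0, Math.floor(result.remaining))),
    'X-RateLimit-Reset': String(Math.ceil((nowMs + resetInMs) / 1000)), // Unix seconds
  }
  if (!result.allowed) {
    // Retry-After is only meaningful on the 429 response
    headers['Retry-After'] = String(Math.ceil(result.retryAfterMs / 1000))
  }
  return headers
}

const sampleNowMs = 1_744_380_000_000
const allowedHeaders = buildRateLimitHeaders(
  { allowed: true, remaining: 90, retryAfterMs: 0 }, 100, 10, sampleNowMs
)
const deniedHeaders = buildRateLimitHeaders(
  { allowed: false, remaining: 0, retryAfterMs: 46_500 }, 100, 10, sampleNowMs
)
```

With these inputs the denied response carries `Retry-After: 47`, since 46.5 seconds is rounded up so the client never retries early.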
Rate limit response headers and client backoff
Standard response headers (X-RateLimit-* series):
# 429 response example (with complete rate limit headers)
HTTP/1.1 429 Too Many Requests
Content-Type: application/problem+json
X-RateLimit-Limit: 100 # quota ceiling within window
X-RateLimit-Remaining: 0 # remaining count
X-RateLimit-Reset: 1744380060 # Unix timestamp (window reset time)
Retry-After: 47 # suggested wait in seconds (RFC 7231)
{
"type": "https://api.example.com/problems/rate-limited",
"title": "Too Many Requests",
"status": 429,
"detail": "API rate limit exceeded. Limit: 100 requests per minute.",
"retryAfter": 47
}
Client-side exponential backoff (with jitter):
// utils/fetchWithRetry.ts
interface RetryOptions {
  maxRetries?: number
  baseDelayMs?: number
  maxDelayMs?: number
  jitter?: boolean
}

async function fetchWithRetry(
  url: string,
  init?: RequestInit,
  options: RetryOptions = {}
): Promise<Response> {
  const { maxRetries = 3, baseDelayMs = 1000, maxDelayMs = 30000, jitter = true } = options
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await fetch(url, init)
    if (response.status !== 429) return response
    if (attempt === maxRetries) return response

    // Prefer the server's Retry-After header (delay in seconds or an HTTP-date)
    const retryAfter = response.headers.get('Retry-After')?.trim()
    let delayMs: number
    let fromServer = true
    if (retryAfter && /^\d+$/.test(retryAfter)) {
      delayMs = parseInt(retryAfter, 10) * 1000
    } else if (retryAfter && !Number.isNaN(Date.parse(retryAfter))) {
      delayMs = Math.max(0, Date.parse(retryAfter) - Date.now())
    } else {
      // Exponential backoff: 2^attempt * baseDelay, capped at maxDelayMs
      delayMs = Math.min(Math.pow(2, attempt) * baseDelayMs, maxDelayMs)
      fromServer = false
    }
    // Jitter avoids synchronized retry storms: ±25% on computed backoff,
    // 0 to +25% on a server-provided delay so the retry never lands early
    if (jitter) {
      const factor = fromServer ? 1 + Math.random() * 0.25 : 0.75 + Math.random() * 0.5
      delayMs = delayMs * factor
    }
    console.warn(`Rate limited. Retrying in ${Math.round(delayMs)}ms (attempt ${attempt + 1}/${maxRetries})`)
    await new Promise(resolve => setTimeout(resolve, delayMs))
  }
  // Unreachable: the loop always returns on the last attempt; kept to satisfy the compiler
  throw new Error('Max retries exceeded')
}
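Because the retry loop above resends the same request, a write that succeeded on the server but returned 429 or timed out on the way back could be applied twice. One common mitigation is to attach the same idempotency key to every attempt. The sketch below is illustrative: `RetryableRequest` and `withIdempotencyKey` are hypothetical names, and the `Idempotency-Key` header is a widely used convention that the server must be built to honor, not something this page's API is known to support.

```typescript
// Hypothetical request shape, kept minimal so the sketch has no runtime dependencies
interface RetryableRequest {
  method: string
  headers: Record<string, string>
  body?: string
}

// Returns a copy of the request carrying the key; safe methods are returned unchanged
function withIdempotencyKey(req: RetryableRequest, key: string): RetryableRequest {
  const method = req.method.toUpperCase()
  if (method === 'GET' || method === 'HEAD') return req
  return { ...req, headers: { ...req.headers, 'Idempotency-Key': key } }
}

// Generate the key once per logical operation, then reuse it for every retry
const order: RetryableRequest = {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ sku: 'example-sku', quantity: 1 }),
}
const keyedOrder = withIdempotencyKey(order, 'order-2f6c1a')
```

The important design point is that the key is created outside the retry loop; generating a fresh key per attempt would defeat the purpose.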
- HTTP 429 vs 503: 429 means this client exceeded its quota; 503 means the service is temporarily unavailable for everyone. Keep the semantics distinct in docs and in client branches.
- Proxies and CDNs may strip Retry-After; verify end-to-end in integration tests.
- Annotate quotas and the 429 response schema in OpenAPI, aligned with the API contract skill.
Fairness and distributed caveats
- Shared NAT / corporate egress: pure IP limits hurt—combine with cookie, JWT, API key, or session.
- Paid vs free: tiered quotas and whitelisted ops APIs need auditing so internal routes are not bypass channels.
- Clients: exponential backoff + global limiter + jitter with idempotency keys; ban unbounded retry storms.
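The key-selection advice in the list above can be sketched as a small function that prefers the most specific identity available and only falls back to the IP address. The `RequestIdentity` shape, the `deriveRateLimitKey` name, and the key format are assumptions for illustration; a real service would populate the identity from its own auth layer.

```typescript
// Hypothetical identity extracted by upstream middleware
interface RequestIdentity {
  userId?: string
  apiKeyId?: string // an identifier for the key, not the secret itself
  tenantId?: string
  ip: string
}

// Prefer per-user, then per-API-key, then per-tenant; IP is the last resort
function deriveRateLimitKey(scope: string, id: RequestIdentity): string {
  if (id.userId) return `rl:${scope}:user:${id.userId}`
  if (id.apiKeyId) return `rl:${scope}:key:${id.apiKeyId}`
  if (id.tenantId) return `rl:${scope}:tenant:${id.tenantId}`
  return `rl:${scope}:ip:${id.ip}`
}

// Two users behind the same corporate egress IP get separate buckets
const aliceKey = deriveRateLimitKey('query', { userId: 'u-1', ip: '203.0.113.7' })
const bobKey = deriveRateLimitKey('query', { userId: 'u-2', ip: '203.0.113.7' })
const anonKey = deriveRateLimitKey('query', { ip: '203.0.113.7' })
```

Unauthenticated endpoints such as login still end up on the IP fallback, which is why those routes usually get the strictest quota.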
Parsing Retry-After
A Retry-After value has two valid forms. An integer is a delay in seconds per RFC 7231; anything else should be parsed as an HTTP-date. Convert the result to the client's local timezone only for display.
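The parsing rule above can be sketched as a small helper that accepts the header value (without the header name) and returns the absolute time at which a retry is allowed. `parseRetryAfter` is a hypothetical name, and the current time is injectable so the result is deterministic in tests.

```typescript
// Hypothetical helper: turn a Retry-After value into an absolute retry time
function parseRetryAfter(value: string, nowMs: number = Date.now()): Date | null {
  const trimmed = value.trim()
  // Form 1: non-negative integer, interpreted as delay in seconds
  if (/^\d+$/.test(trimmed)) {
    return new Date(nowMs + parseInt(trimmed, 10) * 1000)
  }
  // Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
  const parsed = Date.parse(trimmed)
  return Number.isNaN(parsed) ? null : new Date(parsed)
}

const fromSeconds = parseRetryAfter('47', 1_000_000)
const fromDate = parseRetryAfter('Wed, 21 Oct 2015 07:28:00 GMT')
const invalid = parseRetryAfter('')

// For display only: render in the viewer's local timezone
const display = fromDate ? fromDate.toLocaleString() : 'no usable Retry-After value'
```

Returning `null` for unparseable input lets the caller fall back to its own exponential backoff instead of guessing.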
---
name: api-rate-limiting
description: Rate-limit dimensions, algorithms, and 429 semantics
---
# Rules
- Rate limit key: authenticated user ID (preferred), with IP as the fallback; IP alone lumps together users behind a shared NAT
- Algorithm: token bucket (allows bursts) using Redis Lua atomic script; sliding window using ZSET
- Login/register: strict rate limiting (5 per 15min); query: relaxed (100 per min)
- Response headers: X-RateLimit-Limit / Remaining / Reset + Retry-After (seconds)
- 429 response body uses the problem+json format (RFC 9457, which obsoletes RFC 7807)
# Steps
1. Confirm rate limit dimensions (user_id / api_key / ip) and per-endpoint quota policy
2. Implement middleware: call tokenBucketAllow() or slidingWindowAllow()
3. Mount order: IP-keyed limits before body parsing and auth (save resources); user- or API-key-keyed limits after auth, once the identity is known
4. Set standard response headers: X-RateLimit-* and Retry-After
5. Client documentation: exponential backoff (2^n * base) + jitter + max retries
6. Monitoring: reject rate, hot keys, per-endpoint 429 ratio and P99 latency
7. Load test validation: rate limiting behavior under 10x traffic meets expectations
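
Step 6 can be prototyped in-process before wiring up a real metrics backend. The sketch below tracks the reject rate and the keys that are rejected most often. `RateLimitMetrics` is a hypothetical class, and it keeps every key in memory, so it is suitable for a load-test harness rather than for production.

```typescript
// Hypothetical in-memory metrics for a load test or local debugging
class RateLimitMetrics {
  private allowedCount = 0
  private rejectedCount = 0
  private readonly rejectsByKey = new Map<string, number>()

  record(key: string, wasAllowed: boolean): void {
    if (wasAllowed) {
      this.allowedCount += 1
      return
    }
    this.rejectedCount += 1
    this.rejectsByKey.set(key, (this.rejectsByKey.get(key) ?? 0) + 1)
  }

  // Fraction of all recorded requests that were rejected (0 when nothing is recorded)
  rejectRate(): number {
    const total = this.allowedCount + this.rejectedCount
    return total === 0 ? 0 : this.rejectedCount / total
  }

  // Keys with the most rejections, highest first
  hotKeys(topN: number): Array<[string, number]> {
    return Array.from(this.rejectsByKey.entries())
      .sort((a, b) => b[1] - a[1])
      .slice(0, topN)
  }
}

const metrics = new RateLimitMetrics()
metrics.record('rl:query:user:a', true)
metrics.record('rl:query:user:b', false)
metrics.record('rl:query:user:b', false)
metrics.record('rl:query:user:c', false)
```

In the load test from step 7, a reject rate far from the expected value or a single dominant hot key is usually the first sign that the key derivation or the quota is misconfigured.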