API rate limiting

Limit at the gateway or app layer by user, tenant, IP, or API key; pick algorithms and storage (in-process, Redis); standardize 429 responses and Retry-After.

This page gives agents an API rate limiting reference: a Redis-based token bucket, a sliding window, per-endpoint configuration (login vs query), standard response headers (X-RateLimit-*), and client-side exponential backoff code.

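The SKILL should explain the trade-offs of token bucket (allows bursts up to the bucket capacity), leaky bucket (smooths output to a constant rate), and sliding window (bounds requests over any trailing interval). In distributed scenarios, use Redis Lua scripts for atomicity: a separate INCR followed by EXPIRE is racy, because a crash or error between the two commands leaves a counter with no TTL.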
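The leaky bucket is named in the trade-offs above but not implemented elsewhere on this page, so here is a minimal in-memory sketch of the queue-style variant for comparison. It only computes when a request may proceed; `LeakyBucketQueue`, `schedule`, and the parameter names are illustrative assumptions, not a library API.

```typescript
// Leaky bucket as a queue (single process, illustrative only).
// Requests are released one per interval; at most `capacity` may be waiting.
class LeakyBucketQueue {
  private nextFreeMs = 0
  private readonly capacity: number
  private readonly intervalMs: number

  constructor(capacity: number, ratePerSec: number) {
    this.capacity = capacity
    this.intervalMs = 1000 / ratePerSec
  }

  // Returns how long the caller should wait (ms) before proceeding,
  // or null when the queue is full and the request should get a 429.
  schedule(nowMs: number = Date.now()): number | null {
    const startMs = Math.max(nowMs, this.nextFreeMs)
    const queuedAhead = (startMs - nowMs) / this.intervalMs
    if (queuedAhead >= this.capacity) return null
    this.nextFreeMs = startMs + this.intervalMs
    return startMs - nowMs
  }
}
```

With `new LeakyBucketQueue(2, 2)`, three requests arriving at the same instant get waits of 0 ms, 500 ms, and a rejection: the output is spaced evenly, whereas a token bucket of capacity 2 would let the first two through immediately.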

Document exponential backoff, jitter, and idempotency keys in client documentation; layer with circuit breaking — rate limits protect resources, breakers isolate failures.
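To make the split between the two layers concrete, the following is a minimal circuit breaker sketch: it tracks failures of a downstream dependency, whereas a rate limiter counts requests per caller. The class name, the thresholds, and the simplified half-open behavior are assumptions for illustration.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open'

// Minimal circuit breaker (illustrative). It isolates a failing dependency;
// it does not enforce per-client quotas.
class CircuitBreaker {
  private state: BreakerState = 'closed'
  private failures = 0
  private openedAtMs = 0
  private readonly failureThreshold: number
  private readonly cooldownMs: number

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold
    this.cooldownMs = cooldownMs
  }

  // Whether a call may be attempted right now.
  canRequest(nowMs: number = Date.now()): boolean {
    if (this.state === 'open' && nowMs - this.openedAtMs >= this.cooldownMs) {
      this.state = 'half-open' // cooldown elapsed: allow trial calls
    }
    return this.state !== 'open'
  }

  recordSuccess(): void {
    this.failures = 0
    this.state = 'closed'
  }

  recordFailure(nowMs: number = Date.now()): void {
    this.failures += 1
    // A failed trial call, or too many consecutive failures, opens the breaker
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open'
      this.openedAtMs = nowMs
    }
  }
}
```

One reasonable design choice, not prescribed by this page, is to feed the breaker only timeouts and 5xx responses and to handle 429 through backoff instead.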

Rate-limit decision flow

  [ Ingress: derive key user / tenant / ip / api_key ]
                    │
                    ▼
         [ Storage: in-process counter / Redis / gateway plugin ]
                    │
                    ▼
    [ Algorithm: token bucket (bursty) / leaky bucket / sliding or fixed window ]
                    │
           ┌────────┴────────┐
           ▼                 ▼
      [ Allow: deduct quota ]   [ Deny: 429 Too Many Requests ]
           │                 │
           │                 ├── Retry-After: seconds or HTTP-date
           │                 ├── RateLimit-* / X-RateLimit-* (per team standard)
           │                 └── body: machine-readable code + human text
           ▼
    [ Metrics: reject rate, hot keys, quota remainder sampling ]

When implementing or reviewing, document “how keys are computed” and “what we return on deny” together so gateway and app layers do not disagree silently.
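One way to keep "how keys are computed" in a single place is a small shared function that both the gateway and the app layer import. The sketch below is an assumption about precedence (user, then API key, then tenant, then IP); `Identity` and `deriveRateLimitKey` are hypothetical names.

```typescript
interface Identity {
  userId?: string
  tenantId?: string
  apiKey?: string
  ip: string
}

// Precedence is an assumption: most specific authenticated identity first, IP last.
// In production, consider hashing the API key rather than storing it in the key name.
function deriveRateLimitKey(scope: string, id: Identity): string {
  if (id.userId) return `rl:${scope}:user:${id.userId}`
  if (id.apiKey) return `rl:${scope}:key:${id.apiKey}`
  if (id.tenantId) return `rl:${scope}:tenant:${id.tenantId}`
  return `rl:${scope}:ip:${id.ip}`
}
```

Prefixing the identity type (`user:`, `ip:`) keeps a user ID from colliding with an IP-shaped string and makes hot keys easier to read in metrics.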

Token bucket and sliding window Redis implementation

Token bucket algorithm (Redis Lua script, atomic operations):

// ratelimit/tokenBucket.ts
import { Redis } from 'ioredis'

const TOKEN_BUCKET_LUA = `
local key = KEYS[1]
local capacity = tonumber(ARGV[1])    -- bucket capacity (max burst)
local rate = tonumber(ARGV[2])        -- refill rate per second
local now = tonumber(ARGV[3])         -- current timestamp (ms)
local cost = tonumber(ARGV[4])        -- tokens consumed by this request

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

-- Calculate tokens to refill
local elapsed = (now - last_refill) / 1000
local refill = elapsed * rate
tokens = math.min(capacity, tokens + refill)

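if tokens < cost then
  -- Not enough tokens: persist the refilled balance and return the wait in ms
  redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
  redis.call('EXPIRE', key, math.ceil(capacity / rate) + 1)
  return {0, math.ceil((cost - tokens) / rate * 1000)}
end

tokens = tokens - cost
-- HSET with multiple fields replaces the deprecated HMSET (Redis 4.0+)
redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / rate) + 1)
-- Redis truncates Lua numbers to integers in replies, so floor explicitly
return {1, math.floor(tokens)}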
`

export async function tokenBucketAllow(
  redis: Redis,
  key: string,
  options: { capacity: number; rate: number; cost?: number }
): Promise<{ allowed: boolean; remaining: number; retryAfterMs: number }> {
  const [allowed, value] = await redis.eval(
    TOKEN_BUCKET_LUA, 1, key,
    options.capacity, options.rate, Date.now(), options.cost ?? 1
  ) as [number, number]
  return {
    allowed: allowed === 1,
    remaining: allowed === 1 ? value : 0,
    retryAfterMs: allowed === 0 ? value : 0,
  }
}
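The refill arithmetic in the Lua script can be checked without Redis. The following is a minimal in-memory mirror of the same steps (refill, compare, deduct), intended for unit tests of the math only; `BucketState` and `takeTokens` are hypothetical names, and the state is passed in and returned rather than stored.

```typescript
interface BucketState {
  tokens: number
  lastRefillMs: number
}

// Pure mirror of the Lua logic above: refill by elapsed time, then try to deduct.
function takeTokens(
  state: BucketState | undefined,
  capacity: number,
  ratePerSec: number,
  nowMs: number,
  cost = 1
): { allowed: boolean; state: BucketState; retryAfterMs: number } {
  // A missing key behaves like a full bucket, as in the script
  const prev = state ?? { tokens: capacity, lastRefillMs: nowMs }
  const elapsedSec = (nowMs - prev.lastRefillMs) / 1000
  const tokens = Math.min(capacity, prev.tokens + elapsedSec * ratePerSec)

  if (tokens < cost) {
    return {
      allowed: false,
      state: { tokens, lastRefillMs: nowMs },
      retryAfterMs: Math.ceil(((cost - tokens) / ratePerSec) * 1000),
    }
  }
  return {
    allowed: true,
    state: { tokens: tokens - cost, lastRefillMs: nowMs },
    retryAfterMs: 0,
  }
}
```

With capacity 2 and a rate of 1 token per second, two requests at t=0 pass, a third is denied with a 1000 ms wait, and a request at t=500 ms is denied with a 500 ms wait.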

Sliding window algorithm (Redis Sorted Set implementation):

// ratelimit/slidingWindow.ts
import { Redis } from 'ioredis'

export async function slidingWindowAllow(
  redis: Redis,
  key: string,
  limit: number,       // max requests in window
  windowMs: number     // window size (milliseconds)
): Promise<{ allowed: boolean; remaining: number }> {
  const now = Date.now()
  const windowStart = now - windowMs
  // Unique member so two requests in the same millisecond do not overwrite each other
  const member = `${now}-${Math.random()}`

  // Note: a pipeline batches commands but is not atomic; use MULTI or a Lua script
  // if concurrent requests on the same key must never overshoot the limit
  const pipeline = redis.pipeline()
  // Remove old records outside the window
  pipeline.zremrangebyscore(key, '-inf', windowStart)
  // Count requests in current window (before adding this one)
  pipeline.zcard(key)
  // Add current request (score = timestamp, member = unique ID)
  pipeline.zadd(key, now, member)
  // Set key expiry (auto-cleanup)
  pipeline.pexpire(key, windowMs)

  const results = await pipeline.exec()
  const count = (results![1][1] as number)

  if (count >= limit) {
    // Over limit: remove exactly the record we added (ZPOPMAX could evict
    // a concurrent request's newer entry instead)
    await redis.zrem(key, member)
    return { allowed: false, remaining: 0 }
  }
  return { allowed: true, remaining: limit - count - 1 }
}
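For the INCR + EXPIRE race mentioned in the overview, the usual remedy is to run both commands inside one Lua script. The sketch below applies that to a fixed window counter, a simpler algorithm than the two above. `EvalClient` is a hypothetical minimal interface standing in for your Redis client, and `fixedWindowAllow` is an illustrative name.

```typescript
// Hypothetical minimal client shape; adapt the call to your Redis library.
interface EvalClient {
  eval(script: string, numKeys: number, ...args: (string | number)[]): Promise<unknown>
}

// INCR and PEXPIRE run in one script, so the counter can never exist without a TTL.
const FIXED_WINDOW_LUA = `
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('PEXPIRE', KEYS[1], ARGV[1])
end
return {count, redis.call('PTTL', KEYS[1])}
`

async function fixedWindowAllow(
  client: EvalClient,
  key: string,
  limit: number,
  windowMs: number
): Promise<{ allowed: boolean; remaining: number; resetMs: number }> {
  const [count, ttlMs] = (await client.eval(FIXED_WINDOW_LUA, 1, key, windowMs)) as [number, number]
  return {
    allowed: count <= limit,
    remaining: Math.max(0, limit - count),
    resetMs: ttlMs, // time until the window's counter expires
  }
}
```

A fixed window is cheap (one key, one integer) but allows up to twice the limit across a window boundary, which is the trade-off the sliding window avoids.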

Differentiated rate limit configuration for different endpoints:

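// middleware/rateLimit.ts
import { Request, Response, NextFunction } from 'express'
import { Redis } from 'ioredis'
// Import path and connection settings are assumptions; adjust to your project layout
import { tokenBucketAllow } from '../ratelimit/tokenBucket'

const redis = new Redis()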

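// Token bucket: capacity is the burst size, rate is the sustained refill in tokens per second.
// windowMs is not read by tokenBucketAllow; it applies only if an endpoint uses slidingWindowAllow.
const RATE_LIMIT_CONFIGS = {
  // Login: strict rate limiting to prevent brute force (sustained 5 per 15 minutes)
  login: { capacity: 5, rate: 5 / (15 * 60), windowMs: 15 * 60_000 },
  // Query: relaxed, allows bursts (sustained 100 per minute)
  query: { capacity: 100, rate: 100 / 60, windowMs: 60_000 },
  // Write operations: moderate
  write: { capacity: 30, rate: 5, windowMs: 60_000 },
  // File upload: strict
  upload: { capacity: 10, rate: 1/60, windowMs: 60_000 },
}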

export function rateLimit(type: keyof typeof RATE_LIMIT_CONFIGS) {
  return async (req: Request, res: Response, next: NextFunction) => {
    const config = RATE_LIMIT_CONFIGS[type]
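    // req.user is populated by upstream auth middleware (needs a Request type
    // augmentation); unauthenticated requests fall back to the client IP
    const key = `rl:${type}:${req.user?.id ?? req.ip}`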
    const result = await tokenBucketAllow(redis, key, config)

    // Standard response headers
    res.set({
      'X-RateLimit-Limit': String(config.capacity),
      'X-RateLimit-Remaining': String(result.remaining),
      'X-RateLimit-Reset': String(Math.ceil((Date.now() + result.retryAfterMs) / 1000)),
    })

    if (!result.allowed) {
      res.set('Retry-After', String(Math.ceil(result.retryAfterMs / 1000)))
      return res.status(429).json({
        type: 'https://api.example.com/problems/rate-limited',
        title: 'Too Many Requests',
        status: 429,
        retryAfter: Math.ceil(result.retryAfterMs / 1000),
      })
    }
    next()
  }
}

Rate limit response headers and client backoff

Standard response headers (X-RateLimit-* series):

# 429 response example (with complete rate limit headers)
HTTP/1.1 429 Too Many Requests
Content-Type: application/problem+json
X-RateLimit-Limit: 100        # quota ceiling within window
X-RateLimit-Remaining: 0      # remaining count
X-RateLimit-Reset: 1744380060 # Unix timestamp (window reset time)
Retry-After: 47               # suggested wait in seconds (RFC 7231)

{
  "type": "https://api.example.com/problems/rate-limited",
  "title": "Too Many Requests",
  "status": 429,
  "detail": "API rate limit exceeded. Limit: 100 requests per minute.",
  "retryAfter": 47
}
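The header values and the body in the example above are all derived from two inputs: the quota and the wait time. A minimal sketch of that derivation follows; `buildTooManyRequests` is a hypothetical helper, and the `type` URL is the placeholder used on this page.

```typescript
interface RateLimitDenial {
  status: number
  headers: Record<string, string>
  body: { type: string; title: string; status: number; detail: string; retryAfter: number }
}

// Builds a consistent 429 so headers and body cannot drift apart.
function buildTooManyRequests(
  limit: number,
  retryAfterMs: number,
  nowMs: number = Date.now()
): RateLimitDenial {
  const retryAfterSec = Math.ceil(retryAfterMs / 1000)
  return {
    status: 429,
    headers: {
      'Content-Type': 'application/problem+json',
      'X-RateLimit-Limit': String(limit),
      'X-RateLimit-Remaining': '0',
      // Unix timestamp (seconds) at which the client may retry
      'X-RateLimit-Reset': String(Math.ceil((nowMs + retryAfterMs) / 1000)),
      'Retry-After': String(retryAfterSec),
    },
    body: {
      type: 'https://api.example.com/problems/rate-limited',
      title: 'Too Many Requests',
      status: 429,
      detail: `API rate limit exceeded. Retry in ${retryAfterSec} seconds.`,
      retryAfter: retryAfterSec,
    },
  }
}
```

Calling it with a limit of 100, a wait of 46,500 ms, and a clock of 1,744,380,013,500 ms reproduces the `Retry-After: 47` and `X-RateLimit-Reset: 1744380060` values shown in the example response.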

Client-side exponential backoff (with jitter):

// utils/fetchWithRetry.ts
interface RetryOptions {
  maxRetries?: number
  baseDelayMs?: number
  maxDelayMs?: number
  jitter?: boolean
}

async function fetchWithRetry(
  url: string,
  init?: RequestInit,
  options: RetryOptions = {}
): Promise<Response> {
  const { maxRetries = 3, baseDelayMs = 1000, maxDelayMs = 30000, jitter = true } = options

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await fetch(url, init)

    if (response.status !== 429) return response
    if (attempt === maxRetries) return response

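    // Prefer server Retry-After header (delta-seconds form; an HTTP-date value
    // falls through to exponential backoff)
    const retryAfter = response.headers.get('Retry-After')
    let delayMs: number

    if (retryAfter && /^\d+$/.test(retryAfter)) {
      delayMs = parseInt(retryAfter, 10) * 1000
      // Jitter only upward (0 to +25%) so we never retry before the server-specified time
      if (jitter) {
        delayMs = delayMs * (1 + Math.random() * 0.25)
      }
    } else {
      // Exponential backoff: 2^attempt * baseDelay, capped at maxDelayMs
      delayMs = Math.min(Math.pow(2, attempt) * baseDelayMs, maxDelayMs)
      // Add random jitter (±25%) to avoid synchronized retry storms
      if (jitter) {
        delayMs = delayMs * (0.75 + Math.random() * 0.5)
      }
    }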

    console.warn(`Rate limited. Retrying in ${Math.round(delayMs)}ms (attempt ${attempt + 1}/${maxRetries})`)
    await new Promise(resolve => setTimeout(resolve, delayMs))
  }
  throw new Error('Max retries exceeded')
}

HTTP 429 vs 503: 429 means the client exceeded its quota; 503 means the service itself is temporarily unavailable. Keep the two distinct in docs and in client branches.
Proxies and CDNs may strip Retry-After; verify the header end-to-end in integration tests.
Annotate quotas and the 429 response schema in OpenAPI, in line with the API contract skill.

Fairness and distributed caveats

Shared NAT / corporate egress: IP-only limits throttle every user behind the same address; combine the IP with a cookie, JWT, API key, or session.
Paid vs free: tiered quotas and whitelisted ops APIs need auditing so internal routes do not become bypass channels.
Clients: use exponential backoff with jitter, a global client-side limiter, and idempotency keys on retried writes; cap retries so they cannot turn into a retry storm.
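For the idempotency-key part of the client guidance, the important property is that the key is created once per logical request and reused on every retry. A minimal sketch follows; the `Idempotency-Key` header name is a widely used convention, but whether your server honors it is an assumption, and `withIdempotencyKey` is a hypothetical helper.

```typescript
import { randomUUID } from 'node:crypto'

// Returns a copy of the headers that carries an idempotency key.
// An existing key is preserved so retries of the same request send the same value.
// The lookup is case-sensitive; normalize header names first if callers vary the casing.
function withIdempotencyKey(
  headers: Record<string, string> = {},
  key: string = randomUUID()
): Record<string, string> {
  if (headers['Idempotency-Key']) return { ...headers }
  return { ...headers, 'Idempotency-Key': key }
}
```

Build the headers once before the first attempt and pass the same object to each retry; generating a fresh key inside the retry loop would defeat the purpose.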

Retry-After parsing

A Retry-After value is either a non-negative integer (delay in seconds, per RFC 7231) or an HTTP-date; clients should handle both forms. When showing the resulting retry time to a user, convert it to the local timezone.
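The two Retry-After forms described above can be handled in a few lines. This is a minimal sketch with a hypothetical `parseRetryAfter` helper; it relies on `Date.parse`, which accepts the standard HTTP-date format but is more lenient than the specification about other inputs.

```typescript
// Returns the wait in milliseconds, or null when the value cannot be interpreted.
function parseRetryAfter(value: string, nowMs: number = Date.now()): number | null {
  const trimmed = value.trim()
  // Delay-seconds form: a non-negative integer
  if (/^\d+$/.test(trimmed)) return parseInt(trimmed, 10) * 1000
  // HTTP-date form: wait until that instant; a past date means no wait
  const dateMs = Date.parse(trimmed)
  if (Number.isNaN(dateMs)) return null
  return Math.max(0, dateMs - nowMs)
}
```

A client can fall back to exponential backoff when the function returns null.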

---
name: api-rate-limiting
description: Rate-limit dimensions, algorithms, and 429 semantics
---
# Rules
- Rate limit key: authenticated user ID (preferred) or IP — avoid misidentifying shared NAT
- Algorithm: token bucket (allows bursts) using Redis Lua atomic script; sliding window using ZSET
- Login/register: strict rate limiting (5 per 15min); query: relaxed (100 per min)
- Response headers: X-RateLimit-Limit / Remaining / Reset + Retry-After (seconds)
- 429 response body uses the problem+json format of RFC 9457 (which obsoletes RFC 7807)

# Steps
1. Confirm rate limit dimensions (user_id / api_key / ip) and per-endpoint quota policy
2. Implement middleware: call tokenBucketAllow() or slidingWindowAllow()
3. Mount order: IP-keyed limits before body parsing and auth (save resources); user-keyed limits after auth, once the user ID is known
4. Set standard response headers: X-RateLimit-* and Retry-After
5. Client documentation: exponential backoff (2^n * base) + jitter + max retries
6. Monitoring: reject rate, hot keys, per-endpoint 429 ratio and P99 latency
7. Load test validation: rate limiting behavior under 10x traffic meets expectations
