LLM Rate Limits Explained: RPM vs TPM and Why You Get 429 Errors

AI Infrastructure | 8 min read | 2026-05-13

Most teams misunderstand LLM rate limits until traffic hits production. They check requests per minute, see enough headroom, launch the feature, and still get 429 errors. The missing piece is usually tokens per minute.

For AI APIs, rate limiting is not just about how many HTTP requests you send. It is also about how many tokens those requests consume, how long the model streams, how often users retry, and whether your app sends full chat history every turn.

RPM: requests per minute

RPM limits the number of requests your key, user, model, or account can start in a minute. If the limit is 200 RPM, the system may reject request 201 even if every request is tiny.

RPM is easy to understand, but it is not enough for capacity planning because two requests can have completely different sizes. One request might ask for a one-line answer. Another might send a long conversation and generate 4,000 output tokens.
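
If all you track is request starts, a sliding-window counter is the whole picture RPM gives you. The sketch below is a minimal version in Python, mainly to make the limitation concrete: the size of each request never appears anywhere in it.

    import time
    from collections import deque

    class RpmWindow:
        """Count request starts in the trailing 60 seconds. Request size is invisible here."""

        def __init__(self, limit: int = 200):
            self.limit = limit
            self.starts = deque()

        def allow(self) -> bool:
            now = time.monotonic()
            while self.starts and now - self.starts[0] > 60:
                self.starts.popleft()
            if len(self.starts) >= self.limit:
                return False
            self.starts.append(now)
            return True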

TPM: tokens per minute

TPM limits total token throughput. Depending on the provider, this can count input tokens, output tokens, or both. For practical planning, assume both sides matter.

If your model limit is 200,000 TPM and each request uses 2,000 total tokens, the theoretical maximum is around 100 requests per minute before retries. If each request uses 10,000 tokens, the same TPM limit only supports around 20 requests per minute.
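
A quick way to sanity-check this before launch is to divide the TPM limit by your average total tokens per request. A minimal sketch of that arithmetic, using the placeholder numbers from the example above:

    def max_sustainable_rpm(tpm_limit: int, avg_tokens_per_request: int) -> float:
        """Upper bound on requests per minute implied by the TPM limit alone."""
        return tpm_limit / avg_tokens_per_request

    print(max_sustainable_rpm(200_000, 2_000))   # 100.0 requests/min
    print(max_sustainable_rpm(200_000, 10_000))  # 20.0 requests/min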

Why 429 happens even when RPM looks safe

Imagine a chat endpoint with a 200 RPM limit and a 200,000 TPM limit. Your dashboard shows only 80 requests per minute, so it looks safe. But each request sends 1,500 input tokens and gets 2,000 output tokens. That is 3,500 tokens per request.

80 requests/min x 3,500 tokens/request = 280,000 tokens/min

RPM is fine. TPM is not. The result is still 429.
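
You can catch this before the provider does by projecting token throughput from your own traffic and comparing it to both limits. A minimal sketch, assuming you already log average input and output tokens per request:

    def projected_tpm(rpm_observed: float, avg_input_tokens: int, avg_output_tokens: int) -> float:
        """Project tokens per minute from the observed request rate and token averages."""
        return rpm_observed * (avg_input_tokens + avg_output_tokens)

    rpm_limit, tpm_limit = 200, 200_000
    tpm = projected_tpm(80, 1_500, 2_000)
    print(tpm)              # 280000
    print(tpm > tpm_limit)  # True: TPM is the bottleneck even though RPM looks fine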

Streaming does not remove rate limits

Streaming improves user experience because the user sees tokens as they are generated. It does not make the request free, and it does not remove the output from your token budget.

A streaming response that generates 4,000 tokens still consumes 4,000 output tokens. It may also hold a connection open for 20 or 30 seconds, which matters for your own backend and queueing model.
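
If you want visibility into how much of the budget a streamed answer consumes, accumulate the output as it arrives. The sketch below assumes a generic iterable of text chunks and uses a rough four-characters-per-token heuristic rather than a real tokenizer, so treat the numbers as estimates, not billing-grade counts.

    def consume_stream(chunks, approx_chars_per_token: int = 4):
        """Collect streamed text fragments and keep a rough running token estimate."""
        parts, chars = [], 0
        for chunk in chunks:
            parts.append(chunk)
            chars += len(chunk)
        return "".join(parts), chars // approx_chars_per_token

    # A streamed 4,000-token answer still costs roughly 4,000 output tokens
    # against the TPM budget, even though the user saw it arrive incrementally.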

Retries multiply load

Retries are useful when a network call fails before the model starts generating. But blind retries can double traffic. If your frontend retries, your backend retries, and the user clicks regenerate, one user action can become several model requests.

This is why retry factor belongs in capacity planning. A system that looks safe at 1.0x load may fail at 1.3x once retries and regenerate behavior are included.
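
One way to keep retries from multiplying load is to cap attempts, back off with jitter, and honor a Retry-After hint when the provider sends one. A minimal sketch; call_model and RateLimited are hypothetical stand-ins for your own wrapper around the API, not a specific SDK:

    import random
    import time

    class RateLimited(Exception):
        """Hypothetical error your call_model wrapper raises on HTTP 429."""
        def __init__(self, retry_after=None):
            self.retry_after = retry_after

    def call_with_retries(call_model, max_attempts: int = 3):
        for attempt in range(1, max_attempts + 1):
            try:
                return call_model()
            except RateLimited as err:
                if attempt == max_attempts:
                    raise
                # Prefer the provider's Retry-After hint; otherwise back off with jitter.
                delay = err.retry_after or (2 ** attempt + random.uniform(0, 1))
                time.sleep(delay)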

Long chat history changes every turn

Most chat apps resend conversation history with every request. That is normal, because model APIs are stateless. But it means turn 20 can be much more expensive than turn 2.

To control TPM, trim irrelevant messages, summarize older context, and avoid sending logs, tool output, or stale assistant text unless it is needed for the next answer.
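
A common pattern is to keep the system prompt plus the most recent turns that fit inside a token budget and drop everything older. A minimal sketch, assuming messages are role/content dicts and count_tokens is whatever tokenizer matches your model:

    def trim_history(messages, count_tokens, budget: int = 6_000):
        """Keep the system message plus the newest turns that fit within `budget` tokens."""
        system = [m for m in messages if m["role"] == "system"]
        rest = [m for m in messages if m["role"] != "system"]
        kept, used = [], sum(count_tokens(m["content"]) for m in system)
        for msg in reversed(rest):  # walk newest to oldest
            cost = count_tokens(msg["content"])
            if used + cost > budget:
                break
            kept.append(msg)
            used += cost
        return system + list(reversed(kept))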

Concurrency is not capacity

Concurrent users are not the same thing as requests per minute. If 100 users each send one request every 30 seconds, the expected load is roughly 200 requests per minute. If each user sends one request every 10 seconds, the same 100 users become 600 requests per minute.

expected RPM = concurrent users x (60 / average seconds per request cycle)

That is before retries. Add retry factor and long outputs, and the real bottleneck can move quickly.
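
The same formula extends naturally to tokens and retries, which is usually enough for a first capacity estimate. A minimal sketch with the placeholder numbers used above and an assumed 1.3x retry factor:

    def expected_load(users: int, secs_per_cycle: float, avg_in: int, avg_out: int,
                      retry_factor: float = 1.3):
        """Estimate RPM and TPM from concurrency, request cadence, token sizes, and retries."""
        rpm = users * (60 / secs_per_cycle) * retry_factor
        tpm = rpm * (avg_in + avg_out)
        return rpm, tpm

    print(expected_load(100, 30, 1_500, 2_000))  # (260.0, 910000.0)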

How to reduce 429 errors

  • Set max output tokens: Keep answers bounded for user-facing flows.
  • Use queues for bursts: Smooth sudden spikes instead of sending every request at once (see the sketch after this list).
  • Trim context: Send the messages that matter, not the entire session forever.
  • Limit retries: Retry only safe failures and track retry count.
  • Use model-specific limits: Different models may have different practical throughput.
  • Show good errors: Tell users to wait or retry later instead of failing silently.
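
For the queueing point above, even a simple cap on in-flight requests smooths bursts considerably. A minimal asyncio sketch; the concurrency limit of 8 is an assumption to tune against your own RPM and TPM headroom, and call_model stands in for your own async API wrapper:

    import asyncio

    MAX_IN_FLIGHT = 8  # assumed cap; tune against your RPM/TPM headroom
    slots = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def throttled_call(call_model, *args, **kwargs):
        """Queue requests behind a fixed number of in-flight slots."""
        async with slots:
            return await call_model(*args, **kwargs)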

Use the calculator

Before launch, estimate your traffic shape with RPM, TPM, average input tokens, average output tokens, retry factor, and concurrent users.

Use the LLM Rate Limit Calculator to see whether RPM or TPM is your bottleneck. If cost is the bigger question, use the AI API Cost Calculator.

The practical takeaway

If you are getting 429s, do not only ask for a higher RPM. Check output length, retry behavior, full chat history, and TPM. In production LLM apps, tokens per minute often becomes the real limit first.