The Model Fit. vLLM Still Changed the GPU Plan.
Serving Reality | 12 min read | 2026-04-09
A lot of teams test a model in a notebook, see it load, and assume the GPU decision is done. Then they move to vLLM, TGI, or another serving stack and suddenly the same card feels a lot less comfortable. This is not a software bug. This is a sizing problem that shows up the moment your model stops being a solo experiment and starts being a service.
Why this surprises people
Because they size the GPU around the model weights only. That is not how production inference behaves. The serving runtime itself uses memory. Scheduling, prefill behavior, KV cache growth, batching, and concurrency all show up once the model stops being a private test and starts being a service.
Here is the scenario we see constantly: a data scientist loads a 4-bit quantized Llama-3-70B on an A100 80GB using a simple Python script. The model fits. The first prompt returns in 2 seconds. Everyone celebrates. Then the engineering team wraps it in vLLM, connects it to the API gateway, and within an hour of real traffic the GPU hits 98% memory utilization, requests start queueing, and p99 latency jumps from 2 seconds to 14 seconds. The model did not change. The GPU did not change. The workload did.
The mistake is not technical. It is conceptual. People treat "can the model load?" as the only question that matters. In production, that is the easiest question. The hard questions are: how many concurrent requests can this GPU handle? How long are the prompts really? How fast does the KV cache grow? What happens when three users trigger long completions at the same time?
The notebook test proves less than people think
- it proves the model can load
- it proves one narrow prompt can complete
- it does not prove the serving stack has margin
- it does not prove the setup is stable under real traffic
- it does not account for KV cache growth under variable-length prompts
- it does not simulate concurrent users competing for the same GPU
- it does not reveal prefill-vs-decode bottlenecks
The notebook test is a unit test. Production inference is a load test. They measure completely different things.
What changes when vLLM enters the picture
The workload becomes more honest. You are not just holding weights anymore. You are running a scheduler, maintaining active KV cache, and trying to make throughput, latency, and concurrency coexist. That often changes the right GPU tier even when the model itself has not changed at all.
The vLLM memory overhead breakdown
When you run a model through vLLM, the GPU memory is split across several components that do not exist in a simple notebook script:
- Model weights: The static parameters. For a 70B model at FP16, this is ~140GB. At 4-bit quantized, ~35-40GB. This is the only thing most people size for.
- KV cache: The dynamic memory that grows with each active request. For a 70B model, each concurrent request can consume 2-8GB of KV cache depending on prompt length and max tokens. With 10 simultaneous users, that is 20-80GB on top of weights.
- Scheduler overhead: vLLM's internal memory management, block tables, and request queues. Typically 1-3GB depending on configuration.
- Activation memory: Temporary tensors during forward passes. Spikes during prefill, smaller during decode. Usually 2-5GB.
- Python runtime and CUDA context: The baseline overhead of the process itself. ~1-2GB.
Add these up for a 70B model at 4-bit: 40GB (weights) + 40GB (KV cache for 10 users) + 3GB (scheduler) + 4GB (activations) + 2GB (runtime) = 89GB. That exceeds an A100 80GB. The notebook test said it fits. Production says it does not.
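The same sum can be written as a small budget check. This is only a sketch of the arithmetic above: the figures are the rough estimates from this article, not measurements, and the dictionary keys are illustrative names.

```python
# Memory budget for the 70B, 4-bit example above.
# All figures are this article's rough estimates in GB, not measurements.
A100_VRAM_GB = 80

budget_gb = {
    "weights_4bit": 40,
    "kv_cache_10_users": 40,  # 10 users at ~4 GB each
    "scheduler": 3,
    "activations": 4,
    "runtime_and_cuda": 2,
}

total_gb = sum(budget_gb.values())
shortfall_gb = total_gb - A100_VRAM_GB

print(f"needed: {total_gb} GB, available: {A100_VRAM_GB} GB")
if shortfall_gb > 0:
    print(f"over budget by {shortfall_gb} GB")
else:
    print("fits")
```

Swapping in your own estimates for each component makes it obvious which line item is eating the margin.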
| Stage | What feels true | What the real system says later |
|---|---|---|
| Notebook load test | The model fits | That only answers the smallest question |
| Serving stack added | The same model should still be fine | Overhead and cache start eating margin |
| Real prompts and overlapping users | Maybe the runtime is inefficient | The GPU plan was undersized for serving reality |
| Peak traffic hour | We can optimize the settings | Settings help, but they cannot create VRAM that does not exist |
The common bad reaction
People often see the first serving issue and jump straight to the biggest card on the page. That is not always the right answer. Sometimes the runtime settings are bad. Sometimes batch limits are too aggressive. But the opposite mistake is also common: pretending runtime overhead is just a software problem and refusing to admit the GPU plan has no safety margin.
Bad reaction #1: "Just upgrade to H100"
This is the most expensive wrong answer. An H100 80GB costs ₹583/hr in India, roughly 3.4x the ₹173/hr of an A100 80GB. Both cards have 80GB, so the upgrade buys compute and memory bandwidth but no extra VRAM. If your problem is KV cache misconfiguration or an overly aggressive batch size, throwing an H100 at it wastes ₹410/hr on a problem that a settings change would fix for free. We have seen teams upgrade to H100, see marginal improvement, and still hit memory walls because they never fixed their max_num_seqs or gpu_memory_utilization settings.
Bad reaction #2: "It is just a software problem"
The opposite error. Yes, tuning vLLM settings helps. Reducing max_num_batched_tokens from 8192 to 4096 can cut prefill memory spikes. Lowering gpu_memory_utilization from 0.95 to 0.85 shrinks the pool vLLM pre-allocates for weights, activations, and KV cache, which leaves headroom for anything that lands outside that pool, at the cost of a smaller KV cache. But if your model at 4-bit quantization needs 40GB for weights and you have 20 concurrent users each holding 4GB of KV cache, you need 120GB on an 80GB card, and no setting change creates the missing 40GB of VRAM. At some point, the hardware is the bottleneck.
The right reaction: diagnose before you spend
Before changing anything expensive, understand where the pressure is coming from. Is it prefill memory spikes? Is it KV cache saturation? Is it scheduler queueing? Each has a different fix, and only one of them requires a bigger GPU.
What we would check before changing anything expensive
- Real prompt length distribution: Are your users sending 200-token prompts or 8,000-token documents? The difference is 40x in KV cache pressure. Check your actual p50 and p95 prompt lengths, not your test prompts.
- How many requests overlap in practice: Concurrency is the silent VRAM killer. One user at a time is trivial. Five overlapping requests with long prompts is a completely different memory profile. Check your request queue depth during peak hours.
- Whether latency pain comes from prefill, decode, or queueing: These are different problems. Prefill bottlenecks mean your initial token processing is too slow (fix: reduce max_num_batched_tokens). Decode bottlenecks mean token generation is the constraint (fix: optimize batch size or quantization). Queueing bottlenecks mean you have too many requests for your GPU count (fix: add GPUs or implement request shedding).
- How much memory headroom remains after the serving runtime starts behaving like production: Run vLLM with your actual traffic pattern for 30 minutes. Because vLLM pre-allocates its memory pool at startup, nvidia-smi will sit near your gpu_memory_utilization value regardless of load, so watch the KV cache usage that vLLM reports in its logs and metrics instead. If KV cache usage is consistently above 90%, you have no margin for traffic spikes. If it stays at 60-70%, your problem is likely configuration, not hardware.
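To turn a prompt length distribution into KV cache pressure, a rough per-token estimate helps. The sketch below assumes the publicly documented Llama-3-70B shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and an FP16 cache. The prompt lengths are made-up sample data, and the function names are illustrative, not part of vLLM.

```python
import statistics

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: a key and a value per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def kv_gib(tokens, per_token_bytes):
    """KV cache size in GiB for a sequence of `tokens` tokens."""
    return tokens * per_token_bytes / 2**30

# Assumption: Llama-3-70B architecture values with an FP16 KV cache.
PER_TOKEN = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

# Hypothetical prompt lengths (in tokens) pulled from request logs.
prompt_lengths = [180, 220, 250, 400, 650, 900, 1500, 3200, 6000, 8000]

p50 = statistics.median(prompt_lengths)
p95 = statistics.quantiles(prompt_lengths, n=20)[-1]  # last cut point = 95th percentile

for label, tokens in (("p50", p50), ("p95", p95)):
    print(f"{label}: {tokens:.0f} tokens -> {kv_gib(tokens, PER_TOKEN):.2f} GiB of KV cache")
```

Count generated tokens as well as prompt tokens when estimating a live request, since the cache keeps growing during decode. A different model, cache dtype, or attention layout changes the per-token figure, so treat the output as an order-of-magnitude check.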
A better decision rule
Do not size the GPU for "model loaded successfully." Size it for "serving stack stayed calm under realistic traffic." If your margin disappears the moment vLLM enters the picture, the plan was never as safe as it looked.
The serving capacity formula
Here is a practical way to think about it. For any given GPU and model combination:
Available VRAM for KV cache = Total VRAM - Model weights - Runtime overhead - Safety margin
For an A100 80GB running a 70B model at 4-bit:
- Total VRAM: 80GB
- Model weights (4-bit): ~40GB
- Runtime overhead (vLLM + CUDA + activations): ~6GB
- Safety margin (20% of the remaining 34GB, for spikes): ~7GB
- Available for KV cache: ~27GB
If each concurrent user's KV cache is ~3GB (moderate prompt lengths), you can handle ~9 simultaneous users before hitting the wall. If prompts are longer (5-8GB per user), that drops to 3-5 users. This is why the notebook test lied to you — it tested zero concurrency.
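The formula is easy to keep around as a helper. This is a minimal sketch under the same assumptions as the worked example: the safety margin is taken as 20% of whatever remains after weights and runtime overhead, and the function names are illustrative.

```python
import math

def kv_budget_gb(total_vram_gb, weights_gb, runtime_overhead_gb, safety_fraction=0.20):
    """VRAM left for KV cache after weights, runtime overhead, and a safety margin.

    Assumption: the margin is a fraction of the memory remaining after
    weights and overhead, which matches the ~7GB figure above.
    """
    remaining = total_vram_gb - weights_gb - runtime_overhead_gb
    return remaining * (1 - safety_fraction)

def max_concurrent_users(kv_budget, kv_per_user_gb):
    """Whole users that fit in the KV budget."""
    return math.floor(kv_budget / kv_per_user_gb)

budget = kv_budget_gb(total_vram_gb=80, weights_gb=40, runtime_overhead_gb=6)
print(f"KV budget: {budget:.1f} GB")

for per_user in (3, 5, 8):
    users = max_concurrent_users(budget, per_user)
    print(f"{per_user} GB per user -> {users} concurrent users")
```

Plugging in your measured per-user KV cache size, rather than a guess, is what makes the result worth trusting.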
| If your situation looks like this | Likely move |
|---|---|
| One internal user, short prompts, no true serving load | Keep validating before moving up a GPU tier |
| Serving stack added, margin already thin | Tune settings and stop pretending weights are the whole problem |
| Long prompts, concurrency, queueing, unstable latency | Re-evaluate the serving plan and leave more VRAM headroom |
| KV cache at 85%+ utilization during normal traffic | Add a second GPU or reduce max concurrent requests |
| Prefill spikes causing OOM but decode is fine | Lower max_num_batched_tokens and enable chunked prefill |
| Consistent high latency even with low GPU utilization | Check CPU bottleneck, network I/O, or storage read speed |
vLLM settings that actually matter for GPU sizing
Before you change your GPU plan, check these settings. They are the difference between "barely fits" and "runs comfortably" on the same hardware:
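- gpu_memory_utilization (default: 0.90): The fraction of GPU memory vLLM reserves up front for weights, activations, and KV cache. Setting this to 0.95 sounds efficient but leaves almost nothing for memory that lands outside that budget, such as allocator fragmentation or other processes sharing the card. We recommend 0.80-0.85 for production workloads with variable traffic, accepting a somewhat smaller KV cache in exchange for headroom.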
- max_num_seqs (default: 256 in many vLLM versions; check yours, since newer releases may pick a different value): Maximum number of sequences the scheduler will handle concurrently. This is often way too high for a single GPU. For a 70B model on A100 80GB, try 16-32 and monitor memory.
- max_num_batched_tokens (default: varies): Controls the maximum tokens processed in a single prefill batch. High values cause memory spikes. Start with 4096 and increase only if throughput is insufficient.
- enable_chunked_prefill: Splits long prompts into smaller chunks to reduce memory spikes during the prefill phase. Enable this if you see OOM errors with long documents.
- quantization: Running at 4-bit (AWQ/GPTQ) instead of FP16 can reduce model weight memory by 75%. This is the single most effective way to fit larger models on smaller GPUs, but it does come with a small quality tradeoff.
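As a starting point, the settings above map onto `vllm serve` command-line flags. The sketch below only assembles the command so it can be reviewed or logged before launch. The values are illustrative picks from the ranges discussed here, the model name is a placeholder, and flag names and defaults can change between vLLM versions, so check them against the release you run.

```python
import shlex

# Hypothetical starting values taken from the ranges discussed above.
settings = {
    "gpu-memory-utilization": 0.85,
    "max-num-seqs": 24,
    "max-num-batched-tokens": 4096,
    "enable-chunked-prefill": True,
    "quantization": "awq",
}

def build_serve_command(model, settings):
    """Turn a settings dict into an argv list for `vllm serve`."""
    argv = ["vllm", "serve", model]
    for flag, value in settings.items():
        if value is True:
            argv.append(f"--{flag}")  # boolean switches take no value
        elif value is not False and value is not None:
            argv += [f"--{flag}", str(value)]
    return argv

MODEL = "your-org/your-70b-awq-model"  # placeholder, not a real repository
command = build_serve_command(MODEL, settings)
print(shlex.join(command))
```

Keeping the settings in one dictionary makes it easy to diff a tuned configuration against the previous one when memory behavior changes.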
Real cost example: tuning vs upgrading
Let us walk through a real scenario. A team is running a 4-bit quantized Llama-3-70B on an A100 80GB (₹173/hr) with default vLLM settings. They are hitting OOM errors during peak traffic and considering an upgrade to H100 (₹583/hr).
Option A: Upgrade to H100
- New hourly cost: ₹583/hr (vs ₹173/hr)
- Monthly cost (24/7, 30-day month): ₹4,19,760 (vs ₹1,24,560)
- Additional monthly cost: ₹2,95,200
- Result: More compute and memory bandwidth but the same 80GB of VRAM, so the same configuration problems may still cause issues
Option B: Tune vLLM settings on existing A100
- gpu_memory_utilization: 0.95 → 0.82 (leaves ~10GB of headroom outside vLLM's pre-allocated pool)
- max_num_seqs: 256 → 24 (reduces scheduler pressure)
- max_num_batched_tokens: 8192 → 4096 (caps the size of prefill spikes)
- enable_chunked_prefill: false → true (handles long prompts gracefully)
- Additional monthly cost: ₹0
- Result: Same GPU, stable memory usage, no OOM errors
In this scenario, tuning saves ₹2,95,200/month. The H100 upgrade would have masked the configuration problem temporarily but not solved it. We have seen teams upgrade to H100 and still hit OOM because they kept max_num_seqs at 256.
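The monthly figures follow directly from the hourly rates. A quick sketch of the arithmetic, assuming 24/7 usage and a 30-day month:

```python
HOURS_PER_MONTH = 24 * 30  # assumption: always-on, 30-day month

def monthly_cost(rate_per_hour):
    """Monthly cost in rupees for a GPU billed hourly and running 24/7."""
    return rate_per_hour * HOURS_PER_MONTH

a100_monthly = monthly_cost(173)
h100_monthly = monthly_cost(583)
extra_monthly = h100_monthly - a100_monthly

print(f"A100 80GB: Rs {a100_monthly:,}/month")
print(f"H100 80GB: Rs {h100_monthly:,}/month")
print(f"Extra cost of upgrading: Rs {extra_monthly:,}/month")
```

If the GPU runs only during business hours, replace HOURS_PER_MONTH with your actual billed hours before comparing options.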
When you actually do need a bigger GPU
Tuning is not always the answer. Here are the situations where upgrading is the right call:
- You need more concurrent users than one GPU can handle: If you have tuned settings and still need 50+ simultaneous requests, add a second GPU. Do not upgrade to a bigger single GPU — horizontal scaling is more cost-effective than vertical scaling for concurrency.
- Your model is too large even at 4-bit: A 405B model at 4-bit needs roughly 200-240GB once quantization overhead is included. No single 80GB card can hold this. You need multi-GPU tensor parallelism regardless of settings.
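- You need FP16 precision for quality: If your use case cannot tolerate 4-bit quantization (medical, legal, financial applications), you need VRAM for the full FP16 weights. For a 70B model that is ~140GB, which no single 80GB card holds, whether A100 or H100, so this becomes a multi-GPU conversation rather than a swap to a faster 80GB card.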
- Your prompt lengths are genuinely massive: If your users regularly send 30,000+ token documents, the KV cache will be large regardless of tuning. Consider a GPU with more VRAM or implement prompt truncation strategies.
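For the cases that need several GPUs, a first guess at the tensor-parallel size can be derived from weight size alone. This is a rough sketch under stated assumptions: each card contributes 90% of its VRAM, and the GPU count is rounded up to a power of two because the tensor-parallel size has to divide the model's attention head count evenly. The function name is illustrative. The result is a floor for holding the weights, and KV cache needs room on top of it.

```python
import math

def min_tensor_parallel_size(weights_gb, vram_per_gpu_gb, usable_fraction=0.90):
    """Smallest power-of-two GPU count whose usable VRAM holds the weights.

    Assumptions: `usable_fraction` of each card is available to the server,
    and power-of-two sizes are used so the count divides the attention heads.
    """
    usable_per_gpu = vram_per_gpu_gb * usable_fraction
    needed = max(1, math.ceil(weights_gb / usable_per_gpu))
    return 1 << (needed - 1).bit_length()  # round up to the next power of two

# Weight sizes below are this article's rough estimates.
for label, weights_gb in (("70B at 4-bit", 40), ("70B at FP16", 140), ("405B at 4-bit", 240)):
    gpus = min_tensor_parallel_size(weights_gb, vram_per_gpu_gb=80)
    print(f"{label}: at least {gpus} x 80GB GPU(s) for weights alone")
```

Run the KV cache budget for the chosen GPU count afterwards: a configuration that barely holds the weights has the same thin margin this article warns about.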
The key insight
The model fitting in a notebook is the starting line, not the finish line. Production inference is a different workload with different memory characteristics. vLLM does not change the model — it reveals whether your GPU plan was sized for reality or for a demo.
Before you upgrade, tune. Before you tune, measure. Before you measure, understand what you are actually measuring. The notebook test tells you if the model loads. vLLM under real traffic tells you if your GPU plan works. These are not the same question.