The Model Fit. vLLM Still Changed the GPU Plan.

Serving Reality | 12 min read | 2026-04-09

A lot of teams test a model in a notebook, see it load, and assume the GPU decision is done. Then they move to vLLM, TGI, or another serving stack and suddenly the same card feels a lot less comfortable. This is not a software bug. This is a sizing problem that shows up the moment your model stops being a solo experiment and starts being a service.

Why this surprises people

Because they size the GPU around the model weights only. That is not how production inference behaves. The serving runtime itself uses memory. Scheduling, prefill behavior, KV cache growth, batching, and concurrency all show up once the model stops being a private test and starts being a service.

Here is the scenario we see constantly: a data scientist loads a 4-bit quantized Llama-3-70B on an A100 80GB using a simple Python script. The model fits. The first prompt returns in 2 seconds. Everyone celebrates. Then the engineering team wraps it in vLLM, connects it to the API gateway, and within an hour of real traffic the GPU hits 98% memory utilization, requests start queueing, and p99 latency jumps from 2 seconds to 14 seconds. The model did not change. The GPU did not change. The workload did.

The mistake is not technical. It is conceptual. People treat "can the model load?" as the only question that matters. In production, that is the easiest question. The hard questions are: how many concurrent requests can this GPU handle? How long are the prompts really? How fast does the KV cache grow? What happens when three users trigger long completions at the same time?

The notebook test proves less than people think

it proves the model can load
it proves one narrow prompt can complete
it does not prove the serving stack has margin
it does not prove the setup is stable under real traffic
it does not account for KV cache growth under variable-length prompts
it does not simulate concurrent users competing for the same GPU
it does not reveal prefill-vs-decode bottlenecks

The notebook test is a unit test. Production inference is a load test. They measure completely different things.

What changes when vLLM enters the picture

The workload becomes more honest. You are not just holding weights anymore. You are running a scheduler, maintaining active KV cache, and trying to make throughput, latency, and concurrency coexist. That often changes the right GPU tier even when the model itself has not changed at all.

The vLLM memory overhead breakdown

When you run a model through vLLM, the GPU memory is split across several components that do not exist in a simple notebook script:

Model weights: The static parameters. For a 70B model at FP16, this is ~140GB. At 4-bit quantized, ~35-40GB. This is the only thing most people size for.
KV cache: The dynamic memory that grows with each active request. For a 70B model, each concurrent request can consume 2-8GB of KV cache depending on prompt length and max tokens. With 10 simultaneous users, that is 20-80GB on top of weights.
Scheduler overhead: vLLM's internal memory management, block tables, and request queues. Typically 1-3GB depending on configuration.
Activation memory: Temporary tensors during forward passes. Spikes during prefill, smaller during decode. Usually 2-5GB.
Python runtime and CUDA context: The baseline overhead of the process itself. ~1-2GB.

Add these up for a 70B model at 4-bit: 40GB (weights) + 40GB (KV cache for 10 users) + 3GB (scheduler) + 4GB (activations) + 2GB (runtime) = 89GB. That exceeds an A100 80GB. The notebook test said it fits. Production says it does not.
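The sum above can be written out as a small budget check. This is a minimal sketch that uses the article's round numbers as inputs; the dictionary keys and function names are illustrative and are not part of vLLM.

```python
# Illustrative memory budget using the round numbers from the text above.
# All figures are in GB and are estimates, not measurements.
BUDGET_GB = {
    "weights_4bit": 40,
    "kv_cache_10_users": 40,  # 10 users x ~4 GB each
    "scheduler": 3,
    "activations": 4,
    "runtime_and_cuda": 2,
}


def total_required_gb(budget):
    """Sum every component of the serving memory budget."""
    return sum(budget.values())


def shortfall_gb(budget, gpu_vram_gb):
    """Return how many GB are missing on the given card (0 if it fits)."""
    return max(0, total_required_gb(budget) - gpu_vram_gb)


if __name__ == "__main__":
    print("required:", total_required_gb(BUDGET_GB), "GB")
    print("shortfall on an 80 GB card:", shortfall_gb(BUDGET_GB, 80), "GB")
```

Swapping in your own per-component estimates is the point of the exercise: the weights line rarely moves, while the KV cache line changes with every assumption about users and prompt length.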

Stage	What feels true	What the real system says later
Notebook load test	The model fits	That only answers the smallest question
Serving stack added	The same model should still be fine	Overhead and cache start eating margin
Real prompts and overlapping users	Maybe the runtime is inefficient	The GPU plan was undersized for serving reality
Peak traffic hour	We can optimize the settings	Settings help, but they cannot create VRAM that does not exist

The common bad reaction

People often see the first serving issue and jump straight to the biggest card on the page. That is not always the right answer. Sometimes the runtime settings are bad. Sometimes batch limits are too aggressive. But the opposite mistake is also common: pretending runtime overhead is just a software problem and refusing to admit the GPU plan has no safety margin.

Bad reaction #1: "Just upgrade to H100"

This is the most expensive wrong answer. An H100 80GB costs ₹583/hr in India, roughly 3.4x the price of an A100 80GB at ₹173/hr. If your problem is KV cache misconfiguration or an overly aggressive batch size, throwing an H100 at it wastes ₹410/hr on a problem that a settings change would fix for free. We have seen teams upgrade to H100, see marginal improvement, and still hit memory walls because they never fixed their max_num_seqs or gpu_memory_utilization settings.

Bad reaction #2: "It is just a software problem"

The opposite error. Yes, tuning vLLM settings helps. Reducing max_num_batched_tokens from 8192 to 4096 can cut prefill memory spikes. Lowering gpu_memory_utilization from 0.95 to 0.85 leaves headroom outside vLLM's pre-allocated pool for allocations it does not track, at the cost of a smaller KV cache. But if your model at 4-bit quantization needs 40GB for weights and you have 20 concurrent users each holding 4GB of KV cache, that is 120GB of demand on an 80GB card, and no setting change creates the missing 40GB of VRAM. At some point, the hardware is the bottleneck.

The right reaction: diagnose before you spend

Before changing anything expensive, understand where the pressure is coming from. Is it prefill memory spikes? Is it KV cache saturation? Is it scheduler queueing? Each has a different fix, and only one of them requires a bigger GPU.

What we would check before changing anything expensive

Real prompt length distribution: Are your users sending 200-token prompts or 8,000-token documents? The difference is 40x in KV cache pressure. Check your actual p50 and p95 prompt lengths, not your test prompts.
How many requests overlap in practice: Concurrency is the silent VRAM killer. One user at a time is trivial. Five overlapping requests with long prompts is a completely different memory profile. Check your request queue depth during peak hours.
Whether latency pain comes from prefill, decode, or queueing: These are different problems. Prefill bottlenecks mean your initial token processing is too slow (fix: reduce max_num_batched_tokens). Decode bottlenecks mean token generation is the constraint (fix: optimize batch size or quantization). Queueing bottlenecks mean you have too many requests for your GPU count (fix: add GPUs or implement request shedding).
How much memory headroom remains after the serving runtime starts behaving like production: Run vLLM with your actual traffic pattern for 30 minutes. Because vLLM pre-allocates its memory pool at startup, nvidia-smi tends to show a roughly flat number, so also watch the KV cache usage that vLLM reports in its logs and metrics. If KV cache usage is consistently above 90%, you have no margin for traffic spikes. If it sits at 60-70%, your problem is likely configuration, not hardware.
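To make the first check concrete, here is a rough sketch that takes logged prompt lengths, finds the p50 and p95, and converts them into an approximate per-request KV cache footprint. The architecture constants are assumptions for a Llama-3-70B-style model with grouped-query attention, and the sample prompt lengths are made up for illustration. It counts prompt tokens only; generated tokens add to the same cache.

```python
import math

# Assumed architecture values for a Llama-3-70B-style model with grouped-query
# attention; confirm them against your model's config before relying on this.
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
KV_DTYPE_BYTES = 2  # FP16/BF16 cache entries


def kv_bytes_per_token():
    """Keys and values (the factor of 2) for every layer and KV head."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_DTYPE_BYTES


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with integer pct in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def kv_gib_for_tokens(num_tokens):
    """Approximate KV cache footprint in GiB for one request of this length."""
    return num_tokens * kv_bytes_per_token() / 2**30


if __name__ == "__main__":
    # Hypothetical prompt lengths (in tokens) pulled from request logs.
    prompt_lengths = [180, 220, 250, 400, 950, 1200, 3000, 6500, 8000, 8200]
    for pct in (50, 95):
        tokens = percentile(prompt_lengths, pct)
        print(f"p{pct}: {tokens} tokens, ~{kv_gib_for_tokens(tokens):.2f} GiB of KV cache")
```

Under these assumptions the cache costs about 0.3 MiB per token, so the gap between the median and the tail of your prompt distribution translates directly into the gap between a comfortable card and a saturated one. Models without grouped-query attention, or longer contexts, land noticeably higher.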

A better decision rule

Do not size the GPU for "model loaded successfully." Size it for "serving stack stayed calm under realistic traffic." If your margin disappears the moment vLLM enters the picture, the plan was never as safe as it looked.

The serving capacity formula

Here is a practical way to think about it. For any given GPU and model combination:

Available VRAM for KV cache = Total VRAM - Model weights - Runtime overhead - Safety margin

For an A100 80GB running a 70B model at 4-bit:

Total VRAM: 80GB
Model weights (4-bit): ~40GB
Runtime overhead (vLLM + CUDA + activations): ~6GB
Safety margin (20% of the remaining 34GB, for spikes): ~7GB
Available for KV cache: ~27GB

If each concurrent user's KV cache is ~3GB (moderate prompt lengths), you can handle ~9 simultaneous users before hitting the wall. If prompts are longer (5-8GB per user), that drops to 3-5 users. This is why the notebook test lied to you — it tested zero concurrency.
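The same arithmetic as a small sketch, so you can plug in your own card and model. The function names are illustrative, and the margin is assumed to be a fraction of what remains after weights and runtime overhead, which is how the ~7GB figure above works out (20% of 34GB).

```python
def kv_cache_budget_gb(total_vram_gb, weights_gb, runtime_gb, margin_fraction):
    """VRAM left for KV cache after weights, runtime overhead, and a safety margin.

    The margin is applied to what remains after weights and runtime overhead.
    """
    remaining = total_vram_gb - weights_gb - runtime_gb
    return remaining * (1 - margin_fraction)


def max_concurrent_users(budget_gb, kv_per_user_gb):
    """Whole number of requests whose KV cache fits inside the budget."""
    return int(budget_gb // kv_per_user_gb)


if __name__ == "__main__":
    # A100 80GB, 70B model at 4-bit, using the estimates from the text.
    budget = kv_cache_budget_gb(80, 40, 6, 0.20)
    print(f"KV cache budget: ~{budget:.0f} GB")
    for per_user_gb in (3, 5, 8):
        users = max_concurrent_users(budget, per_user_gb)
        print(f"{per_user_gb} GB per user -> {users} concurrent users")
```

The per-user figure is the input worth arguing about. Everything else in the formula is close to fixed once the card and the quantization are chosen.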

If your situation looks like this	Likely move
One internal user, short prompts, no true serving load	Keep validating before moving up a GPU tier
Serving stack added, margin already thin	Tune settings and stop pretending weights are the whole problem
Long prompts, concurrency, queueing, unstable latency	Re-evaluate the serving plan and leave more VRAM headroom
KV cache at 85%+ utilization during normal traffic	Add a second GPU or reduce max concurrent requests
Prefill spikes causing OOM but decode is fine	Lower max_num_batched_tokens and enable chunked prefill
Consistent high latency even with low GPU utilization	Check CPU bottleneck, network I/O, or storage read speed

vLLM settings that actually matter for GPU sizing

Before you change your GPU plan, check these settings. They are the difference between "barely fits" and "runs comfortably" on the same hardware:

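gpu_memory_utilization (default: 0.90): Controls the fraction of GPU memory vLLM reserves for weights, activations, and its pre-allocated KV cache. Setting this to 0.95 sounds efficient but leaves almost no headroom for anything outside that pool, such as the CUDA context, fragmentation, or other processes on the card. We recommend 0.80-0.85 for production workloads with variable traffic.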
max_num_seqs (default: 256): Maximum number of sequences the scheduler will handle concurrently. This is often way too high for a single GPU. For a 70B model on A100 80GB, try 16-32 and monitor memory.
max_num_batched_tokens (default: varies): Controls the maximum tokens processed in a single prefill batch. High values cause memory spikes. Start with 4096 and increase only if throughput is insufficient.
enable_chunked_prefill: Splits long prompts into smaller chunks to reduce memory spikes during the prefill phase. Enable this if you see OOM errors with long documents.
quantization: Running at 4-bit (AWQ/GPTQ) instead of FP16 can reduce model weight memory by 75%. This is the single most effective way to fit larger models on smaller GPUs, but it does come with a small quality tradeoff.
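As a starting point, the settings above can be collected in one place and rendered into a `vllm serve` command. This is a sketch, not a recommended configuration: the model identifier is a hypothetical placeholder, the values are the conservative ones discussed in this section, and flag names and defaults can differ between vLLM versions, so check them against the version you run.

```python
import shlex

# Hypothetical model identifier; substitute the checkpoint you actually serve.
MODEL = "your-org/llama-3-70b-awq"

# Conservative starting values taken from the settings discussed above.
SETTINGS = {
    "gpu-memory-utilization": 0.85,
    "max-num-seqs": 24,
    "max-num-batched-tokens": 4096,
    "quantization": "awq",
    "enable-chunked-prefill": True,  # boolean flag, rendered without a value
}


def build_serve_command(model, settings):
    """Render a settings dict into an argument list for `vllm serve`."""
    args = ["vllm", "serve", model]
    for name, value in settings.items():
        if value is True:
            args.append(f"--{name}")
        elif value is False or value is None:
            continue  # omit disabled or unset options entirely
        else:
            args.extend([f"--{name}", str(value)])
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_serve_command(MODEL, SETTINGS)))
```

Keeping the settings in a dict makes it easy to change one value at a time between load tests and to record exactly which configuration produced which memory profile.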

Real cost example: tuning vs upgrading

Let us walk through a real scenario. A team is running Llama-3-70B on an A100 80GB (₹173/hr) with untuned vLLM settings. They are hitting OOM errors during peak traffic and considering an upgrade to H100 (₹583/hr).

Option A: Upgrade to H100

New hourly cost: ₹583/hr (vs ₹173/hr)
Monthly cost (24/7): ₹4,19,760 (vs ₹1,24,560)
Additional monthly cost: ₹2,95,200
Result: The same 80GB of VRAM with more compute and memory bandwidth, so the same configuration problems may still cause OOM errors

Option B: Tune vLLM settings on existing A100

gpu_memory_utilization: 0.95 → 0.82 (leaves ~10GB of headroom outside vLLM's pre-allocated pool)
max_num_seqs: 256 → 24 (reduces scheduler pressure)
max_num_batched_tokens: 8192 → 4096 (eliminates prefill spikes)
enable_chunked_prefill: false → true (handles long prompts gracefully)
Additional monthly cost: ₹0
Result: Same GPU, stable memory usage, no OOM errors

In this scenario, tuning saves ₹2,95,200/month. The H100 upgrade would have masked the configuration problem temporarily but not solved it. We have seen teams upgrade to H100 and still hit OOM because they kept max_num_seqs at 256.
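The monthly figures follow from the hourly rates quoted above. A short sketch of the arithmetic, assuming a 30-day month of continuous use; the helper names are illustrative.

```python
HOURS_PER_MONTH = 24 * 30  # assumes a 30-day month of 24/7 use

A100_INR_PER_HOUR = 173
H100_INR_PER_HOUR = 583


def monthly_cost_inr(hourly_rate_inr, hours=HOURS_PER_MONTH):
    """Cost of keeping one GPU running for the whole month."""
    return hourly_rate_inr * hours


def format_inr(amount):
    """Format a non-negative integer with Indian digit grouping (e.g. 4,19,760)."""
    digits = str(amount)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    return ",".join(groups + [tail])


if __name__ == "__main__":
    a100 = monthly_cost_inr(A100_INR_PER_HOUR)
    h100 = monthly_cost_inr(H100_INR_PER_HOUR)
    print("A100 monthly:", format_inr(a100))
    print("H100 monthly:", format_inr(h100))
    print("Extra cost of upgrading:", format_inr(h100 - a100))
```

If the service does not run around the clock, pass your real monthly hours instead of the default and the gap shrinks in proportion.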

When you actually do need a bigger GPU

Tuning is not always the answer. Here are the situations where upgrading is the right call:

You need more concurrent users than one GPU can handle: If you have tuned settings and still need 50+ simultaneous requests, add a second GPU. Do not upgrade to a bigger single GPU — horizontal scaling is more cost-effective than vertical scaling for concurrency.
Your model is too large even at 4-bit: A 405B model at 4-bit needs roughly 200-240GB once quantization overhead is included. No single 80GB GPU can hold this. You need multi-GPU tensor parallelism regardless of settings.
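You need FP16 precision for quality: If your use case cannot tolerate 4-bit quantization (medical, legal, financial applications), you need the full VRAM for FP16 weights. For a 70B model that is ~140GB, which fits on neither an A100 80GB nor an H100 80GB, so this becomes a multi-GPU conversation rather than a choice between two 80GB cards.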
Your prompt lengths are genuinely massive: If your users regularly send 30,000+ token documents, the KV cache will be large regardless of tuning. Consider a GPU with more VRAM or implement prompt truncation strategies.
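A quick way to sanity-check the cases above is to compute the weights-only footprint and the minimum number of cards that could hold it. This sketch deliberately ignores KV cache, activations, and runtime overhead, so treat its output as a lower bound. Real quantized checkpoints also come out somewhat larger than the raw figure because of scales, zero points, and layers left unquantized.

```python
import math


def weight_footprint_gb(params_billions, bits_per_param):
    """Weights-only size in GB: parameters x bits / 8, with 1 GB = 1e9 bytes."""
    return params_billions * bits_per_param / 8


def min_gpus_for_weights(params_billions, bits_per_param, gpu_vram_gb=80):
    """Smallest GPU count whose combined VRAM holds the weights alone."""
    return math.ceil(weight_footprint_gb(params_billions, bits_per_param) / gpu_vram_gb)


if __name__ == "__main__":
    cases = [("70B at FP16", 70, 16), ("70B at 4-bit", 70, 4), ("405B at 4-bit", 405, 4)]
    for label, params, bits in cases:
        size = weight_footprint_gb(params, bits)
        gpus = min_gpus_for_weights(params, bits)
        print(f"{label}: ~{size:g} GB of weights, at least {gpus} x 80GB GPU(s)")
```

In practice the usable count is higher than this floor: the serving budget still has to be added on top, and tensor-parallel sizes are usually chosen as powers of two.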

The key insight

The model fitting in a notebook is the starting line, not the finish line. Production inference is a different workload with different memory characteristics. vLLM does not change the model — it reveals whether your GPU plan was sized for reality or for a demo.

Before you upgrade, tune. Before you tune, measure. Before you measure, understand what you are actually measuring. The notebook test tells you if the model loads. vLLM under real traffic tells you if your GPU plan works. These are not the same question.
