My Training Crashed at 3 AM and the GPU Kept Billing
GPU Cost | 10 min read | 2026-04-22
The training script died at epoch 47 of 50. You woke up at 8 AM to find the GPU still running, still billing, and still producing nothing. This has happened to almost everyone who rents GPUs for training. The question is not whether it will happen to you. The question is how much it will cost when it does.
The Problem Nobody Talks About
Everyone plans their GPU rental around the happy path: script starts, runs for 6 hours, finishes cleanly, you shut down the pod. The reality is messier. OOM errors, network drops, CUDA driver crashes, dataset loading failures, and silent hangs kill training jobs all the time. And the GPU keeps billing through every single one of them.
The uncomfortable truth is that most GPU rental providers treat a running pod as a running job. As long as the container is alive, the meter runs. It does not matter if your training script crashed in the first three minutes. It does not matter if the GPU has been sitting at zero utilization for four hours. It does not matter if you are asleep and cannot respond to alerts. The pod is alive, so you pay.
This is not a provider conspiracy. It is a structural problem with how cloud GPU rental works. The provider has no way to know whether your job is making progress or sitting dead inside a live container. That gap between "container alive" and "job actually working" is where your money disappears.
How the Bill Creeps Up
The silent hang
Your training loop stops making progress but the process is still alive. No error, no crash, no exception in the logs. Just zero GPU utilization while the meter keeps running. This is the worst kind of failure because it leaves no trace until you check hours later.
Silent hangs happen for many reasons: a deadlock in the data loader, a network timeout waiting for a checkpoint to download, a CUDA context that froze but did not crash, or a distributed training worker that lost sync and is waiting for peers that will never respond.
Typical cost: 2-8 hours of billing for zero work
Why it hurts: You only discover it when you check the logs or the bill. By then the damage is done and non-refundable.
The OOM death spiral
CUDA runs out of memory at epoch 47. Your script crashes with a clear error message. But the pod is still running because nothing told it to stop. You are paying for a dead process, and you also lost all progress since your last checkpoint.
OOM errors during training are especially painful because they often happen late in the run, after hours of successful computation. The model was working fine, the loss was decreasing, everything looked good — and then a slightly larger batch, a longer sequence in the dataset, or a memory fragmentation issue pushes you over the edge.
Typical cost: Full hour billing for a job that died in minute 3, plus lost training progress
Why it hurts: Most providers bill with a one-hour minimum, so a 3-minute crash still costs 60 minutes. And restarting from an old checkpoint burns even more GPU hours.
The checkpoint gap
You saved checkpoints every 5 epochs. The crash happened at epoch 47. You just lost 2 epochs of training and need to restart from epoch 45, burning more GPU hours to redo work you already paid for once.
This is the hidden multiplier that makes crashes worse than they appear. It is not just the wasted time between crash and discovery. It is the wasted time between your last good checkpoint and the crash, which you now have to recompute on a fresh GPU rental. You pay twice for the same epochs.
Typical cost: 10-20% extra training time from re-running epochs
Why it hurts: The wasted compute is invisible on the bill. You only see the total hours. Nobody itemizes "re-training because checkpoint was too far back."
The network disconnect
Your SSH session drops. Your Jupyter notebook disconnects. The training process might still be running in the background, or it might have died from the disconnection. You cannot tell. You spin up a new session to check, and now you have two pods billing simultaneously while you figure out what happened to the first one.
This is especially common with spot instances and preemptible GPUs, where the provider can reclaim the machine with little warning. The disconnection looks like a network issue, but it is actually the underlying hardware being pulled from under you.
Typical cost: Double billing during investigation + potential preemption loss
Why it hurts: You pay for the ghost pod while troubleshooting, and if it was preempted, you may not get any refund at all.
The Real Math
Here is what happens when you compare the planned training cost against what actually gets billed once crashes, hangs, and restarts enter the picture. These numbers use a typical A100 rental at $1.80/hour as the baseline.
| Scenario | Planned cost | Actual cost | Waste | Wasted time |
|---|---|---|---|---|
| Clean 6-hour training on A100 | $10.80 | $10.80 | $0 | 0 hours |
| Crash at hour 5, pod runs 3 more hours | $10.80 | $14.40 | $3.60 (33%) | 3 hours ghost billing |
| Silent hang for 4 hours overnight | $10.80 | $18.00 | $7.20 (67%) | 4 hours zero utilization |
| OOM + restart from old checkpoint | $10.80 | $13.20 | $2.40 (22%) | 2 epochs re-trained |
| Network disconnect + double pod | $10.80 | $16.20 | $5.40 (50%) | 2 ghost hours + 1 hour of double billing |
| All of the above in one week | $10.80 | $25.20 | $14.40 (133%) | More wasted than worked |
The last row is not hypothetical. It is what happens when you run training jobs for a week without any crash protection. The waste does not just add up — it multiplies, because each failure forces you to re-run work you already paid for.
Why This Keeps Happening
The root cause is not malicious providers or broken infrastructure. It is a mismatch between how training jobs behave and how GPU rental billing works. Training is a long-running, stateful process that can fail in many different ways. GPU rental billing is a simple on-off meter that does not care about the state inside the container.
This mismatch creates a blind spot. The provider sees a running container and bills accordingly. You see a crashed job and expect billing to stop. Neither side is wrong. They are just operating on different assumptions about what "running" means.
The gap gets wider when you consider that many training failures are silent. An OOM crash produces a clear error. A silent hang produces nothing. A network disconnect produces confusion about whether the job is still alive. Each of these failure modes requires a different detection strategy, and most people do not implement any of them until after the first expensive surprise.
The failure modes that cost the most
- Silent hangs — zero utilization, no error, bills keep running for hours
- Late-stage OOM — crashes near the end, wastes the most progress, forces expensive re-runs
- Checkpoint gaps — not a billing issue directly, but every crash forces you to re-buy GPU time for epochs you already paid for once
- Ghost pods — disconnected sessions that keep running and billing while you troubleshoot
- Spot preemptions — the provider reclaims the GPU, your job dies, and you may not get a refund
How to Protect Yourself
1. Set GPU utilization alerts
If GPU utilization drops below 10% for more than 5 minutes, something is wrong. Set up a monitoring script that checks nvidia-smi and sends you a notification. This catches silent hangs before they burn hours of billing.
A simple bash loop can do this: check nvidia-smi every 60 seconds, parse the utilization percentage, and send a webhook or email if it stays below your threshold for more than 5 consecutive readings. Tools like Prometheus with a GPU exporter give you more sophisticated monitoring, but even a 20-line script catches the worst failures.
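As a concrete sketch, here is roughly what that monitor could look like in Python rather than bash, using only the standard library. The webhook URL, threshold, and intervals are placeholders to swap for your own alerting setup.

```python
import json
import subprocess
import time
import urllib.request

WEBHOOK_URL = "https://example.com/alerts"  # placeholder: your Slack/Discord/email webhook
THRESHOLD_PCT = 10       # utilization below this counts as idle
CHECK_INTERVAL_S = 60    # one reading per minute
MAX_IDLE_READINGS = 5    # alert after ~5 consecutive idle minutes

def gpu_utilization() -> int:
    """Return the current GPU utilization in percent, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    # With multiple GPUs nvidia-smi prints one line per device; take the busiest one.
    return max(int(line) for line in out.strip().splitlines())

def send_alert(message: str) -> None:
    """POST a JSON payload to the webhook; the exact shape depends on your service."""
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": message}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

idle_readings = 0
while True:
    idle_readings = idle_readings + 1 if gpu_utilization() < THRESHOLD_PCT else 0
    if idle_readings >= MAX_IDLE_READINGS:
        send_alert(f"GPU idle for {idle_readings} minutes - training may have hung")
        idle_readings = 0  # reset so you are not re-alerted every minute
    time.sleep(CHECK_INTERVAL_S)
```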
Cost to set up: 15 minutes for a basic script, 1-2 hours for proper monitoring
Savings: Catches silent hangs within 5 minutes instead of 5 hours. On an A100 at $1.80/hour, that is $9 saved per incident.
2. Checkpoint aggressively
Save checkpoints every epoch, not every 5. The storage cost is tiny compared to the GPU time you will waste re-running training. Use network volumes so checkpoints survive pod termination.
The math is simple: if a checkpoint takes 30 seconds to save and your GPU costs $1.80/hour, each checkpoint costs $0.015 of GPU time. If losing one epoch means re-running 20 minutes of training, that re-run costs $0.60, the price of 40 checkpoint saves. Saving every epoch is almost always cheaper than losing progress.
Store checkpoints on a persistent volume, not inside the container. If the pod dies, the checkpoint survives. If you are using a provider that supports network volumes, this is a no-brainer. If not, sync checkpoints to external storage (S3, GCS, or even a simple rsync to another machine) every few epochs.
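For a PyTorch run, the per-epoch checkpoint logic could look something like the sketch below. The /workspace/checkpoints path stands in for wherever your network volume is mounted, and the saved fields are the minimum needed to resume; adapt both to your own script.

```python
import os
import torch

# Assumed mount point of a network volume that outlives the pod.
CKPT_DIR = "/workspace/checkpoints"
os.makedirs(CKPT_DIR, exist_ok=True)

def save_checkpoint(model, optimizer, epoch):
    """Write everything needed to resume: weights, optimizer state, epoch counter."""
    path = os.path.join(CKPT_DIR, f"epoch_{epoch:04d}.pt")
    tmp = path + ".tmp"
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        tmp,
    )
    os.replace(tmp, path)  # atomic rename: a crash mid-save never corrupts the file

def load_latest_checkpoint(model, optimizer):
    """Resume from the newest checkpoint, or return epoch 0 if none exists."""
    files = sorted(f for f in os.listdir(CKPT_DIR) if f.endswith(".pt"))
    if not files:
        return 0
    ckpt = torch.load(os.path.join(CKPT_DIR, files[-1]))
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1

# In the training loop: checkpoint at the end of every epoch, not every fifth.
# start_epoch = load_latest_checkpoint(model, optimizer)
# for epoch in range(start_epoch, num_epochs):
#     train_one_epoch(model, optimizer, loader)
#     save_checkpoint(model, optimizer, epoch)
```

The write-to-temp-file-then-rename step is deliberate: if the pod dies halfway through a save, the previous checkpoint is still intact.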
Rule of thumb: If a checkpoint takes less than 30 seconds to save, save it every epoch.
Why: Losing 1 epoch of training on an A100 can cost more than a month of checkpoint storage.
3. Auto-shutdown on crash
Wrap your training script in a supervisor that kills the pod if the process exits with an error. Do not let a dead script run on a live GPU.
The simplest pattern is a one-liner: python train.py || shutdown -h now. This ensures that if your script crashes with a non-zero exit code, the pod terminates immediately and billing stops. For more sophisticated setups, use a systemd service with Restart=no and ExecStopPost to clean up, or a Kubernetes job with restartPolicy: Never and a short activeDeadlineSeconds.
For silent hangs, you need something smarter than exit code monitoring. A watchdog process that checks GPU utilization every 60 seconds and kills the pod if it stays below a threshold for 5 minutes catches the failures that exit codes miss.
Simple pattern: python train.py || shutdown -h now
Even better: Use a cron job or systemd service that monitors GPU utilization and stops the pod automatically.
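Putting the two ideas together, a small supervisor can watch both the exit code and the GPU at once. The sketch below assumes that halting the machine is what stops billing on your provider; on pod-based platforms you would swap the shutdown call for their terminate API or CLI.

```python
import subprocess
import time

THRESHOLD_PCT = 10       # below this, the GPU counts as idle
IDLE_LIMIT = 5           # consecutive idle minutes before giving up
CHECK_INTERVAL_S = 60

def gpu_utilization() -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return max(int(line) for line in out.strip().splitlines())

def shut_down(reason: str) -> None:
    print(f"Stopping pod: {reason}", flush=True)
    # Assumption: halting the machine ends billing. Replace with your provider's
    # terminate call if it does not.
    subprocess.run(["shutdown", "-h", "now"])

proc = subprocess.Popen(["python", "train.py"])
idle = 0
while True:
    code = proc.poll()
    if code is not None:                      # the training process has exited
        if code != 0:
            shut_down(f"train.py exited with code {code}")
        break                                 # clean exit: stop supervising
    idle = idle + 1 if gpu_utilization() < THRESHOLD_PCT else 0
    if idle >= IDLE_LIMIT:                    # silent hang: alive but doing nothing
        proc.kill()
        shut_down(f"GPU idle for {IDLE_LIMIT} minutes while train.py was still running")
        break
    time.sleep(CHECK_INTERVAL_S)
```

One caveat: a long CPU-only phase, such as tokenizing a dataset before training starts, also reads as zero GPU utilization, so set the idle limit longer than your longest legitimate idle stretch.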
4. Use per-second billing
If your job crashes after 23 minutes, hourly billing charges you for 60 minutes. Per-second billing charges you for 23 minutes. This alone can save 30-60% on failed runs.
The difference becomes massive when you factor in multiple crashes per week. If your training jobs crash twice a week on average, each 23-minute crash from the example above carries 37 minutes of rounding, or 74 wasted minutes per week on hourly billing. Per-second billing wastes only the actual runtime before the crash. Over a month, that is roughly 5 hours of billing difference on a single training setup.
Per-second billing also changes your risk calculus. When the cost of a crash is proportional to the actual time wasted, you can afford to experiment more aggressively. When hourly billing rounds every crash up to a full hour, you become conservative in ways that slow down your actual progress.
The math: a 23-minute crash billed hourly costs $1.80; billed per second it costs $0.69. That is $1.11 saved per failed run.
Over 10 failed runs: roughly $11 saved. Multiply that across every experiment and every GPU you rent in a month, and the gap keeps growing.
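If you want to re-run these numbers for your own GPU rate and failure pattern, the arithmetic fits in a few lines; the rate and crash time below are just the example values from this post.

```python
import math

RATE_PER_HOUR = 1.80   # the A100 baseline used throughout this post
crash_minutes = 23     # how far the failed run got

hourly_bill = math.ceil(crash_minutes / 60) * RATE_PER_HOUR      # rounds up to a full hour
per_second_bill = crash_minutes * 60 * (RATE_PER_HOUR / 3600)    # pay only for elapsed seconds

print(f"hourly: ${hourly_bill:.2f}, per-second: ${per_second_bill:.2f}, "
      f"saved per failed run: ${hourly_bill - per_second_bill:.2f}")
# hourly: $1.80, per-second: $0.69, saved per failed run: $1.11
```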
5. Set a budget cap with hard shutdown
Define a maximum spend for each training run and enforce it programmatically. If the bill exceeds your cap, the pod shuts down automatically. This prevents a single runaway job from consuming your entire GPU budget.
A simple approach: track the pod start time, calculate elapsed billing, and kill the process when it exceeds your budget. For a $15 budget on an A100 at $1.80/hour, that is 8.3 hours maximum. Set a hard limit at 8 hours with a 30-minute warning.
Implementation: A wrapper script that checks elapsed time every 5 minutes and sends a SIGTERM when the budget cap is reached.
Why this matters: Without a cap, a single silent hang over a weekend can cost more than your entire weekly training budget.
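A bare-bones version of that wrapper might look like the sketch below. The budget and hourly rate are the example numbers from above, and the SIGTERM is the signal your training script should catch to write one last checkpoint before exiting.

```python
import signal
import subprocess
import time

RATE_PER_HOUR = 1.80        # what you pay for the GPU
BUDGET_USD = 15.00          # hard cap for this run
WARNING_MARGIN_S = 30 * 60  # warn 30 minutes before the cap
CHECK_INTERVAL_S = 300      # re-check every 5 minutes

max_seconds = BUDGET_USD / RATE_PER_HOUR * 3600   # 8.3 hours at this rate and budget
start = time.time()
proc = subprocess.Popen(["python", "train.py"])
warned = False

while proc.poll() is None:
    elapsed = time.time() - start
    if not warned and elapsed >= max_seconds - WARNING_MARGIN_S:
        print("Budget warning: about 30 minutes left, save a checkpoint now", flush=True)
        warned = True
    if elapsed >= max_seconds:
        print(f"Budget cap of ${BUDGET_USD:.2f} reached, stopping training", flush=True)
        proc.send_signal(signal.SIGTERM)   # let the script catch this and checkpoint
        try:
            proc.wait(timeout=120)
        except subprocess.TimeoutExpired:
            proc.kill()                    # it ignored SIGTERM; force it
        break
    time.sleep(CHECK_INTERVAL_S)
```

To make billing actually stop once the cap is hit, chain this with the auto-shutdown step from tip 3.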
What to Do When It Happens Anyway
Even with all these protections, crashes will happen. The question is how fast you recover. Here is the incident response checklist we recommend:
Crash response checklist
- Minute 0: Check if the pod is still running. If yes, check GPU utilization with nvidia-smi.
- Minute 1: If utilization is zero, check the training logs for the last error message or output line.
- Minute 2: If the job crashed, note the epoch and step number. Compare against your last checkpoint.
- Minute 5: If the pod is ghost-billing, terminate it immediately. Do not wait to investigate.
- Minute 10: Restart from the most recent checkpoint. Verify the dataset and configuration are intact.
- Minute 15: Document what happened. Update your monitoring thresholds if the failure was not caught.
The key principle is: stop the billing first, investigate second. Every minute you spend debugging a ghost pod is a minute you pay for nothing. Kill it, then figure out what happened.
The Takeaway
Your training job will crash. It is not a matter of if, but when. The question is whether you are paying for a dead GPU while you sleep. Set up monitoring, checkpoint often, and make sure your pod shuts down when the job dies. The few minutes you spend on this will save you hours of billed GPU time.
The most expensive GPU rental is not the one with the highest hourly rate. It is the one that bills you for nothing while you think it is working. Per-second billing, aggressive checkpointing, and auto-shutdown on crash are not optional luxuries. They are the minimum protection you need before running any training job that lasts longer than an hour.
Need a GPU that bills by the second?
Compare live GPUs with transparent per-second billing and network volume support for safe checkpoints. When your training crashes at 3 AM, you should not pay for the hours you slept.
Browse GPUs