Introduction to GPU Computing Models
Before diving into specifics, let's understand the two fundamental approaches to GPU computing in the cloud.
The Core Difference
DEDICATED GPU:
"I rent a GPU by the hour. It's mine exclusively while I rent it."
→ Like renting an apartment
SERVERLESS GPU:
"I send requests. Someone's GPU handles them. I pay per request."
→ Like staying at a hotel
The Traditional Path: Dedicated Instances
This is the classic cloud computing model applied to GPUs:
┌─────────────────────────────────────────────────┐
│               Your Dedicated GPU                │
│                                                 │
│  ┌─────────────────────────────────────────┐    │
│  │ You have:                               │    │
│  │  • SSH access                           │    │
│  │  • Install any software                 │    │
│  │  • Run anything (training, inference)   │    │
│  │  • Full GPU memory                      │    │
│  │  • GPU available 24/7 while renting     │    │
│  └─────────────────────────────────────────┘    │
│                                                 │
│  You pay: $X/hour regardless of usage           │
└─────────────────────────────────────────────────┘
Example: Running Inference
# On a dedicated instance, the server is always running.
# InferenceServer is an illustrative placeholder, not a real library.
server = InferenceServer(model="llama-70b")
server.start()  # stays up 24/7, whether or not requests arrive

# Every request hits the same long-lived server.
result = server.inference(prompt="Hello, world!")
The Modern Path: Serverless GPU
The GPU itself is abstracted away; you just send requests:
┌─────────────────────────────────────────────────┐
│             Serverless GPU Platform             │
│                                                 │
│  ┌───────┐      ┌─────────┐      ┌───────┐      │
│  │Request├─────▶│  Magic  ├─────▶│Result │      │
│  │       │      │ Happens │      │       │      │
│  └───────┘      └─────────┘      └───────┘      │
│                                                 │
│  You DON'T know/care:                           │
│   • Which GPU handled it                        │
│   • Where the GPU is located                    │
│   • How many GPUs are available                 │
│   • How the infrastructure is managed           │
│                                                 │
│  You pay: $Y per request or per GPU-second      │
└─────────────────────────────────────────────────┘
Example: Running Inference
# Serverless: no server to manage.
# The float16 module here is illustrative; real SDKs vary by provider.
import float16

result = float16.inference(
    model="llama-70b",
    prompt="Hello, world!",
)
# GPU allocated → Request processed → GPU released
Quick Comparison
| Aspect | Dedicated | Serverless |
|---|---|---|
| Billing | Per hour | Per request or per GPU-second |
| Cold Start | None (always on) | Possible delay |
| Scaling | Manual | Automatic |
| Utilization | You optimize | Provider optimizes |
| Control | Full | Limited |
| Setup | Complex | Simple |
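To make the tradeoff concrete, here is a toy decision heuristic in Python. The 50% utilization break-even is an assumption for illustration; the real threshold depends on provider pricing and how much cold-start latency you can tolerate.

# Toy heuristic for picking a model; the 50% utilization
# break-even is an assumption, not a universal rule.
def suggest_gpu_model(avg_utilization: float,
                      needs_full_control: bool) -> str:
    if needs_full_control:
        return "dedicated"   # SSH, custom drivers, long training runs
    if avg_utilization >= 0.5:
        return "dedicated"   # idle time is cheap enough to ignore
    return "serverless"      # pay per request, let the provider scale

print(suggest_gpu_model(avg_utilization=0.15, needs_full_control=False))
# -> serverless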
Why This Matters
The Utilization Problem
Most GPU workloads don't keep the GPU busy 100% of the time:
Dedicated Instance (24 hours):
┌────────────────────────────────────────────────┐
│███░░░░░░░░░███░░░░░░░░░░░██░░░░░░░░░░░██░░░░░░░│
└────────────────────────────────────────────────┘
 ↑           ↑             ↑            ↑
 Burst       Burst         Burst        Burst
Actual GPU utilization: 20%
You paid for: 100% of the time
Wasted: 80% of your money
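A quick back-of-the-envelope calculation shows what that waste costs. The $2.50/hour rate below is an assumed example price, not a real quote:

# Cost of idle time on a dedicated GPU, using an assumed $2.50/hour rate.
HOURLY_RATE = 2.50     # $/hour (illustrative)
HOURS_RENTED = 24
UTILIZATION = 0.20     # GPU busy 20% of the time

total_cost = HOURLY_RATE * HOURS_RENTED    # $60.00 paid
useful_cost = total_cost * UTILIZATION     # $12.00 doing real work
wasted_cost = total_cost - useful_cost     # $48.00 (80%) spent on idle time
print(f"Paid ${total_cost:.2f}, used ${useful_cost:.2f}, "
      f"wasted ${wasted_cost:.2f}")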
The Serverless Solution
Serverless (same workload):
┌────────────────────────────────────────────────┐
│ Request → Process → Done   (pay for 50 ms)     │
│          ... (no cost while idle) ...          │
│ Request → Process → Done   (pay for 50 ms)     │
└────────────────────────────────────────────────┘
You pay only for actual compute time.
No idle costs.
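Here is that same bursty day under per-GPU-second billing; the rate and request volume are assumptions for illustration:

# Same bursty workload billed per GPU-second; all numbers illustrative.
RATE_PER_GPU_SECOND = 0.0006   # $/GPU-second (assumed)
REQUESTS_PER_DAY = 50_000
SECONDS_PER_REQUEST = 0.050    # 50 ms of compute per request

billed_seconds = REQUESTS_PER_DAY * SECONDS_PER_REQUEST   # 2,500 s
daily_cost = billed_seconds * RATE_PER_GPU_SECOND         # $1.50
print(f"{billed_seconds:,.0f} GPU-seconds -> ${daily_cost:.2f}/day; "
      f"idle time costs nothing")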
The Spectrum of Options
GPU access exists on a spectrum:
Most control                                Least control
Most management                          Least management
│                                                      │
▼                                                      ▼
┌─────────┬───────────┬────────────┬───────────┐
│  Bare   │ Dedicated │  Managed   │ Serverless│
│  Metal  │    VM     │ Container  │ Function  │
│         │           │            │           │
│  Colo   │   IaaS    │    PaaS    │ FaaS/AaaS │
└─────────┴───────────┴────────────┴───────────┘
Real-World Analogy: Transportation
DEDICATED GPU = Owning a Car
• Always available
• Pay for insurance, maintenance, and parking even when it's parked
• Full control over routes, timing
• Best for daily commuting
SERVERLESS GPU = Uber/Lyft
• Available on demand
• Pay only per trip
• No maintenance worries
• Best for occasional trips
When Did Serverless GPU Emerge?
Timeline of GPU access evolution:
2010: GPU cloud (first AWS GPU instances)
→ Rent dedicated GPUs by the hour
2016: Container-based GPU (nvidia-docker)
→ Deploy GPU containers easily
2017: Managed GPU services (SageMaker, etc.)
→ Simplified GPU deployment
2022+: Serverless GPU (Modal, RunPod, Float16)
→ Pay-per-request GPU inference
The Business Case
Startup Scenario
Traditional Approach:
- Rent 4x A100 GPUs: $12,000/month
- Average utilization: 15%
- Effective cost per used GPU-hour: $27.78
Serverless Approach:
- Pay per inference: $0.001/request
- 1M requests/month: $1,000/month
- Scale automatically with demand
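The $27.78 figure falls out of simple arithmetic, assuming a 720-hour month:

# Reproducing the startup scenario's effective cost (720-hour month assumed).
monthly_cost = 12_000      # 4x A100, $/month
gpu_count = 4
hours_per_month = 720
utilization = 0.15

used_gpu_hours = gpu_count * hours_per_month * utilization   # 432 hours
effective_rate = monthly_cost / used_gpu_hours               # $27.78
print(f"Effective cost per used GPU-hour: ${effective_rate:.2f}")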
Enterprise Scenario
Traditional Approach:
- Predictable workload
- 24/7 training jobs
- High utilization (>70%)
- Dedicated: Better economics
Serverless Approach:
- Good for inference APIs
- Variable traffic patterns
- Dev/test environments
- Burst capacity needs
What's Next?
In the next chapter, we'll dive deep into serverless GPU: how it works, cold starts, pricing models, and when it excels.