Chapter 1 of 5 • 4 min read
Introduction to GPU Computing Models
Understanding the fundamental differences between serverless and dedicated GPU computing - the two main paradigms for accessing GPU resources.
Before diving into specifics, let's understand the two fundamental approaches to GPU computing in the cloud.
The Core Difference
DEDICATED GPU:
"I rent a GPU by the hour. It's mine exclusively while I rent it."
→ Like renting an apartment
SERVERLESS GPU:
"I send requests. Someone's GPU handles them. I pay per request."
→ Like staying at a hotel
The Traditional Path: Dedicated Instances
This is the classic cloud computing model applied to GPUs:
┌─────────────────────────────────────────────────┐
│               Your Dedicated GPU                │
│                                                 │
│  ┌───────────────────────────────────────────┐  │
│  │ You have:                                 │  │
│  │  • SSH access                             │  │
│  │  • Install any software                   │  │
│  │  • Run anything (training, inference)     │  │
│  │  • Full GPU memory                        │  │
│  │  • GPU available 24/7 while renting       │  │
│  └───────────────────────────────────────────┘  │
│                                                 │
│  You pay: $X/hour regardless of usage           │
└─────────────────────────────────────────────────┘
Example: Running Inference
# On a dedicated instance - the server is always running
server = InferenceServer(model="llama-70b")
server.start()  # Server runs 24/7, billed whether busy or idle

# Every request uses the same server
prompt = "Hello, world!"
result = server.inference(prompt)
The Modern Path: Serverless GPU
GPU compute is abstracted away - you just send requests:
┌─────────────────────────────────────────────────┐
│             Serverless GPU Platform             │
│                                                 │
│  ┌───────┐      ┌─────────┐      ┌───────┐      │
│  │Request├─────▶│  Magic  ├─────▶│Result │      │
│  │       │      │ Happens │      │       │      │
│  └───────┘      └─────────┘      └───────┘      │
│                                                 │
│  You DON'T know/care:                           │
│   • Which GPU handled it                        │
│   • Where the GPU is located                    │
│   • How many GPUs are available                 │
│   • Infrastructure management                   │
│                                                 │
│  You pay: $Y per request or per GPU-second      │
└─────────────────────────────────────────────────┘
Example: Running Inference
# Serverless - no server management
import float16

result = float16.inference(
    model="llama-70b",
    prompt="Hello, world!"
)
# GPU allocated → Request processed → GPU released
Quick Comparison
| Aspect | Dedicated | Serverless |
|---|---|---|
| Billing | Per hour | Per request |
| Cold Start | None (always on) | Possible delay |
| Scaling | Manual | Automatic |
| Utilization | You optimize | Provider optimizes |
| Control | Full | Limited |
| Setup | Complex | Simple |
Why This Matters
Utilization Problem
Most GPU workloads don't use the GPU 100% of the time:
Dedicated Instance (24 hours):
┌────────────────────────────────────────────────┐
│████░░░░████░░░░░░░░████░░░░░░░░████░░░░░░░░░░░░│
└────────────────────────────────────────────────┘
  ↑       ↑           ↑           ↑
Burst   Burst       Burst       Burst
Actual GPU utilization: 20%
You paid for: 100% of the time
Wasted: 80% of your money
Serverless Solution
Serverless (Same workload):
┌────────────────────────────────────────────────┐
│ Request → Process → Done (Pay for 50ms)        │
│ ... (no cost while idle) ...                   │
│ Request → Process → Done (Pay for 50ms)        │
└────────────────────────────────────────────────┘
You pay only for actual compute time.
No idle costs.
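To put rough numbers on this picture, here is a minimal Python sketch of one 24-hour day at the 20% utilization shown above. The dedicated hourly rate and serverless per-GPU-second rate are illustrative assumptions, not real provider prices.
# Illustrative 24-hour cost comparison - rates below are assumptions, not quotes
DEDICATED_RATE_PER_HOUR = 2.50       # assumed $/hour for one dedicated GPU
SERVERLESS_RATE_PER_SECOND = 0.001   # assumed $/GPU-second on serverless

HOURS_PER_DAY = 24
UTILIZATION = 0.20                   # 20% busy, as in the diagram above

# Dedicated: billed for every hour, busy or idle
dedicated_cost = DEDICATED_RATE_PER_HOUR * HOURS_PER_DAY
wasted = dedicated_cost * (1 - UTILIZATION)

# Serverless: billed only for the seconds the GPU actually works
busy_seconds = HOURS_PER_DAY * 3600 * UTILIZATION
serverless_cost = SERVERLESS_RATE_PER_SECOND * busy_seconds

print(f"Dedicated:  ${dedicated_cost:.2f}/day (${wasted:.2f} of it idle)")
print(f"Serverless: ${serverless_cost:.2f}/day for the same busy time")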
The Spectrum of Options
GPU access exists on a spectrum:
Most Control                           Least Control
Most Management                     Least Management
│                                              │
▼                                              ▼
┌─────────┬───────────┬────────────┬───────────┐
│  Bare   │ Dedicated │  Managed   │ Serverless│
│  Metal  │    VM     │ Container  │ Function  │
│         │           │            │           │
│  Colo   │   IaaS    │    PaaS    │ FaaS/AaaS │
└─────────┴───────────┴────────────┴───────────┘
Real-World Analogy: Transportation
DEDICATED GPU = Owning a Car
• Always available
• Pay for insurance, maintenance, and parking even when it's parked
• Full control over routes, timing
• Best for daily commuting
SERVERLESS GPU = Uber/Lyft
• Available on demand
• Pay only per trip
• No maintenance worries
• Best for occasional trips
When Did Serverless GPU Emerge?
Timeline of GPU access evolution:
2012: GPU cloud (AWS GPU instances)
→ Rent dedicated GPUs by the hour
2017: Container-based GPU (NVIDIA Docker)
→ Deploy GPU containers easily
2020: Managed GPU services (SageMaker, etc.)
→ Simplified GPU deployment
2022+: Serverless GPU (Modal, RunPod, Float16)
→ Pay-per-request GPU inference
The Business Case
Startup Scenario
Traditional Approach:
- Rent 4x A100 GPUs: $12,000/month
- Average utilization: 15%
- Effective cost per used GPU-hour: $27.78
Serverless Approach:
- Pay per inference: $0.001/request
- 1M requests/month: $1,000/month
- Scale automatically with demand
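A quick sanity check of those numbers, assuming a 720-hour month:
# Sanity check of the startup scenario (720-hour month assumed)
gpus = 4
monthly_cost = 12_000            # $ for 4x A100, from the scenario above
utilization = 0.15               # 15% average utilization
hours_per_month = 720

used_gpu_hours = gpus * hours_per_month * utilization        # 432 GPU-hours
cost_per_used_gpu_hour = monthly_cost / used_gpu_hours       # ~ $27.78

requests_per_month = 1_000_000
price_per_request = 0.001        # $ per inference, from the scenario above
serverless_monthly = requests_per_month * price_per_request  # $1,000

print(f"Dedicated:  ${cost_per_used_gpu_hour:.2f} per GPU-hour actually used")
print(f"Serverless: ${serverless_monthly:,.0f}/month for 1M requests")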
Enterprise Scenario
Traditional Approach:
- Predictable workload
- 24/7 training jobs
- High utilization (>70%)
- Dedicated: Better economics
Serverless Approach:
- Good for inference APIs
- Variable traffic patterns
- Dev/test environments
- Burst capacity needs
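The ">70% utilization" threshold comes from a simple break-even: dedicated wins once paying a flat hourly rate beats paying a higher rate only for the hours you actually use. A sketch, with both rates assumed purely for illustration:
# Break-even utilization - both rates are assumptions for illustration only
DEDICATED_RATE_PER_HOUR = 2.50    # assumed $/hour, billed whether busy or idle
SERVERLESS_RATE_PER_HOUR = 3.60   # assumed $/hour of GPU time actually used

# Dedicated cost is flat; serverless cost scales with the hours you use.
# They cross where dedicated_rate = serverless_rate * utilization.
break_even = DEDICATED_RATE_PER_HOUR / SERVERLESS_RATE_PER_HOUR
print(f"Dedicated becomes cheaper above ~{break_even:.0%} utilization")  # ~69%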
What's Next?
In the next chapter, we'll dive deep into serverless GPU - how it works, cold starts, pricing models, and when it excels.