Chapter 2 of 5 • 5 min read
Serverless GPU Deep Dive
Understanding how serverless GPU works - the technology, cold starts, pricing models, auto-scaling, and when it's the right choice.
Serverless GPU computing abstracts away infrastructure management. Let's understand how it works and when to use it.
How Serverless GPU Works
┌─────────────────────────────────────────────────────────┐
│ Request Flow │
│ │
│ 1. Request → API Gateway │
│ ↓ │
│ 2. Find/Spin up GPU instance (if needed) │
│ ↓ │
│ 3. Load model (if not cached) │
│ ↓ │
│ 4. Run inference │
│ ↓ │
│ 5. Return result │
│ ↓ │
│ 6. Keep warm or shut down │
└─────────────────────────────────────────────────────────┘
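Here is a minimal Python sketch of that flow, assuming a generic container-based runtime; the handler name, the simulated load delay, and the stand-in model are illustrative rather than any specific platform's API. The key detail is the module-level cache, which is what separates a warm request from a cold one.

```python
import time

_MODEL = None  # module-level cache: survives across warm invocations in the same container


def _load_model():
    """Simulate the slow cold-start step: pulling weights from storage into GPU memory."""
    global _MODEL
    if _MODEL is None:
        time.sleep(2)                       # stand-in for a 5-60s weight load
        _MODEL = lambda text: text.upper()  # stand-in for a real model
    return _MODEL


def handler(request: dict) -> dict:
    """Entry point the platform invokes per request (steps 3-5 of the flow above)."""
    start = time.time()
    model = _load_model()                   # step 3: only slow on the first (cold) request
    output = model(request["input"])        # step 4: run inference
    return {"output": output, "latency_s": round(time.time() - start, 3)}  # step 5


if __name__ == "__main__":
    print(handler({"input": "first request is cold"}))   # ~2s: pays the load cost
    print(handler({"input": "second request is warm"}))  # milliseconds: model already cached
```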
The Cold Start Problem
The biggest challenge with serverless GPU is cold starts:
What is a Cold Start?
WARM REQUEST (container already running):
Request → Inference (~50ms) → Response

COLD REQUEST (container must be started first):
Request → Spin up (~5s) → Load model (~10s) → Inference (~50ms) → Response

Total: 15+ seconds
Why Cold Starts Happen
┌─────────────────────────────────────────────────┐
│ GPU Resource Pool │
│ │
│ [Running] [Running] [Empty] [Empty] [Empty] │
│ ↑ ↑ │
│ Handling Handling │
│ requests requests │
│ │
│ New request arrives, no running instances: │
│ → Must start new instance (COLD START) │
└─────────────────────────────────────────────────┘
Cold Start Components
| Component | Typical Time | Optimization Lever |
|---|---|---|
| Instance allocation | 2-10s | Platform |
| Container start | 1-5s | Image size |
| Model loading | 5-60s | Model size |
| First inference | 1-5s | Model optimization |
Mitigating Cold Starts
1. Keep Warm (Min Instances)
# Configure minimum running instances
deployment:
  min_instances: 2    # Always keep 2 warm
  max_instances: 10   # Scale up to 10
Trade-off: Pay for idle capacity.
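To see what that trade-off costs, here is a back-of-envelope calculation; the $1.20/GPU-hour rate is an assumed example figure, not any provider's actual price.

```python
HOURLY_RATE = 1.20      # assumed $/GPU-hour (illustrative, not a real price)
MIN_INSTANCES = 2       # matches min_instances above
HOURS_PER_MONTH = 730

idle_floor = MIN_INSTANCES * HOURLY_RATE * HOURS_PER_MONTH
print(f"Keep-warm floor: ${idle_floor:,.0f}/month even with zero traffic")
# Keep-warm floor: $1,752/month even with zero traffic
```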
2. Model Caching
First request: Load model from storage → GPU memory
(Slow: 10-30 seconds)
Subsequent requests: Model already in GPU memory
(Fast: milliseconds)
3. Smaller Models
Model Size → Load Time
70B parameters → 30-60 seconds
7B parameters → 5-10 seconds
1B parameters → 1-2 seconds
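Those load times follow roughly from weight size divided by storage bandwidth. The sketch below makes that relationship explicit; the 2 bytes/parameter (fp16) and 3 GB/s read bandwidth are illustrative assumptions.

```python
def estimated_load_seconds(params_billion: float,
                           bytes_per_param: int = 2,       # fp16 weights
                           bandwidth_gb_s: float = 3.0):   # assumed storage read speed
    """Rough load time = weight size on disk / storage bandwidth."""
    size_gb = params_billion * bytes_per_param
    return size_gb / bandwidth_gb_s


for params in (70, 7, 1):
    print(f"{params}B params ≈ {params * 2} GB ≈ {estimated_load_seconds(params):.0f}s to load")
# 70B params ≈ 140 GB ≈ 47s to load
# 7B params ≈ 14 GB ≈ 5s to load
# 1B params ≈ 2 GB ≈ 1s to load
```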
4. Optimized Model Formats
Original model: PyTorch (.pt)
→ Convert to: ONNX or TensorRT (commonly served via Triton Inference Server)
→ 2-5x faster loading
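For example, a PyTorch model can be exported to ONNX with the built-in exporter. This is a minimal sketch with a trivial stand-in model; real models need their own representative example inputs.

```python
import torch

model = torch.nn.Linear(16, 4).eval()   # stand-in for a real PyTorch model
example_input = torch.randn(1, 16)

# Export to ONNX; the resulting file can be run with ONNX Runtime,
# compiled with TensorRT, or served behind Triton Inference Server.
torch.onnx.export(model, example_input, "model.onnx")
```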
Serverless GPU Pricing Models
Pay-Per-Request
Pricing: $0.001 per request
Simple to understand
Best for: Consistent request sizes
Example:
1M requests × $0.001 = $1,000/month
Pay-Per-Second
Pricing: $0.0001 per GPU-second
Fairer for variable workloads
Best for: Mixed short/long requests
Example:
Request A: 0.5s × $0.0001 = $0.00005
Request B: 10s × $0.0001 = $0.001
Pay-Per-Token (LLM Specific)
Pricing:
- Input: $0.001 per 1K tokens
- Output: $0.002 per 1K tokens
Best for: LLM inference
Example:
1K input tokens + 500 output tokens
= $0.001 + $0.001 = $0.002
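The three models reduce to a few lines of arithmetic. The sketch below reuses the example rates from this section; actual rates vary by provider.

```python
def per_request(requests: int, rate: float = 0.001) -> float:
    return requests * rate


def per_second(gpu_seconds: float, rate: float = 0.0001) -> float:
    return gpu_seconds * rate


def per_token(input_tokens: int, output_tokens: int,
              in_rate: float = 0.001, out_rate: float = 0.002) -> float:
    return input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate


print(f"${per_request(1_000_000):,.2f}")  # $1,000.00 -> 1M requests per month
print(f"${per_second(10):.4f}")           # $0.0010   -> one 10-second request
print(f"${per_token(1000, 500):.4f}")     # $0.0020   -> 1K input + 500 output tokens
```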
Auto-Scaling
Serverless platforms automatically scale based on demand:
Traffic Pattern:

                  ┌─────┐
           ┌──────┘     └──────┐
    ┌──────┘                   └──────┐
────┘                                 └──────
   8AM            12PM             8PM

GPU Instances (scaled to match the curve):
1 → 2 → 5 → 10 (midday peak) → 5 → 2 → 1
Scaling Metrics
- Queue depth: Requests waiting
- Latency: Response time increasing
- Utilization: GPU usage percentage
Scaling Configuration
autoscaling:
  min_instances: 1
  max_instances: 50
  target_utilization: 70%
  scale_up_threshold: 10   # requests queued
  scale_down_delay: 5m     # wait 5 minutes before scaling down
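Under the hood, the scaling decision is roughly a comparison of those metrics against the configured thresholds. This is a simplified sketch of what an orchestrator might do, not any platform's actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class AutoscaleConfig:
    min_instances: int = 1
    max_instances: int = 50
    target_utilization: float = 0.70
    scale_up_queue_depth: int = 10   # queued requests before adding capacity


def desired_instances(current: int, queued: int, utilization: float,
                      cfg: AutoscaleConfig) -> int:
    """One evaluation step: scale out on pressure, scale in when clearly idle."""
    if queued >= cfg.scale_up_queue_depth or utilization > cfg.target_utilization:
        current += 1    # scale up one step
    elif queued == 0 and utilization < cfg.target_utilization / 2:
        current -= 1    # scale down (in practice, only after scale_down_delay elapses)
    return max(cfg.min_instances, min(cfg.max_instances, current))


cfg = AutoscaleConfig()
print(desired_instances(current=2, queued=15, utilization=0.90, cfg=cfg))  # 3: queue is backing up
print(desired_instances(current=2, queued=0,  utilization=0.20, cfg=cfg))  # 1: mostly idle
```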
Serverless GPU Architecture
Typical Provider Architecture
┌─────────────────────────────────────────────────────────┐
│ Load Balancer │
└─────────────────────────┬───────────────────────────────┘
│
┌─────────────────────────┴───────────────────────────────┐
│ API Gateway │
│ • Authentication │
│ • Rate limiting │
│ • Request routing │
└─────────────────────────┬───────────────────────────────┘
│
┌─────────────────────────┴───────────────────────────────┐
│ Orchestrator │
│ • Find available GPU │
│ • Start new instances │
│ • Queue management │
└─────────────────────────┬───────────────────────────────┘
│
┌────────────────┼────────────────┐
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│GPU Pod 1│ │GPU Pod 2│ │GPU Pod N│
│ Model A │ │ Model A │ │ Model B │
└─────────┘ └─────────┘ └─────────┘
When Serverless GPU Excels
1. Variable Traffic
Monday: 100 requests
Tuesday: 10,000 requests
Wednesday: 500 requests
Serverless: Pay for actual usage
Dedicated: Pay for peak capacity (10K/day)
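A toy comparison makes the difference in billing shape concrete; the per-request rate matches the earlier pricing example, and the $1.20/GPU-hour dedicated rate is an assumed figure. The point is the shape of the bill, not the exact numbers.

```python
daily_requests = {"Mon": 100, "Tue": 10_000, "Wed": 500}

SERVERLESS_PER_REQUEST = 0.001   # assumed, same rate as the pricing example above
DEDICATED_HOURLY = 1.20          # assumed $/GPU-hour for an always-on instance

serverless_cost = sum(daily_requests.values()) * SERVERLESS_PER_REQUEST
dedicated_cost = DEDICATED_HOURLY * 24 * len(daily_requests)   # runs 24/7, sized for the peak day

print(f"Serverless: ${serverless_cost:.2f} over the 3 days")   # $10.60
print(f"Dedicated:  ${dedicated_cost:.2f} over the 3 days")    # $86.40
```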
2. Unpredictable Demand
Startup launching new feature:
"Will we get 100 or 100,000 users?"
Serverless: Scales automatically
Dedicated: Must provision upfront
3. Multiple Models
You need to serve:
- Model A (popular, 1M requests/day)
- Model B (occasional, 1K requests/day)
- Model C (rare, 10 requests/day)
Dedicated: 3 GPUs always running
Serverless: Pay only when models are used
4. Development and Testing
Development:
- Sporadic testing
- Hours between requests
- Multiple developers
Serverless: Pay only during tests
Dedicated: GPU sits idle 90% of the time
5. Burst Capacity
Normal: 100 req/s
Black Friday: 10,000 req/s
Serverless: Auto-scales to handle burst
Dedicated: Need 100x over-provisioning
Serverless GPU Limitations
1. Latency-Sensitive Applications
Cold start latency (5-30s) is unacceptable for:
- Real-time chat
- Interactive applications
- Gaming
- Live streaming
2. Long-Running Workloads
Model Training:
- Runs for hours/days
- Needs persistent state
- GPU always utilized
→ Serverless overhead makes no sense
3. Large Model Loading
405B parameter model:
- 800GB model weights
- 15+ minutes to load
- Cold start is catastrophic
→ Must keep dedicated instances warm
4. Custom Infrastructure
You need:
- Specific CUDA version
- Custom libraries
- Special hardware configuration
→ Serverless platforms may not support these requirements
Summary
SERVERLESS GPU IS GREAT FOR:
✓ Variable/unpredictable traffic
✓ API-based inference
✓ Multi-model serving
✓ Development/testing
✓ Cost optimization (low utilization)
✓ Auto-scaling needs
SERVERLESS GPU IS NOT GREAT FOR:
✗ Latency-critical applications
✗ Long-running training
✗ Very large models
✗ Custom infrastructure needs
✗ High sustained utilization
What's Next?
In the next chapter, we'll explore dedicated GPU instances - when they make sense and how to use them effectively.