Serverless GPU Deep Dive
Serverless GPU computing abstracts away infrastructure management. Let's understand how it works and when to use it.
How Serverless GPU Works
┌─────────────────────────────────────────────┐
│                Request Flow                 │
│                                             │
│  1. Request → API Gateway                   │
│          ↓                                  │
│  2. Find/Spin up GPU instance (if needed)   │
│          ↓                                  │
│  3. Load model (if not cached)              │
│          ↓                                  │
│  4. Run inference                           │
│          ↓                                  │
│  5. Return result                           │
│          ↓                                  │
│  6. Keep warm or shut down                  │
└─────────────────────────────────────────────┘
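A minimal sketch of steps 3-5 in handler form, the shape most serverless GPU platforms expect. The load_model and run_inference helpers are hypothetical placeholders, not any provider's API; the module-level variable is what survives between requests while the container stays warm.

import time

_model = None  # kept in memory while the container stays warm


def load_model():
    """Placeholder for pulling weights from storage into GPU memory."""
    time.sleep(2)  # simulate a slow load
    return object()


def run_inference(model, payload):
    """Placeholder for the actual forward pass."""
    return {"echo": payload}


def handler(request):
    """Called once per request: load the model if needed, run inference, return."""
    global _model
    start = time.monotonic()
    if _model is None:                         # cold path: model not cached yet
        _model = load_model()                  # step 3: load model
    result = run_inference(_model, request)    # step 4: run inference
    return {"result": result,                  # step 5: return result
            "latency_s": round(time.monotonic() - start, 3)}


if __name__ == "__main__":
    print(handler({"prompt": "hello"}))  # cold: includes the load time
    print(handler({"prompt": "again"}))  # warm: model already in memory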
The Cold Start Problem
The biggest challenge with serverless GPU is cold starts:
What is a Cold Start?
WARM REQUEST (Container already running):

Request → Inference → Response
        └── 50ms ──┘

COLD REQUEST (Need to start container):

Request → Spin up → Load Model → Inference → Response
        └── 5s ──┘ └── 10s ──┘  └── 50ms ──┘

Total: 15+ seconds
Why Cold Starts Happen
┌────────────────────────────────────────────────────┐
│                 GPU Resource Pool                  │
│                                                    │
│  [Running]  [Running]  [Empty]  [Empty]  [Empty]   │
│      ↑          ↑                                  │
│   Handling   Handling                              │
│   requests   requests                              │
│                                                    │
│  New request arrives, no idle instance available:  │
│  → Must start a new instance (COLD START)          │
└────────────────────────────────────────────────────┘
Cold Start Components
| Component | Typical time | How to optimize |
|---|---|---|
| Instance allocation | 2-10s | Platform-dependent (little user control) |
| Container start | 1-5s | Smaller container image |
| Model loading | 5-60s | Smaller model, faster weight format |
| First inference | 1-5s | Model optimization (compilation, warm-up) |
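To see where your own cold start time goes, you can instrument each phase at container startup. This is a generic pattern, not any platform's API; pull_weights, build_runtime, and warmup are hypothetical stand-ins for your real loading steps.

import time
from contextlib import contextmanager

timings = {}

@contextmanager
def phase(name):
    """Record wall-clock time for one cold-start phase."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[name] = time.monotonic() - start

def pull_weights():
    time.sleep(0.20)   # stand-in: download/read model weights

def build_runtime():
    time.sleep(0.10)   # stand-in: create the inference engine

def warmup():
    time.sleep(0.05)   # stand-in: the first (slow) inference

if __name__ == "__main__":
    with phase("model_loading"):
        pull_weights()
    with phase("runtime_init"):
        build_runtime()
    with phase("first_inference"):
        warmup()
    for name, seconds in timings.items():
        print(f"{name:>16}: {seconds:.2f}s")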
Mitigating Cold Starts
1. Keep Warm (Min Instances)
# Configure minimum running instances
deployment:
  min_instances: 2   # Always keep 2 warm
  max_instances: 10  # Scale up to 10
Trade-off: Pay for idle capacity.
2. Model Caching
First request: Load model from storage → GPU memory
(Slow: 10-30 seconds)
Subsequent requests: Model already in GPU memory
(Fast: milliseconds)
3. Smaller Models
Model Size → Load Time
70B parameters → 30-60 seconds
7B parameters → 5-10 seconds
1B parameters → 1-2 seconds
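These load times follow from simple arithmetic: model size divided by how fast the weights can be streamed into GPU memory. A back-of-the-envelope sketch; the 2-bytes-per-parameter and bandwidth figures are illustrative assumptions, not measurements.

def load_time_seconds(params_billion,
                      bytes_per_param=2,      # FP16/BF16 weights
                      bandwidth_gb_s=2.5):    # assumed storage/network read throughput
    """Rough estimate: model size (GB) / effective read bandwidth (GB/s)."""
    size_gb = params_billion * bytes_per_param
    return size_gb / bandwidth_gb_s

for params in (70, 7, 1):
    print(f"{params:>3}B params ≈ {load_time_seconds(params):5.1f}s to load")
# 70B ≈ 56s, 7B ≈ 5.6s, 1B ≈ 0.8s -- roughly the ranges listed above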
4. Optimized Model Formats
Original model: PyTorch checkpoint (.pt)
→ Convert to: safetensors, ONNX, or a TensorRT engine
  (Triton Inference Server is a runtime that serves these formats, not a format itself)
→ Typically 2-5x faster loading and first inference
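As one concrete example of the loading-side win, a PyTorch state dict saved with the safetensors library is memory-mapped on load instead of unpickled. A minimal sketch, assuming torch and safetensors are installed and using a tiny stand-in model:

import torch
from safetensors.torch import save_file, load_file

# Tiny stand-in model; in practice this is your real checkpoint.
model = torch.nn.Linear(1024, 1024)

# One-time conversion: pickle-based .pt -> .safetensors
torch.save(model.state_dict(), "model.pt")
state = torch.load("model.pt", map_location="cpu")
save_file(state, "model.safetensors")

# At cold start: safetensors loads via zero-copy memory mapping,
# which is typically much faster than unpickling a large .pt file.
model.load_state_dict(load_file("model.safetensors"))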
Serverless GPU Pricing Models
Pay-Per-Request
Pricing: $0.001 per request
Simple to understand
Best for: Consistent request sizes
Example:
1M requests × $0.001 = $1,000/month
Pay-Per-Second
Pricing: $0.0001 per GPU-second
Fairer for variable workloads
Best for: Mixed short/long requests
Example:
Request A: 0.5s × $0.0001 = $0.00005
Request B: 10s × $0.0001 = $0.001
Pay-Per-Token (LLM Specific)
Pricing:
- Input: $0.001 per 1K tokens
- Output: $0.002 per 1K tokens
Best for: LLM inference
Example:
1K input tokens + 500 output tokens
= $0.001 + $0.001 = $0.002
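The three models are easy to compare for your own traffic mix. A small sketch using the illustrative rates from the examples above (not any provider's actual prices):

def per_request_cost(requests, price_per_request=0.001):
    return requests * price_per_request

def per_second_cost(gpu_seconds, price_per_second=0.0001):
    return gpu_seconds * price_per_second

def per_token_cost(input_tokens, output_tokens,
                   input_per_1k=0.001, output_per_1k=0.002):
    return input_tokens / 1000 * input_per_1k + output_tokens / 1000 * output_per_1k

# Reproduce the examples above
print(f"${per_request_cost(1_000_000):,.2f} per month")               # $1,000.00
print(f"${per_second_cost(0.5):.5f} for a 0.5s request")              # $0.00005
print(f"${per_second_cost(10):.5f} for a 10s request")                # $0.00100
print(f"${per_token_cost(1000, 500):.4f} for 1K in / 500 out tokens") # $0.0020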
Auto-Scaling
Serverless platforms automatically scale based on demand:
Traffic Pattern:

                    ┌─────┐
                    │     │
             ┌──────┘     └──────┐
      ┌──────┘                   └──────┐
──────┘                                 └──────
   8AM                12PM             8PM

GPU Instances:

                [10 instances]
          [5 instances] [5 instances]
   [2 instances]              [2 instances]
[1 instance]                          [1 instance]
Scaling Metrics
- Queue depth: Requests waiting
- Latency: Response time increasing
- Utilization: GPU usage percentage
Scaling Configuration
autoscaling:
  min_instances: 1
  max_instances: 50
  target_utilization: 70%
  scale_up_threshold: 10 requests queued
  scale_down_delay: 5 minutes
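A simplified sketch of the decision a platform autoscaler makes with a config like this. The thresholds mirror the example values above; real schedulers add smoothing, cooldowns, and per-model pools.

from dataclasses import dataclass

@dataclass
class AutoscaleConfig:
    min_instances: int = 1
    max_instances: int = 50
    target_utilization: float = 0.70   # scale up above this GPU utilization...
    scale_up_queue_depth: int = 10     # ...or when this many requests are queued
    scale_down_delay_s: int = 300      # wait this long before scaling down

def desired_instances(cfg, current, utilization, queue_depth, idle_seconds):
    """Return how many instances the pool should run right now."""
    if queue_depth >= cfg.scale_up_queue_depth or utilization > cfg.target_utilization:
        return min(current + 1, cfg.max_instances)    # scale up, bounded by max
    if utilization < cfg.target_utilization / 2 and idle_seconds >= cfg.scale_down_delay_s:
        return max(current - 1, cfg.min_instances)    # scale down slowly, bounded by min
    return current                                    # hold steady

cfg = AutoscaleConfig()
print(desired_instances(cfg, current=2, utilization=0.85, queue_depth=3, idle_seconds=0))    # 3
print(desired_instances(cfg, current=2, utilization=0.20, queue_depth=0, idle_seconds=600))  # 1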
Serverless GPU Architecture
Typical Provider Architecture
┌─────────────────────────────────────────────────────────┐
│                      Load Balancer                      │
└────────────────────────────┬────────────────────────────┘
                             │
┌────────────────────────────┴────────────────────────────┐
│                       API Gateway                       │
│  • Authentication                                       │
│  • Rate limiting                                        │
│  • Request routing                                      │
└────────────────────────────┬────────────────────────────┘
                             │
┌────────────────────────────┴────────────────────────────┐
│                      Orchestrator                       │
│  • Find available GPU                                   │
│  • Start new instances                                  │
│  • Queue management                                     │
└────────────────────────────┬────────────────────────────┘
                             │
            ┌────────────────┼────────────────┐
            │                │                │
       ┌────┴────┐      ┌────┴────┐      ┌────┴────┐
       │GPU Pod 1│      │GPU Pod 2│      │GPU Pod N│
       │ Model A │      │ Model A │      │ Model B │
       └─────────┘      └─────────┘      └─────────┘
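The orchestrator's core job is routing: hand the request to a warm, idle pod already serving that model, otherwise start one and pay the cold start. A heavily simplified sketch; start_pod is a hypothetical stand-in for whatever provisions a GPU and loads the model.

import random

def start_pod(model_name):
    """Hypothetical stand-in: provision a GPU and load the model (the slow, cold path)."""
    return {"model": model_name, "busy": False}

class Orchestrator:
    def __init__(self):
        self.pods = []  # all running GPU pods

    def route(self, model_name):
        """Return a pod for this model, preferring warm idle pods over cold starts."""
        warm = [p for p in self.pods if p["model"] == model_name and not p["busy"]]
        if warm:
            return random.choice(warm)   # warm path: reuse a running pod
        pod = start_pod(model_name)      # cold path: new pod, incurs a cold start
        self.pods.append(pod)
        return pod

orch = Orchestrator()
print(orch.route("model-a"))  # cold start: no pod serving model-a yet
print(orch.route("model-a"))  # warm: reuses the idle model-a pod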
When Serverless GPU Excels
1. Variable Traffic
Monday: 100 requests
Tuesday: 10,000 requests
Wednesday: 500 requests
Serverless: Pay for actual usage
Dedicated: Pay for peak capacity (10K/day)
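With the traffic above, the comparison is simple arithmetic. The per-request rate is the illustrative figure from the pricing section, and the dedicated hourly price is an assumed placeholder, not a real quote.

requests_per_day = [100, 10_000, 500]   # Mon, Tue, Wed from the example
price_per_request = 0.001               # illustrative serverless rate
dedicated_hourly = 1.50                 # assumed on-demand GPU price (placeholder)

serverless = sum(requests_per_day) * price_per_request
dedicated = dedicated_hourly * 24 * len(requests_per_day)   # runs around the clock

print(f"Serverless: ${serverless:.2f} for the three days")  # $10.60
print(f"Dedicated:  ${dedicated:.2f} for the three days")   # $108.00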
2. Unpredictable Demand
Startup launching new feature:
"Will we get 100 or 100,000 users?"
Serverless: Scales automatically
Dedicated: Must provision upfront
3. Multiple Models
You need to serve:
- Model A (popular, 1M requests/day)
- Model B (occasional, 1K requests/day)
- Model C (rare, 10 requests/day)
Dedicated: 3 GPUs always running
Serverless: Pay only when models are used
4. Development and Testing
Development:
- Sporadic testing
- Hours between requests
- Multiple developers
Serverless: Pay only during tests
Dedicated: GPU sits idle 90% of the time
5. Burst Capacity
Normal: 100 req/s
Black Friday: 10,000 req/s
Serverless: Auto-scales to handle burst
Dedicated: Need 100x over-provisioning
Serverless GPU Limitations
1. Latency-Sensitive Applications
Cold start latency (5-30s) is unacceptable for:
- Real-time chat
- Interactive applications
- Gaming
- Live streaming
2. Long-Running Workloads
Model Training:
- Runs for hours/days
- Needs persistent state
- GPU always utilized
→ Serverless overhead makes no sense
3. Large Model Loading
405B parameter model:
- 800GB model weights
- 15+ minutes to load
- Cold start is catastrophic
→ Must keep dedicated instances warm
4. Custom Infrastructure
You need:
- Specific CUDA version
- Custom libraries
- Special hardware configuration
→ Serverless platforms may not support these requirements
Summary
SERVERLESS GPU IS GREAT FOR:
✓ Variable/unpredictable traffic
✓ API-based inference
✓ Multi-model serving
✓ Development/testing
✓ Cost optimization (low utilization)
✓ Auto-scaling needs
SERVERLESS GPU IS NOT GREAT FOR:
✗ Latency-critical applications
✗ Long-running training
✗ Very large models
✗ Custom infrastructure needs
✗ High sustained utilization
What's Next?
In the next chapter, we'll explore dedicated GPU instances: when they make sense and how to use them effectively.