Chapter 2: Serverless GPU

Serverless GPU Deep Dive

Understanding how serverless GPU works - the technology, cold starts, pricing models, auto-scaling, and when it's the right choice.

Serverless GPU computing abstracts away infrastructure management. Let's understand how it works and when to use it.

How Serverless GPU Works

┌─────────────────────────────────────────────────────────┐
│                    Request Flow                          │
│                                                         │
│  1. Request → API Gateway                               │
│                    ↓                                    │
│  2. Find/Spin up GPU instance (if needed)               │
│                    ↓                                    │
│  3. Load model (if not cached)                          │
│                    ↓                                    │
│  4. Run inference                                       │
│                    ↓                                    │
│  5. Return result                                       │
│                    ↓                                    │
│  6. Keep warm or shut down                              │
└─────────────────────────────────────────────────────────┘

The Cold Start Problem

The biggest challenge with serverless GPU is cold starts:

What is a Cold Start?

WARM REQUEST (Container already running):
Request → Inference → Response
         └── 50ms ──┘

COLD REQUEST (Need to start container):
Request → Spin up → Load Model → Inference → Response
         └─ 5s ──┘ └── 10s ──┘ └── 50ms ──┘
                    Total: 15+ seconds

Why Cold Starts Happen

┌─────────────────────────────────────────────────┐
│              GPU Resource Pool                   │
│                                                 │
│  [Running] [Running] [Empty] [Empty] [Empty]   │
│      ↑         ↑                               │
│   Handling   Handling                          │
│   requests   requests                          │
│                                                │
│  New request arrives, no running instances:    │
│  → Must start new instance (COLD START)        │
└─────────────────────────────────────────────────┘

Cold Start Components

Component             Time     Optimization lever
Instance allocation   2-10s    Platform
Container start       1-5s     Image size
Model loading         5-60s    Model size
First inference       1-5s     Model optimization
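
Adding these components up explains the "15+ seconds" figure above. A quick sketch in Python that sums the lower and upper bounds from the table:

# Sum the cold-start components from the table above (seconds).
components = {
    "instance allocation": (2, 10),
    "container start": (1, 5),
    "model loading": (5, 60),
    "first inference": (1, 5),
}

low = sum(lo for lo, _ in components.values())
high = sum(hi for _, hi in components.values())
print(f"Expected cold start: {low}-{high} seconds")
# Expected cold start: 9-80 seconds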

Mitigating Cold Starts

1. Keep Warm (Min Instances)

# Configure minimum running instances
deployment:
  min_instances: 2  # Always keep 2 warm
  max_instances: 10 # Scale up to 10

Trade-off: Pay for idle capacity.
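
To reason about that trade-off, estimate what the warm capacity costs when it sits idle. A minimal sketch, assuming a hypothetical $2.00/hour GPU rate and 30% average utilization (both illustrative numbers, not any provider's pricing):

# Rough idle-cost estimate for keeping instances warm.
# The $2.00/hr rate and 30% utilization are assumptions for illustration.
GPU_HOURLY_RATE = 2.00      # USD per GPU-hour (assumed)
MIN_INSTANCES = 2           # matches min_instances in the config above
AVG_UTILIZATION = 0.30      # fraction of warm time actually serving requests

hours_per_month = 24 * 30
warm_cost = MIN_INSTANCES * GPU_HOURLY_RATE * hours_per_month
idle_cost = warm_cost * (1 - AVG_UTILIZATION)

print(f"Warm capacity cost: ${warm_cost:,.0f}/month")
print(f"Of which idle:      ${idle_cost:,.0f}/month")
# Warm capacity cost: $2,880/month
# Of which idle:      $2,016/month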

2. Model Caching

First request: Load model from storage → GPU memory
              (Slow: 10-30 seconds)

Subsequent requests: Model already in GPU memory
                    (Fast: milliseconds)
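
In practice this means loading the model once per container and reusing it across warm requests. A minimal sketch of that pattern (the handler signature and DummyModel are illustrative, not a specific platform's API):

# Module-level cache: populated once per container, reused on warm requests.
class DummyModel:
    # Stands in for a real model held in GPU memory.
    def predict(self, x):
        return f"prediction for {x}"

_model = None

def load_model():
    # Placeholder for the slow step: reading weights from storage into GPU memory.
    return DummyModel()

def get_model():
    global _model
    if _model is None:          # cold path: pay the load cost once
        _model = load_model()
    return _model               # warm path: model already resident

def handler(request):
    model = get_model()
    return model.predict(request["input"])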

3. Smaller Models

Model Size → Load Time
70B parameters → 30-60 seconds
7B parameters → 5-10 seconds
1B parameters → 1-2 seconds
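
These load times follow roughly from weight size divided by effective bandwidth. A back-of-the-envelope sketch, assuming fp16 weights (2 bytes per parameter) and an assumed 5 GB/s of effective storage/network bandwidth:

# Rough model load-time estimate: bytes of weights / effective bandwidth.
# The 2 bytes/param (fp16) and 5 GB/s figures are assumptions.
BYTES_PER_PARAM = 2          # fp16 weights
BANDWIDTH_GB_S = 5           # effective bandwidth (assumed)

for params_billion in (70, 7, 1):
    size_gb = params_billion * 1e9 * BYTES_PER_PARAM / 1e9
    load_seconds = size_gb / BANDWIDTH_GB_S
    print(f"{params_billion}B params ≈ {size_gb:.0f} GB ≈ {load_seconds:.1f}s to load")
# 70B params ≈ 140 GB ≈ 28.0s to load
# 7B params ≈ 14 GB ≈ 2.8s to load
# 1B params ≈ 2 GB ≈ 0.4s to load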

4. Optimized Model Formats

Original Model: PyTorch (.pt)
→ Convert to TensorRT or ONNX (commonly served via NVIDIA Triton Inference Server)
→ 2-5x faster loading
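
As one example of this kind of conversion, a PyTorch module can be exported to ONNX. A minimal sketch, using a tiny placeholder model and input shape chosen purely for illustration:

# Minimal PyTorch → ONNX export; the tiny model and input shape are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 8))
model.eval()

dummy_input = torch.randn(1, 128)           # example input used to trace the graph
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",                           # exported artifact to deploy
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},   # allow variable batch size
)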

Serverless GPU Pricing Models

Pay-Per-Request

Pricing: $0.001 per request

Simple to understand
Best for: Consistent request sizes

Example:
1M requests × $0.001 = $1,000/month

Pay-Per-Second

Pricing: $0.0001 per GPU-second

Fairer for variable workloads
Best for: Mixed short/long requests

Example:
Request A: 0.5s × $0.0001 = $0.00005
Request B: 10s × $0.0001 = $0.001

Pay-Per-Token (LLM Specific)

Pricing:
- Input: $0.001 per 1K tokens
- Output: $0.002 per 1K tokens

Best for: LLM inference

Example:
1K input tokens + 500 output tokens
= $0.001 + $0.001 = $0.002
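
Which scheme is cheapest depends on the shape of your workload. A small sketch comparing all three for one hypothetical monthly workload, using the example rates quoted above (the per-request GPU time and token counts are assumptions):

# Compare the three pricing schemes for one hypothetical workload,
# using the example rates above. Workload parameters are assumptions.
requests_per_month = 1_000_000
avg_gpu_seconds = 2.0            # assumed average GPU time per request
avg_input_tokens = 1_000         # assumed average prompt size
avg_output_tokens = 500          # assumed average completion size

per_request = requests_per_month * 0.001
per_second = requests_per_month * avg_gpu_seconds * 0.0001
per_token = requests_per_month * (
    avg_input_tokens / 1_000 * 0.001 + avg_output_tokens / 1_000 * 0.002
)

print(f"Pay-per-request: ${per_request:,.0f}/month")
print(f"Pay-per-second:  ${per_second:,.0f}/month")
print(f"Pay-per-token:   ${per_token:,.0f}/month")
# Pay-per-request: $1,000/month
# Pay-per-second:  $200/month
# Pay-per-token:   $2,000/month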

Auto-Scaling

Serverless platforms automatically scale based on demand:

Traffic Pattern:
                    ┌─────┐
                    │     │
             ┌──────┘     └──────┐
      ┌──────┘                   └──────┐
──────┘                                  └──────
  8AM                12PM                  8PM

GPU Instances:
                    [10 instances]
             [5 instances]    [5 instances]
      [2 instances]                [2 instances]
[1 instance]                              [1 instance]

Scaling Metrics

  • Queue depth: Requests waiting
  • Latency: Response time increasing
  • Utilization: GPU usage percentage

Scaling Configuration

autoscaling:
  min_instances: 1
  max_instances: 50
  target_utilization: 70%
  scale_up_threshold: 10 requests queued
  scale_down_delay: 5 minutes
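
Conceptually, the autoscaler turns those metrics and thresholds into a scale-up or scale-down decision on each evaluation tick. A simplified sketch of that logic (not any particular platform's implementation; the scale-down rule here is an assumption):

# Simplified autoscaling decision using thresholds like the config above.
# Real platforms add cooldowns, scale-down delays, and metric smoothing.
def desired_instances(current, queued_requests, gpu_utilization,
                      min_instances=1, max_instances=50,
                      target_utilization=0.70, scale_up_queue_threshold=10):
    if queued_requests >= scale_up_queue_threshold or gpu_utilization > target_utilization:
        current += 1                      # scale up one step
    elif gpu_utilization < target_utilization / 2 and queued_requests == 0:
        current -= 1                      # scale down when clearly underutilized
    return max(min_instances, min(max_instances, current))

# Example: 12 requests queued at 85% utilization on 4 instances → 5 instances
print(desired_instances(current=4, queued_requests=12, gpu_utilization=0.85))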

Serverless GPU Architecture

Typical Provider Architecture

┌─────────────────────────────────────────────────────────┐
│                   Load Balancer                          │
└─────────────────────────┬───────────────────────────────┘
                          │
┌─────────────────────────┴───────────────────────────────┐
│                   API Gateway                            │
│  • Authentication                                       │
│  • Rate limiting                                        │
│  • Request routing                                      │
└─────────────────────────┬───────────────────────────────┘
                          │
┌─────────────────────────┴───────────────────────────────┐
│                   Orchestrator                           │
│  • Find available GPU                                   │
│  • Start new instances                                  │
│  • Queue management                                     │
└─────────────────────────┬───────────────────────────────┘
                          │
         ┌────────────────┼────────────────┐
         │                │                │
    ┌────┴────┐     ┌────┴────┐     ┌────┴────┐
    │GPU Pod 1│     │GPU Pod 2│     │GPU Pod N│
    │ Model A │     │ Model A │     │ Model B │
    └─────────┘     └─────────┘     └─────────┘
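
In code, the orchestrator's core job amounts to: find a warm pod already serving the requested model, otherwise start one and pay the cold start. A highly simplified sketch of that routing decision (the data structures and names are illustrative, not a real provider's internals):

# Highly simplified orchestrator routing: reuse a warm pod if one is
# serving the requested model, otherwise start one (cold start).
from dataclasses import dataclass, field

@dataclass
class Pod:
    model: str
    busy: bool = False

@dataclass
class Orchestrator:
    pods: list = field(default_factory=list)

    def route(self, model: str) -> Pod:
        for pod in self.pods:
            if pod.model == model and not pod.busy:
                return pod                      # warm path: reuse existing pod
        pod = Pod(model=model)                  # cold start: allocate + load model
        self.pods.append(pod)
        return pod

orch = Orchestrator(pods=[Pod("model-a"), Pod("model-a", busy=True)])
print(orch.route("model-a").model)   # reuses the idle warm pod
print(orch.route("model-b").model)   # no pod for model-b → cold start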

When Serverless GPU Excels

1. Variable Traffic

Monday:    100 requests
Tuesday:   10,000 requests
Wednesday: 500 requests

Serverless: Pay for actual usage
Dedicated:  Pay for peak capacity (10K/day)
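
The cost difference is easy to work through. A sketch, assuming a hypothetical $0.001-per-request serverless price and a $2.00/hour dedicated GPU sized for the Tuesday peak (both prices are illustrative):

# Serverless vs dedicated for bursty traffic, using assumed prices:
# $0.001/request serverless, $2.00/hour for one dedicated GPU that
# covers the peak day. Both figures are illustrative.
daily_requests = [100, 10_000, 500]          # Mon, Tue, Wed from the example

serverless_cost = sum(daily_requests) * 0.001
dedicated_cost = 3 * 24 * 2.00               # GPU runs all three days regardless of load

print(f"Serverless: ${serverless_cost:.2f} for the three days")
print(f"Dedicated:  ${dedicated_cost:.2f} for the three days")
# Serverless: $10.60 for the three days
# Dedicated:  $144.00 for the three days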

2. Unpredictable Demand

Startup launching new feature:
"Will we get 100 or 100,000 users?"

Serverless: Scales automatically
Dedicated:  Must provision upfront

3. Multiple Models

You need to serve:
- Model A (popular, 1M requests/day)
- Model B (occasional, 1K requests/day)
- Model C (rare, 10 requests/day)

Dedicated: 3 GPUs always running
Serverless: Pay only when models are used

4. Development and Testing

Development:
- Sporadic testing
- Hours between requests
- Multiple developers

Serverless: Pay only during tests
Dedicated:  GPU sits idle 90% of time

5. Burst Capacity

Normal:  100 req/s
Black Friday: 10,000 req/s

Serverless: Auto-scales to handle burst
Dedicated:  Need 100x over-provisioning

Serverless GPU Limitations

1. Latency-Sensitive Applications

Cold start latency (5-30s) is unacceptable for:
- Real-time chat
- Interactive applications
- Gaming
- Live streaming

2. Long-Running Workloads

Model Training:
- Runs for hours/days
- Needs persistent state
- GPU always utilized

→ Serverless adds overhead with no benefit here

3. Large Model Loading

405B parameter model:
- 800GB model weights
- 15+ minutes to load
- Cold start is catastrophic

→ Must keep dedicated instances warm

4. Custom Infrastructure

You need:
- Specific CUDA version
- Custom libraries
- Special hardware configuration

→ Serverless platforms may not support these requirements

Summary

SERVERLESS GPU IS GREAT FOR:
✓ Variable/unpredictable traffic
✓ API-based inference
✓ Multi-model serving
✓ Development/testing
✓ Cost optimization (low utilization)
✓ Auto-scaling needs

SERVERLESS GPU IS NOT GREAT FOR:
✗ Latency-critical applications
✗ Long-running training
✗ Very large models
✗ Custom infrastructure needs
✗ High sustained utilization

What's Next?

In the next chapter, we'll explore dedicated GPU instances - when they make sense and how to use them effectively.