Chapter 1: Introduction

Introduction to GPU Computing Models

Understanding the fundamental differences between serverless and dedicated GPU computing - the two main paradigms for accessing GPU resources.

Before diving into specifics, let's understand the two fundamental approaches to GPU computing in the cloud.

The Core Difference

DEDICATED GPU:
"I rent a GPU by the hour. It's mine exclusively while I rent it."
→ Like renting an apartment

SERVERLESS GPU:
"I send requests. Someone's GPU handles them. I pay per request."
→ Like staying at a hotel

The Traditional Path: Dedicated Instances

This is the classic cloud computing model applied to GPUs:

┌─────────────────────────────────────────────────┐
│                  Your Dedicated GPU              │
│                                                 │
│  ┌─────────────────────────────────────────┐   │
│  │  You have:                               │   │
│  │  • SSH access                            │   │
│  │  • Install any software                  │   │
│  │  • Run anything (training, inference)    │   │
│  │  • Full GPU memory                       │   │
│  │  • GPU available 24/7 while renting      │   │
│  └─────────────────────────────────────────┘   │
│                                                 │
│  You pay: $X/hour regardless of usage           │
└─────────────────────────────────────────────────┘

Example: Running Inference

# On a dedicated instance, the server process runs continuously.
# InferenceServer here is a stand-in for whatever serving stack you
# deploy (e.g. vLLM or TGI); the exact API will differ by framework.
server = InferenceServer(model="llama-70b")
server.start()  # Runs 24/7 -- billed whether or not requests arrive

# Every request is handled by the same long-lived server
result = server.inference("Hello, world!")

The Modern Path: Serverless GPU

GPU compute is abstracted away - you just send requests:

┌─────────────────────────────────────────────────┐
│              Serverless GPU Platform             │
│                                                 │
│  ┌───────┐      ┌─────────┐      ┌───────┐    │
│  │Request├─────▶│ Magic   ├─────▶│Result │    │
│  │       │      │ Happens │      │       │    │
│  └───────┘      └─────────┘      └───────┘    │
│                                                 │
│  You DON'T know/care:                          │
│  • Which GPU handled it                         │
│  • Where the GPU is located                     │
│  • How many GPUs are available                  │
│  • Who manages the infrastructure              │
│                                                 │
│  You pay: $Y per request or per GPU-second      │
└─────────────────────────────────────────────────┘

Example: Running Inference

# Serverless - no server to manage. This float16 client is illustrative;
# each provider (Modal, RunPod, Float16) ships its own SDK with a
# similar request-shaped API.
import float16

result = float16.inference(
    model="llama-70b",
    prompt="Hello, world!",
)
# GPU allocated → request processed → GPU released: you pay only for that span

Quick Comparison

Aspect         Dedicated           Serverless
-------------  ------------------  -------------------
Billing        Per hour            Per request
Cold Start     None (always on)    Possible delay
Scaling        Manual              Automatic
Utilization    You optimize        Provider optimizes
Control        Full                Limited
Setup          Complex             Simple

Why This Matters

Utilization Problem

Most GPU workloads don't keep the GPU busy 100% of the time:

Dedicated Instance (24 hours):
┌────────────────────────────────────────────────┐
│████░░░░████░░░░░░░░████░░░░░░░░████░░░░░░░░░░░│
└────────────────────────────────────────────────┘
  ↑        ↑              ↑            ↑
 Burst   Burst          Burst        Burst

Actual GPU utilization: 20%
You paid for: 100% of the time
Wasted: 80% of your money

Serverless Solution

Serverless (Same workload):
┌────────────────────────────────────────────────┐
│Request → Process → Done (Pay for 50ms)         │
│    ...    (no cost while idle)    ...          │
│Request → Process → Done (Pay for 50ms)         │
└────────────────────────────────────────────────┘

You pay only for actual compute time.
No idle costs.
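
To make the waste concrete, here is a minimal Python sketch of the math
above. The hourly and per-second rates are placeholder assumptions, not
real provider prices; the point is that at 20% utilization a dedicated
GPU's effective cost per useful hour is five times its sticker rate.

# Placeholder rates -- substitute your provider's actual pricing
DEDICATED_RATE_PER_HOUR = 2.00    # assumed $/hour for one dedicated GPU
SERVERLESS_RATE_PER_SEC = 0.001   # assumed $/GPU-second on serverless

HOURS_PER_MONTH = 720             # 30 days
UTILIZATION = 0.20                # the 20% busy time from the diagram
SECONDS_PER_REQUEST = 0.050       # 50 ms of compute per request

# Dedicated: billed for every hour, busy or idle
dedicated_monthly = DEDICATED_RATE_PER_HOUR * HOURS_PER_MONTH
busy_hours = HOURS_PER_MONTH * UTILIZATION
effective_rate = dedicated_monthly / busy_hours   # $ per *useful* GPU-hour

# Serverless: billed only for the busy seconds
busy_seconds = busy_hours * 3600
serverless_monthly = busy_seconds * SERVERLESS_RATE_PER_SEC

print(f"Dedicated:  ${dedicated_monthly:,.2f}/month "
      f"(${effective_rate:.2f} per useful GPU-hour)")
print(f"Serverless: ${serverless_monthly:,.2f}/month for the same work")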

The Spectrum of Options

GPU access exists on a spectrum:

Most Control                             Least Control
Most Management                         Least Management

     │                                          │
     ▼                                          ▼
┌─────────┬───────────┬────────────┬───────────┐
│ Bare    │ Dedicated │ Managed    │ Serverless│
│ Metal   │ VM        │ Container  │ Function  │
│         │           │            │           │
│ Colo    │ IaaS      │ PaaS       │ FaaS/AaaS │
└─────────┴───────────┴────────────┴───────────┘

Real-World Analogy: Transportation

DEDICATED GPU = Owning a Car
• Always available
• Pay insurance, maintenance, parking even when parked
• Full control over routes, timing
• Best for daily commuting

SERVERLESS GPU = Uber/Lyft
• Available on demand
• Pay only per trip
• No maintenance worries
• Best for occasional trips

When Did Serverless GPU Emerge?

Timeline of GPU access evolution:

2012: GPU cloud (AWS GPU instances)
      → Rent dedicated GPUs by the hour

2017: Container-based GPU (NVIDIA Docker)
      → Deploy GPU containers easily

2020: Managed GPU services (SageMaker, etc.)
      → Simplified GPU deployment

2022+: Serverless GPU (Modal, RunPod, Float16)
      → Pay-per-request GPU inference

The Business Case

Startup Scenario

Traditional Approach:
- Rent 4x A100 GPUs: $12,000/month
- Average utilization: 15%
- Effective cost per used GPU-hour: $27.78 ($12,000 ÷ (4 GPUs × 720 h × 15%))

Serverless Approach:
- Pay per inference: $0.001/request
- 1M requests/month: $1,000/month
- Scale automatically with demand
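
These two bills imply a break-even volume. A quick back-of-the-envelope
sketch in Python, using only the figures quoted in this scenario:

# Break-even for the startup scenario above, using only its own numbers
dedicated_monthly = 12_000.0   # 4x A100, $/month
price_per_request = 0.001      # serverless $/request

# Below this volume serverless is cheaper; above it, dedicated wins
break_even = dedicated_monthly / price_per_request
print(f"Break-even: {break_even:,.0f} requests/month")   # 12,000,000

monthly_requests = 1_000_000   # the stated traffic
print(f"Serverless: ${monthly_requests * price_per_request:,.0f}/month "
      f"vs ${dedicated_monthly:,.0f}/month dedicated")

At the stated 1M requests/month, the startup is an order of magnitude
below break-even, which is why serverless wins here.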

Enterprise Scenario

Traditional Approach:
- Predictable workload
- 24/7 training jobs
- High utilization (>70%)
- Dedicated: Better economics

Serverless Approach:
- Good for inference APIs
- Variable traffic patterns
- Dev/test environments
- Burst capacity needs

What's Next?

In the next chapter, we'll dive deep into serverless GPU - how it works, cold starts, pricing models, and when it excels.