Chapter 5: GPU Workloads

GPU Workloads: VMs vs Containers

Understanding how Virtual Machines and Containers handle GPU resources for AI/ML workloads, and choosing the right approach for your GPU computing needs.

GPU computing for AI/ML has unique requirements. Let's explore how VMs and Containers handle GPUs and which approach works best for different scenarios.

GPU Virtualization Technologies

GPU Passthrough (VMs)

The entire GPU is assigned to a single VM:

┌─────────────────────────────────────┐
│           VM with GPU               │
│  ┌─────────────────────────────┐    │
│  │     AI/ML Application       │    │
│  ├─────────────────────────────┤    │
│  │     CUDA / cuDNN            │    │
│  ├─────────────────────────────┤    │
│  │     NVIDIA Driver           │    │
│  └─────────────────────────────┘    │
├─────────────────────────────────────┤
│          Hypervisor                 │
│              ↓                      │
│         [PCIe Passthrough]          │
│              ↓                      │
│          GPU (A100)                 │
└─────────────────────────────────────┘

Pros:

  • Full GPU performance (100%)
  • All GPU features available
  • Mature, stable technology

Cons:

  • Each GPU is dedicated to a single VM (no sharing)
  • GPU memory cannot be shared across workloads
  • Expensive (dedicated GPU per workload)
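
For orientation, here is a minimal sketch of what passthrough looks like on a Linux/KVM host using VFIO. The PCI address and VM parameters are illustrative, not a Float16-specific setup, and the host needs IOMMU enabled on the kernel command line:

# Find the GPU's PCI address (the address 3b:00.0 below is hypothetical)
lspci -nn | grep -i nvidia

# Load the VFIO driver, unbind the GPU from the host driver,
# and bind it to vfio-pci so the hypervisor can hand it to a guest
sudo modprobe vfio-pci
echo 0000:3b:00.0 | sudo tee /sys/bus/pci/devices/0000:3b:00.0/driver/unbind
echo vfio-pci     | sudo tee /sys/bus/pci/devices/0000:3b:00.0/driver_override
echo 0000:3b:00.0 | sudo tee /sys/bus/pci/drivers_probe

# Start a KVM guest with the whole GPU attached
qemu-system-x86_64 -enable-kvm -m 64G -smp 16 \
  -device vfio-pci,host=3b:00.0 \
  -drive file=vm-disk.qcow2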

vGPU (Virtual GPU)

GPU is partitioned and shared among multiple VMs:

┌─────────┐ ┌─────────┐ ┌─────────┐
│   VM1   │ │   VM2   │ │   VM3   │
│  vGPU   │ │  vGPU   │ │  vGPU   │
│  (8GB)  │ │  (8GB)  │ │  (8GB)  │
└────┬────┘ └────┬────┘ └────┬────┘
     └───────────┼───────────┘
          ┌──────┴──────┐
          │   vGPU      │
          │  Manager    │
          └──────┬──────┘
          ┌──────┴──────┐
          │  A100 (40GB)│
          └─────────────┘

Technologies:

  • NVIDIA vGPU (formerly GRID)
  • NVIDIA MIG (Multi-Instance GPU)
  • AMD MxGPU

Pros:

  • Share expensive GPU hardware
  • Isolated GPU memory per VM
  • Good for inference workloads

Cons:

  • Licensing costs (NVIDIA vGPU)
  • Performance overhead (5-15%)
  • Not all GPUs support it
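
On KVM hosts, NVIDIA vGPU instances are typically created through the kernel's mediated-device (mdev) framework. A rough sketch, with an illustrative PCI address and vGPU type name:

# List the vGPU types the host driver exposes for this physical GPU
ls /sys/class/mdev_bus/0000:3b:00.0/mdev_supported_types/

# Create one vGPU instance by writing a UUID to the chosen type's create node
# (the type name nvidia-471 is illustrative; each type maps to a profile/size)
uuidgen | sudo tee \
  /sys/class/mdev_bus/0000:3b:00.0/mdev_supported_types/nvidia-471/create

# The resulting mdev device can then be attached to a VM (e.g. via libvirt)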

NVIDIA MIG (Multi-Instance GPU)

Hardware-level GPU partitioning (A100, H100):

┌─────────────────────────────────────┐
│            NVIDIA A100              │
├───────────┬───────────┬─────────────┤
│  MIG 1    │  MIG 2    │   MIG 3     │
│  (1g.5gb) │  (2g.10gb)│   (3g.20gb) │
│           │           │             │
│  └─VM1    │  └─VM2    │   └─VM3     │
└───────────┴───────────┴─────────────┘

Pros:

  • Hardware isolation
  • Guaranteed resources
  • No vGPU licensing

Cons:

  • Only supported on high-end data-center GPUs (e.g., A30, A100, H100)
  • Fixed partition sizes
  • Limited flexibility
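
As a sketch, the three instances in the diagram above could be created with nvidia-smi roughly like this (profile names are A100 40GB profiles; list what your GPU supports with nvidia-smi mig -lgip):

# Enable MIG mode on GPU 0 (requires no running workloads; may need a GPU reset)
sudo nvidia-smi -i 0 -mig 1

# Create a 1g.5gb, a 2g.10gb, and a 3g.20gb GPU instance,
# plus a default compute instance inside each (-C)
sudo nvidia-smi mig -cgi 1g.5gb,2g.10gb,3g.20gb -C

# List the resulting GPU instances and MIG devices
sudo nvidia-smi mig -lgi
nvidia-smi -L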

GPUs in Containers

NVIDIA Container Toolkit

Containers access GPUs through the NVIDIA Container Toolkit:

┌─────────────────────────────────────┐
│         Container                    │
│  ┌─────────────────────────────┐    │
│  │     AI/ML Application       │    │
│  ├─────────────────────────────┤    │
│  │     CUDA / cuDNN            │    │
│  └─────────────────────────────┘    │
├─────────────────────────────────────┤
│     NVIDIA Container Runtime        │
├─────────────────────────────────────┤
│     NVIDIA Driver (Host)            │
├─────────────────────────────────────┤
│            GPU                       │
└─────────────────────────────────────┘

# Run container with GPU access
docker run --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
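
You can also expose only specific GPUs to a container rather than all of them (a sketch; device indices depend on the host):

# Give the container only GPU 0 (note the quoting --gpus expects for device lists)
docker run --rm --gpus '"device=0"' nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

# Equivalent selection via the NVIDIA runtime's environment variable
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 \
  nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi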

Pros:

  • Near-native GPU performance
  • Easy to set up
  • Works with any NVIDIA GPU

Cons:

  • Shared GPU memory space
  • No hardware isolation
  • Requires careful resource management

Time-Slicing GPUs

Multiple containers share a GPU via time-slicing:

Timeline:
├── Container A ──┤── Container B ──┤── Container A ──┤
     (100ms)           (100ms)           (100ms)

Good for:

  • Development environments
  • Light inference workloads
  • Cost optimization

Not good for:

  • Training large models
  • Real-time inference
  • Workloads that need predictable latency
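
On Kubernetes, time-slicing is commonly enabled through the NVIDIA device plugin's sharing configuration. A minimal sketch (the replica count is an example, and the ConfigMap wiring depends on how the plugin or GPU Operator is installed):

# Advertise each physical GPU as 4 schedulable nvidia.com/gpu resources
cat <<'EOF' > time-slicing-config.yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
EOF

# Supply it to the device plugin / GPU Operator as a ConfigMap
kubectl create configmap time-slicing-config \
  -n gpu-operator --from-file=time-slicing-config.yaml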

Performance Comparison

GPU Passthrough VM vs Container

Metric             VM (Passthrough)   Container
GPU Performance    100%               ~100%
Setup Complexity   High               Low
Flexibility        Low                High
Isolation          Hardware           Process
Memory Overhead    High (VM OS)       Low

Benchmark: Training BERT

Environment: NVIDIA A100, 40GB
Workload: Fine-tuning BERT-large

┌──────────────────────────────────────┐
│ VM (GPU Passthrough)    │  98.5%     │
│ Container (native)      │  99.8%     │
│ Container (time-slice)  │  45-70%    │
│ VM (vGPU)               │  85-92%    │
└──────────────────────────────────────┘
          Performance vs Bare Metal

Use Case Recommendations

Choose VMs with GPU Passthrough When:

  • Multi-tenant GPU cloud - Different customers need isolation
  • Compliance requirements - PCI-DSS, HIPAA workloads
  • Long-running training - Dedicated GPU for days/weeks
  • Windows GPU workloads - CUDA on Windows

Choose VMs with vGPU/MIG When:

  • Inference services - Many small models
  • Development environments - Shared GPU for developers
  • GPU oversubscription - More users than GPUs

Choose Containers for GPU When:

  • Kubernetes deployments - Cloud-native ML platforms (see the pod sketch after this list)
  • CI/CD pipelines - Quick GPU testing
  • Batch processing - Short-lived GPU jobs
  • Microservices - Inference APIs at scale
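
For the Kubernetes case above, a minimal sketch of a pod requesting a single GPU through the NVIDIA device plugin (names and image are illustrative):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1    # scheduled onto a node with a free GPU
EOF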

Float16's Approach

At Float16, we offer multiple GPU access patterns:

IaaS (VM-style)

- Full GPU passthrough
- SSH access
- Install anything
- Best for: Training, custom environments

PaaS (Container-style)

- Managed containers
- GPU-enabled pods
- Pre-configured environments
- Best for: Deployment, scaling

AaaS (API-style)

- No GPU management
- REST API access
- Pay per request
- Best for: Quick inference, prototypes

Best Practices for GPU Workloads

1. Match Workload to Platform

Workload              Recommended        Why
LLM Training          VM / IaaS          Long-running, needs stability
Model Serving         Container          Scale up/down quickly
Development           Container / vGPU   Cost-efficient sharing
Production Inference  Container + K8s    Auto-scaling, orchestration

2. Optimize GPU Memory

import torch
from transformers import AutoModel

# Don't load everything into GPU memory up front:
# load on CPU first, move to the GPU only when needed
# (example: a Hugging Face model)
model = AutoModel.from_pretrained("bert-base-uncased")
model.to("cuda")

# Use gradient checkpointing to trade compute for memory during training
model.gradient_checkpointing_enable()

# Release cached GPU memory when done
torch.cuda.empty_cache()

3. Use GPU Monitoring

Track GPU utilization to right-size instances:

# Real-time monitoring (power/temperature, utilization, memory)
nvidia-smi dmon -s pum

# Key metrics to watch:
# - GPU utilization (sm %)
# - Memory usage (fb, in MiB)
# - Temperature and power draw
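
To capture utilization over time for right-sizing decisions, the query interface can log to CSV, for example:

# Log utilization and memory every 5 seconds, appending to a CSV file
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
  --format=csv -l 5 >> gpu-usage.csv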

Summary

Approach                 Isolation   Performance   Flexibility   Cost
VM + Passthrough         Excellent   100%          Low           High
VM + vGPU                Good        85-95%        Medium        Medium
VM + MIG                 Excellent   95-100%       Low           Medium
Container                Process     99-100%       High          Low
Container + Time-slice   Process     Variable      High          Lowest

For most AI/ML workloads today, containers with proper GPU access provide the best balance of performance, flexibility, and cost. Use VMs when strong isolation or specific OS requirements are needed.

Congratulations!

You've completed the VM vs Container course! You now understand:

  • How VMs and Containers work
  • Their strengths and weaknesses
  • How they handle GPU workloads
  • When to use each approach

Ready to apply this knowledge? Explore Float16's GPU platform to deploy your AI workloads using the right technology for your needs.
