Chapter 5: GPU Workloads

GPU Workloads: VMs vs Containers

Understanding how Virtual Machines and Containers handle GPU resources for AI/ML workloads, and choosing the right approach for your GPU computing needs.

GPU computing for AI/ML has unique requirements. Let's explore how VMs and Containers handle GPUs and which approach works best for different scenarios.

GPU Virtualization Technologies

GPU Passthrough (VMs)

The entire GPU is assigned to a single VM:

┌─────────────────────────────────────┐
│           VM with GPU               │
│  ┌─────────────────────────────┐    │
│  │     AI/ML Application       │    │
│  ├─────────────────────────────┤    │
│  │     CUDA / cuDNN            │    │
│  ├─────────────────────────────┤    │
│  │     NVIDIA Driver           │    │
│  └─────────────────────────────┘    │
├─────────────────────────────────────┤
│          Hypervisor                 │
│              ↓                      │
│         [PCIe Passthrough]          │
│              ↓                      │
│          GPU (A100)                 │
└─────────────────────────────────────┘

Pros:

  • Full GPU performance (100%)
  • All GPU features available
  • Mature, stable technology

Cons:

  • Each GPU is dedicated to a single VM (no sharing)
  • GPU memory cannot be shared across workloads
  • Expensive (dedicated GPU per workload)
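
For orientation, here is a minimal sketch of what passthrough looks like on a Linux/KVM host using VFIO. The PCI address and VM parameters are illustrative, not a Float16-specific setup, and the host needs IOMMU enabled on the kernel command line:

# Find the GPU's PCI address (the address 3b:00.0 below is hypothetical)
lspci -nn | grep -i nvidia

# Load the VFIO driver, unbind the GPU from the host driver,
# and bind it to vfio-pci so the hypervisor can hand it to a guest
sudo modprobe vfio-pci
echo 0000:3b:00.0 | sudo tee /sys/bus/pci/devices/0000:3b:00.0/driver/unbind
echo vfio-pci     | sudo tee /sys/bus/pci/devices/0000:3b:00.0/driver_override
echo 0000:3b:00.0 | sudo tee /sys/bus/pci/drivers_probe

# Start a KVM guest with the whole GPU attached
qemu-system-x86_64 -enable-kvm -m 64G -smp 16 \
  -device vfio-pci,host=3b:00.0 \
  -drive file=vm-disk.qcow2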

vGPU (Virtual GPU)

GPU is partitioned and shared among multiple VMs:

┌─────────┐ ┌─────────┐ ┌─────────┐
│   VM1   │ │   VM2   │ │   VM3   │
│  vGPU   │ │  vGPU   │ │  vGPU   │
│  (8GB)  │ │  (8GB)  │ │  (8GB)  │
└────┬────┘ └────┬────┘ └────┬────┘
     └───────────┼───────────┘
          ┌──────┴──────┐
          │   vGPU      │
          │  Manager    │
          └──────┬──────┘
          ┌──────┴──────┐
          │  A100 (40GB)│
          └─────────────┘

Technologies:

  • NVIDIA vGPU (formerly GRID)
  • NVIDIA MIG (Multi-Instance GPU)
  • AMD MxGPU

Pros:

  • Share expensive GPU hardware
  • Isolated GPU memory per VM
  • Good for inference workloads

Cons:

  • Licensing costs (NVIDIA vGPU)
  • Performance overhead (5-15%)
  • Not all GPUs support it
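
On KVM hosts, NVIDIA vGPU instances are typically created through the kernel's mediated-device (mdev) framework. A rough sketch, with an illustrative PCI address and vGPU type name:

# List the vGPU types the host driver exposes for this physical GPU
ls /sys/class/mdev_bus/0000:3b:00.0/mdev_supported_types/

# Create one vGPU instance by writing a UUID to the chosen type's create node
# (the type name nvidia-471 is illustrative; each type maps to a profile/size)
uuidgen | sudo tee \
  /sys/class/mdev_bus/0000:3b:00.0/mdev_supported_types/nvidia-471/create

# The resulting mdev device can then be attached to a VM (e.g. via libvirt)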

NVIDIA MIG (Multi-Instance GPU)

Hardware-level GPU partitioning (A100, H100):

┌─────────────────────────────────────┐
│            NVIDIA A100              │
├───────────┬───────────┬─────────────┤
│  MIG 1    │  MIG 2    │   MIG 3     │
│  (1g.5gb) │  (2g.10gb)│   (3g.20gb) │
│           │           │             │
│  └─VM1    │  └─VM2    │   └─VM3     │
└───────────┴───────────┴─────────────┘

Pros:

  • Hardware isolation
  • Guaranteed resources
  • No vGPU licensing

Cons:

  • Only supported on high-end data-center GPUs (e.g., A30, A100, H100)
  • Fixed partition sizes
  • Limited flexibility
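
As a sketch, the three instances in the diagram above could be created with nvidia-smi roughly like this (profile names are A100 40GB profiles; list what your GPU supports with nvidia-smi mig -lgip):

# Enable MIG mode on GPU 0 (requires no running workloads; may need a GPU reset)
sudo nvidia-smi -i 0 -mig 1

# Create a 1g.5gb, a 2g.10gb, and a 3g.20gb GPU instance,
# plus a default compute instance inside each (-C)
sudo nvidia-smi mig -cgi 1g.5gb,2g.10gb,3g.20gb -C

# List the resulting GPU instances and MIG devices
sudo nvidia-smi mig -lgi
nvidia-smi -L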

GPUs in Containers

NVIDIA Container Toolkit

Containers access GPUs through the NVIDIA Container Toolkit:

┌─────────────────────────────────────┐
│         Container                    │
│  ┌─────────────────────────────┐    │
│  │     AI/ML Application       │    │
│  ├─────────────────────────────┤    │
│  │     CUDA / cuDNN            │    │
│  └─────────────────────────────┘    │
├─────────────────────────────────────┤
│     NVIDIA Container Runtime        │
├─────────────────────────────────────┤
│     NVIDIA Driver (Host)            │
├─────────────────────────────────────┤
│            GPU                       │
└─────────────────────────────────────┘

# Run container with GPU access
docker run --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
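
You can also expose only specific GPUs to a container rather than all of them (a sketch; device indices depend on the host):

# Give the container only GPU 0 (note the quoting --gpus expects for device lists)
docker run --rm --gpus '"device=0"' nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

# Equivalent selection via the NVIDIA runtime's environment variable
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 \
  nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi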

Pros:

  • Near-native GPU performance
  • Easy to set up
  • Works with any NVIDIA GPU

Cons:

  • Shared GPU memory space
  • No hardware isolation
  • Requires careful resource management

Time-Slicing GPUs

Multiple containers share a GPU via time-slicing:

Timeline:
├── Container A ──┤── Container B ──┤── Container A ──┤
     (100ms)           (100ms)           (100ms)

Good for:

  • Development environments
  • Light inference workloads
  • Cost optimization

Not good for:

  • Training large models
  • Real-time inference
  • Workloads that need predictable latency
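
On Kubernetes, time-slicing is commonly enabled through the NVIDIA device plugin's sharing configuration. A minimal sketch (the replica count is an example, and the ConfigMap wiring depends on how the plugin or GPU Operator is installed):

# Advertise each physical GPU as 4 schedulable nvidia.com/gpu resources
cat <<'EOF' > time-slicing-config.yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
EOF

# Supply it to the device plugin / GPU Operator as a ConfigMap
kubectl create configmap time-slicing-config \
  -n gpu-operator --from-file=time-slicing-config.yaml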

Performance Comparison

GPU Passthrough VM vs Container

Metric             VM (Passthrough)   Container
GPU Performance    100%               ~100%
Setup Complexity   High               Low
Flexibility        Low                High
Isolation          Hardware           Process
Memory Overhead    High (VM OS)       Low

Benchmark: Training BERT

Environment: NVIDIA A100, 40GB
Workload: Fine-tuning BERT-large

┌──────────────────────────────────────┐
│ VM (GPU Passthrough)    │  98.5%     │
│ Container (native)      │  99.8%     │
│ Container (time-slice)  │  45-70%    │
│ VM (vGPU)               │  85-92%    │
└──────────────────────────────────────┘
          Performance vs Bare Metal

Use Case Recommendations

Choose VMs with GPU Passthrough When:

  • Multi-tenant GPU cloud - Different customers need isolation
  • Compliance requirements - PCI-DSS, HIPAA workloads
  • Long-running training - Dedicated GPU for days/weeks
  • Windows GPU workloads - CUDA on Windows

Choose VMs with vGPU/MIG When:

  • Inference services - Many small models
  • Development environments - Shared GPU for developers
  • GPU oversubscription - More users than GPUs

Choose Containers for GPU When:

  • Kubernetes deployments - Cloud-native ML platforms (see the pod sketch after this list)
  • CI/CD pipelines - Quick GPU testing
  • Batch processing - Short-lived GPU jobs
  • Microservices - Inference APIs at scale
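
For the Kubernetes case above, a minimal sketch of a pod requesting a single GPU through the NVIDIA device plugin (names and image are illustrative):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1    # scheduled onto a node with a free GPU
EOF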

Float16's Approach

At Float16, we offer multiple GPU access patterns:

IaaS (VM-style)

- Full GPU passthrough
- SSH access
- Install anything
- Best for: Training, custom environments

PaaS (Container-style)

- Managed containers
- GPU-enabled pods
- Pre-configured environments
- Best for: Deployment, scaling

AaaS (API-style)

- No GPU management
- REST API access
- Pay per request
- Best for: Quick inference, prototypes

Best Practices for GPU Workloads

1. Match Workload to Platform

Workload              Recommended        Why
LLM Training          VM / IaaS          Long-running, needs stability
Model Serving         Container          Scale up/down quickly
Development           Container / vGPU   Cost-efficient sharing
Production Inference  Container + K8s    Auto-scaling, orchestration

2. Optimize GPU Memory

import torch
from transformers import AutoModel

# Don't load everything into GPU memory up front:
# load on CPU first, move to the GPU only when needed
# (example: a Hugging Face model)
model = AutoModel.from_pretrained("bert-base-uncased")
model.to("cuda")

# Use gradient checkpointing to trade compute for memory during training
model.gradient_checkpointing_enable()

# Release cached GPU memory when done
torch.cuda.empty_cache()

3. Use GPU Monitoring

Track GPU utilization to right-size instances:

# Real-time monitoring (power/temperature, utilization, memory)
nvidia-smi dmon -s pum

# Key metrics to watch:
# - GPU utilization (sm %)
# - Memory usage (fb, in MiB)
# - Temperature and power draw
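
To capture utilization over time for right-sizing decisions, the query interface can log to CSV, for example:

# Log utilization and memory every 5 seconds, appending to a CSV file
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
  --format=csv -l 5 >> gpu-usage.csv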

Summary

Approach                 Isolation   Performance   Flexibility   Cost
VM + Passthrough         Excellent   100%          Low           High
VM + vGPU                Good        85-95%        Medium        Medium
VM + MIG                 Excellent   95-100%       Low           Medium
Container                Process     99-100%       High          Low
Container + Time-slice   Process     Variable      High          Lowest

For most AI/ML workloads today, containers with proper GPU access provide the best balance of performance, flexibility, and cost. Use VMs when strong isolation or specific OS requirements are needed.

Congratulations!

You've completed the VM vs Container course! You now understand:

  • How VMs and Containers work
  • Their strengths and weaknesses
  • How they handle GPU workloads
  • When to use each approach

Ready to apply this knowledge? Explore Float16's GPU platform to deploy your AI workloads using the right technology for your needs.
