Multi-Tenancy in GPU Cloud Platforms
GPU workloads present unique multi-tenancy challenges. GPUs are expensive, powerful, and traditionally designed for single-user access. Let's explore how modern GPU cloud platforms handle multi-tenancy.
GPU Multi-Tenancy Challenges
Challenge 1: GPU Scarcity
GPUs are expensive and in high demand:
Traditional Multi-Tenancy:
- Servers: $5,000-$20,000 each
- Can run 10-100 tenants per server
- Cost per tenant: $50-$2,000
GPU Multi-Tenancy:
- GPU Server: $50,000-$500,000 each
- GPUs: $10,000-$40,000 per GPU
- Must maximize utilization to justify cost
Challenge 2: GPU Memory Isolation
GPU memory is harder to virtualize than CPU memory:
CPU Memory:              GPU Memory:
┌─────────────┐          ┌─────────────┐
│  Tenant A   │          │ Shared VRAM │
│ (Isolated)  │          │ (Harder to  │
├─────────────┤          │  isolate)   │
│  Tenant B   │          │             │
│ (Isolated)  │          └─────────────┘
└─────────────┘
Challenge 3: Long-Running Workloads
GPU workloads often run for hours or days:
Web Request: 50-200ms
GPU Training: 2 hours - 2 weeks
Traditional resource sharing doesn't work well
for jobs that hold resources for days.
Challenge 4: All-or-Nothing Execution
Many GPU workloads need the entire GPU:
Traditional Workload:
- "I need 100MB memory" → Get exactly that
GPU Workload:
- "I need 79GB VRAM" (H100 has 80GB)
- Can't easily share remaining 1GB
GPU Sharing Technologies
1. Time-Slicing (Context Switching)
GPUs rapidly switch between tenants:
Timeline:
├─ Tenant A ─┤─ Tenant B ─┤─ Tenant A ─┤─ Tenant B ─┤
     10ms         10ms         10ms         10ms
Pros:
- Simple to implement
- Works on any GPU
Cons:
- Context switch overhead
- All tenants share the same memory
- No guaranteed performance
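To see why the overhead matters, here is a minimal round-robin simulation in Python. The quantum, switch cost, and job lengths are made-up numbers; on real hardware the GPU driver performs this switching, not user code.

```python
from collections import deque

# Hypothetical jobs: (tenant, milliseconds of GPU work remaining).
jobs = deque([("A", 25), ("B", 40), ("C", 15)])
QUANTUM_MS = 10        # time slice each tenant gets per turn
CONTEXT_SWITCH_MS = 1  # overhead paid on every switch

clock = 0
while jobs:
    tenant, remaining = jobs.popleft()
    ran = min(QUANTUM_MS, remaining)
    clock += ran + CONTEXT_SWITCH_MS  # switch cost is pure waste
    remaining -= ran
    print(f"t={clock:3d}ms  ran tenant {tenant} for {ran}ms, {remaining}ms left")
    if remaining > 0:
        jobs.append((tenant, remaining))  # back of the queue
```

Every pass through the queue burns one switch's worth of time per tenant, which is why fine-grained slicing costs throughput.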
2. NVIDIA MPS (Multi-Process Service)
Multiple processes share GPU simultaneously:
┌───────────────────────────────────────┐
│                  GPU                  │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐│
│ │Process 1 │ │Process 2 │ │Process 3 ││
│ │(Tenant A)│ │(Tenant B)│ │(Tenant C)││
│ └──────────┘ └──────────┘ └──────────┘│
│              MPS Server               │
└───────────────────────────────────────┘
Pros:
- Better utilization
- Lower overhead than time-slicing
Cons:
- Limited memory isolation (pre-Volta clients share one GPU address space)
- A fatal fault in one client can affect the others
- Clients must run compatible CUDA versions
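One practical knob: MPS lets operators cap each client's share of the GPU's execution resources via the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable. A launcher sketch, where worker.py and the per-tenant percentages are hypothetical:

```python
import os
import subprocess

# Hypothetical per-tenant caps on SM occupancy (must be coordinated out
# of band; MPS does not arbitrate quotas between clients on its own).
tenants = {"tenant-a": 50, "tenant-b": 30, "tenant-c": 20}

procs = []
for name, pct in tenants.items():
    env = os.environ.copy()
    # Real MPS variable: limits the fraction of SMs this client may use.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(pct)
    # worker.py stands in for the tenant's CUDA workload.
    procs.append(subprocess.Popen(["python", "worker.py", name], env=env))

for p in procs:
    p.wait()
```

Note that this caps compute occupancy only; memory over-allocation still has to be policed separately.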
3. NVIDIA MIG (Multi-Instance GPU)
Hardware-level GPU partitioning (A100, H100):
┌─────────────────────────────────────────────────┐
│                NVIDIA A100 (80GB)               │
├─────────────┬─────────────┬─────────────────────┤
│    MIG 1    │    MIG 2    │        MIG 3        │
│  (1g.10gb)  │  (2g.20gb)  │      (4g.40gb)      │
│  Tenant A   │  Tenant B   │      Tenant C       │
│             │             │                     │
│  Isolated   │  Isolated   │      Isolated       │
│  Memory     │  Memory     │      Memory         │
│  Compute    │  Compute    │      Compute        │
└─────────────┴─────────────┴─────────────────────┘
Pros:
- Hardware isolation
- Guaranteed resources
- Fault isolation
Cons:
- Only high-end GPUs (A100, H100, A30)
- Fixed partition sizes
- Reduced flexibility
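Because each MIG profile consumes a fixed number of the GPU's seven compute slices, capacity planning becomes a packing problem. A simplified sketch for the A100 layout in the diagram above (slice counts only; real placement also has position constraints this ignores):

```python
# Compute slices consumed by common A100-80GB MIG profiles.
SLICES = {"1g.10gb": 1, "2g.20gb": 2, "3g.40gb": 3, "4g.40gb": 4, "7g.80gb": 7}
TOTAL_SLICES = 7  # an A100 exposes seven compute slices

def fits(requested: list[str]) -> bool:
    """Slice-count check only; real MIG placement has extra constraints."""
    return sum(SLICES[p] for p in requested) <= TOTAL_SLICES

# The layout from the diagram above: 1 + 2 + 4 slices fill the GPU.
print(fits(["1g.10gb", "2g.20gb", "4g.40gb"]))  # True
print(fits(["4g.40gb", "4g.40gb"]))             # False: 8 slices > 7
```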
4. vGPU (Virtual GPU)
Software-based GPU virtualization:
┌─────────────────────────────────────────────────┐
│                       VMs                       │
│    ┌─────────┐    ┌─────────┐    ┌─────────┐    │
│    │  VM 1   │    │  VM 2   │    │  VM 3   │    │
│    │ vGPU A  │    │ vGPU B  │    │ vGPU C  │    │
│    └─────────┘    └─────────┘    └─────────┘    │
├─────────────────────────────────────────────────┤
│    vGPU Manager (NVIDIA vGPU, formerly GRID)    │
├─────────────────────────────────────────────────┤
│                  Physical GPU                   │
└─────────────────────────────────────────────────┘
Pros:
- Strong isolation (VM boundary)
- Works with existing VM infrastructure
- Quality of Service controls
Cons:
- Licensing costs
- 5-15% performance overhead
- Complex setup
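Capacity planning for vGPU is simpler: profiles slice the framebuffer into equal fixed-size pieces, so it reduces to integer division. A tiny illustrative sketch (the card and profile sizes are examples, not a specific SKU's limits):

```python
def vgpu_capacity(physical_fb_gb: int, profile_fb_gb: int) -> int:
    """How many vGPUs of one framebuffer profile fit on a single card."""
    return physical_fb_gb // profile_fb_gb

# Illustrative numbers: a 48 GB card sliced into equal profiles.
print(vgpu_capacity(48, 8))   # 6 VMs at 8 GB each
print(vgpu_capacity(48, 12))  # 4 VMs at 12 GB each
```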
GPU Scheduling Strategies
Job Queue Based
┌─────────────────────────────────────────────────┐
│                  GPU Job Queue                  │
│                                                 │
│   [Train-A] [Inference-B] [Train-C] [Eval-D]    │
│                        ↓                        │
│   ┌─────────────────────────────────────────┐   │
│   │                Scheduler                │   │
│   │ • Assign jobs to available GPUs         │   │
│   │ • Respect tenant quotas                 │   │
│   │ • Optimize utilization                  │   │
│   └─────────────────────────────────────────┘   │
│      ↓         ↓         ↓         ↓            │
│   [GPU 1]   [GPU 2]   [GPU 3]   [GPU 4]         │
└─────────────────────────────────────────────────┘
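A minimal sketch of that admission loop, with hypothetical jobs and quotas; production schedulers add priorities, gang scheduling, and topology-aware placement:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    tenant: str
    gpus_needed: int

# Hypothetical cluster: 4 free GPUs, per-tenant cap on concurrent GPUs.
free_gpus = 4
quota = {"A": 2, "B": 2}
in_use = {"A": 0, "B": 0}
queue = [Job("Train-A", "A", 2), Job("Inference-B", "B", 1),
         Job("Train-C", "A", 1), Job("Eval-D", "B", 1)]

scheduled, waiting = [], []
for job in queue:
    # Admit only if GPUs are free AND the tenant stays within its quota.
    within_quota = in_use[job.tenant] + job.gpus_needed <= quota[job.tenant]
    if job.gpus_needed <= free_gpus and within_quota:
        free_gpus -= job.gpus_needed
        in_use[job.tenant] += job.gpus_needed
        scheduled.append(job.name)
    else:
        waiting.append(job.name)

print("scheduled:", scheduled)  # Train-A, Inference-B, Eval-D
print("waiting:  ", waiting)    # Train-C (tenant A is at its quota)
```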
Preemption
Allow high-priority jobs to interrupt lower-priority ones:
Before Preemption:
GPU 1: [Low Priority Job - 6 hours remaining]
High Priority Job Arrives:
GPU 1: [Low Priority] → CHECKPOINT → STOP
GPU 1: [High Priority Job] → RUNNING
[Low Priority] → QUEUED (resume later)
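A sketch of the checkpoint-then-preempt handoff, assuming jobs expose a checkpoint hook (class and method names are illustrative):

```python
class Job:
    def __init__(self, name: str, priority: int):
        self.name, self.priority = name, priority
        self.checkpoint = None

    def save_checkpoint(self) -> None:
        # Placeholder: a real job would persist model/optimizer state here.
        self.checkpoint = f"{self.name}@step-12345"

def preempt_if_needed(running: Job, incoming: Job, queue: list) -> Job:
    """Return whichever job should occupy the GPU after arbitration."""
    if incoming.priority > running.priority:
        running.save_checkpoint()  # CHECKPOINT before losing the GPU
        queue.append(running)      # QUEUED: resumes from checkpoint later
        return incoming            # high-priority job is now RUNNING
    queue.append(incoming)
    return running

queue = []
on_gpu = Job("low-prio-train", priority=1)
on_gpu = preempt_if_needed(on_gpu, Job("urgent-inference", priority=10), queue)
print(on_gpu.name, "running;", [j.name for j in queue], "queued")
```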
Fair Share Scheduling
Tenant Quotas:
- Tenant A: 40% of GPU hours
- Tenant B: 35% of GPU hours
- Tenant C: 25% of GPU hours
Scheduler ensures:
- Long-term usage matches quotas
- Short-term borrowing allowed
- No tenant starves
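One simple way to implement this is deficit-based selection: always run the tenant whose observed share is furthest below its quota. A sketch using the quotas above:

```python
# Deficit-based fair share: run the tenant furthest below its quota.
quota = {"A": 0.40, "B": 0.35, "C": 0.25}   # target share of GPU hours
used_hours = {"A": 0.0, "B": 0.0, "C": 0.0}

def next_tenant() -> str:
    total = sum(used_hours.values()) or 1.0  # avoid divide-by-zero at start
    return max(quota, key=lambda t: quota[t] - used_hours[t] / total)

for _ in range(20):          # simulate twenty one-hour scheduling slots
    used_hours[next_tenant()] += 1.0

total = sum(used_hours.values())
print({t: round(h / total, 2) for t, h in used_hours.items()})
# Long-run shares converge to the 40/35/25 quotas.
```

Short-term borrowing falls out naturally: an idle tenant's slots go to whoever has the next-largest deficit, and the idle tenant catches up once it returns.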
Float16's Multi-Tenancy Approach
Three-Layer Isolation Model
┌─────────────────────────────────────────────────┐
│                   AaaS Layer                    │
│              (Highest Abstraction)              │
│  • API-only access                              │
│  • No direct GPU access                         │
│  • Request-based billing                        │
│  • Full infrastructure abstraction              │
├─────────────────────────────────────────────────┤
│                   PaaS Layer                    │
│              (Container Isolation)              │
│  • Tenant containers with GPU access            │
│  • Managed Kubernetes                           │
│  • Resource quotas enforced                     │
│  • Shared cluster, isolated namespaces          │
├─────────────────────────────────────────────────┤
│                   IaaS Layer                    │
│                  (VM Isolation)                 │
│  • Dedicated VMs with GPU passthrough           │
│  • Full SSH access                              │
│  • Tenant controls environment                  │
│  • Strongest isolation                          │
└─────────────────────────────────────────────────┘
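The mapping from workload to layer can be expressed as a simple policy lookup. The sketch below just encodes the guidance from this course; the workload labels are illustrative:

```python
# Encodes the layer guidance from this course as a lookup (illustrative).
def pick_layer(workload: str) -> str:
    return {
        "standard-inference": "AaaS",  # API-only, no direct GPU access
        "custom-model":       "PaaS",  # tenant container on managed k8s
        "sensitive-training": "IaaS",  # dedicated VM with GPU passthrough
    }.get(workload, "PaaS")

print(pick_layer("sensitive-training"))  # IaaS
```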
Security Boundaries
┌─────────────────────────────────────────────────┐
│                    Tenant A                     │
│   ┌─────────────────────────────────────┐       │
│   │          Network Namespace          │       │
│   │ • Isolated network                  │       │
│   │ • Private IP range                  │       │
│   └─────────────────────────────────────┘       │
│   ┌─────────────────────────────────────┐       │
│   │          Storage Namespace          │       │
│   │ • Encrypted volumes                 │       │
│   │ • Tenant-specific keys              │       │
│   └─────────────────────────────────────┘       │
│   ┌─────────────────────────────────────┐       │
│   │          Compute Resources          │       │
│   │ • Dedicated or shared GPU           │       │
│   │ • Resource limits enforced          │       │
│   └─────────────────────────────────────┘       │
└─────────────────────────────────────────────────┘
GPU Resource Management
```yaml
# Tenant Resource Configuration
tenant_config:
  name: "acme-corp"
  tier: "pro"
  gpu_quota:
    max_gpus: 8
    max_gpu_hours_per_month: 1000
    allowed_gpu_types:
      - "A100-40GB"
      - "A100-80GB"
      - "H100"
  compute_quota:
    max_vcpus: 64
    max_memory_gb: 256
  storage_quota:
    max_storage_gb: 1000
    max_snapshots: 10
  network:
    private_subnet: "10.100.0.0/24"
    egress_limit_gbps: 10
```
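A sketch of how such a config might be enforced at admission time; the request shape and helper are hypothetical, with field names mirroring the YAML above:

```python
# Field values mirror the YAML above; the request shape is hypothetical.
config = {
    "max_gpus": 8,
    "max_gpu_hours_per_month": 1000,
    "allowed_gpu_types": {"A100-40GB", "A100-80GB", "H100"},
}

def validate(request: dict, used_gpu_hours: float) -> list:
    errors = []
    if request["gpu_type"] not in config["allowed_gpu_types"]:
        errors.append(f"GPU type {request['gpu_type']} not in tenant's tier")
    if request["gpus"] > config["max_gpus"]:
        errors.append(f"{request['gpus']} GPUs requested, quota is {config['max_gpus']}")
    projected = used_gpu_hours + request["gpus"] * request["hours"]
    if projected > config["max_gpu_hours_per_month"]:
        errors.append(f"monthly GPU-hour quota exceeded ({projected:.0f}/1000)")
    return errors

# 700 hours already used; this request would push the tenant to 1,100.
print(validate({"gpu_type": "H100", "gpus": 4, "hours": 100}, used_gpu_hours=700))
```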
Billing and Metering
┌─────────────────────────────────────────────────┐
│            Per-Tenant Usage Tracking            │
├──────────────────────┬─────────────┬────────────┤
│ Metric               │ Unit        │ This Month │
├──────────────────────┼─────────────┼────────────┤
│ GPU Hours (A100)     │ hours       │        450 │
│ GPU Hours (H100)     │ hours       │        120 │
│ Inference Requests   │ thousands   │     15,000 │
│ Storage              │ GB-months   │        500 │
│ Network Egress       │ GB          │        250 │
│ Snapshots            │ count       │          8 │
└──────────────────────┴─────────────┴────────────┘
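Turning metered usage into a bill is an aggregation over usage events. A sketch with made-up rates (not Float16 pricing):

```python
# Aggregate raw usage events into a bill. Rates are made-up placeholders,
# not Float16 pricing.
RATES = {"gpu_hours_a100": 2.00, "gpu_hours_h100": 5.00,
         "storage_gb_month": 0.05, "egress_gb": 0.09}

usage_events = [("gpu_hours_a100", 450), ("gpu_hours_h100", 120),
                ("storage_gb_month", 500), ("egress_gb", 250)]

bill = {}
for metric, qty in usage_events:
    bill[metric] = bill.get(metric, 0.0) + qty * RATES[metric]

for metric, cost in bill.items():
    print(f"{metric:>18}: ${cost:,.2f}")
print(f"{'total':>18}: ${sum(bill.values()):,.2f}")
```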
Best Practices for GPU Multi-Tenancy
For Platform Providers
1. Implement defense in depth
   - Container isolation + network isolation + storage encryption
2. Use hardware isolation when possible
   - MIG for high-security tenants
   - GPU passthrough for dedicated instances
3. Monitor and alert on anomalies
   - Unusual GPU memory access patterns
   - Cross-tenant network traffic
   - Resource quota violations
For Tenants
1. Right-size your GPU needs
   - Don't request an H100 for light inference workloads
   - Use serverless for variable loads
2. Implement checkpointing (see the sketch after this list)
   - Save model state regularly
   - Enable preemption tolerance
3. Use the right isolation level
   - API access (AaaS) for standard inference
   - Containers (PaaS) for custom models
   - VMs (IaaS) for sensitive training
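For the checkpointing practice above, here is a minimal PyTorch sketch (model, optimizer, path, and interval are placeholders). Saving optimizer state alongside the weights is what makes preempt-and-resume lossless:

```python
import torch

def save_checkpoint(model, optimizer, step, path="ckpt.pt"):
    # Persist everything needed to resume exactly where training stopped.
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, path)

def load_checkpoint(model, optimizer, path="ckpt.pt"):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]  # resume the training loop from this step

# Inside a training loop, checkpoint every N steps so a preempted
# job loses at most N steps of work:
#   if step % 500 == 0:
#       save_checkpoint(model, optimizer, step)
```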
Summary
GPU multi-tenancy is more complex than traditional cloud multi-tenancy:
| Aspect | Traditional Cloud | GPU Cloud |
|---|---|---|
| Resource Granularity | Fine-grained | Coarse-grained |
| Sharing Overhead | Low | Higher |
| Isolation Options | Mature | Evolving |
| Cost Efficiency | High | Lower |
| Workload Duration | Seconds-minutes | Hours-days |
Float16's layered approach (AaaS/PaaS/IaaS) provides the right balance of isolation, performance, and cost for different use cases.
Congratulations!
You've completed the Multi-Tenancy course! You now understand:
- How multi-tenancy works in cloud computing
- Different isolation models and their trade-offs
- Security and resource management challenges
- How GPU clouds implement multi-tenancy
Ready to explore more? Check out our Serverless GPU course to learn about modern GPU deployment options.