Multi-Tenancy in GPU Cloud Platforms
GPU workloads present unique multi-tenancy challenges. GPUs are expensive, powerful, and traditionally designed for single-user access. Let's explore how modern GPU cloud platforms handle multi-tenancy.
GPU Multi-Tenancy Challenges
Challenge 1: GPU Scarcity
GPUs are expensive and in high demand:
Traditional Multi-Tenancy:
- Servers: $5,000-$20,000 each
- Can run 10-100 tenants per server
- Cost per tenant: $50-$2,000
GPU Multi-Tenancy:
- GPU Server: $50,000-$500,000 each
- GPUs: $10,000-$40,000 per GPU
- Must maximize utilization to justify cost
Challenge 2: GPU Memory Isolation
GPU memory is harder to virtualize than CPU memory:
CPU Memory:              GPU Memory:
┌─────────────┐          ┌─────────────┐
│  Tenant A   │          │ Shared VRAM │
│ (Isolated)  │          │ (Harder to  │
├─────────────┤          │  isolate)   │
│  Tenant B   │          │             │
│ (Isolated)  │          └─────────────┘
└─────────────┘
Challenge 3: Long-Running Workloads
GPU workloads often run for hours or days:
Web Request: 50-200ms
GPU Training: 2 hours - 2 weeks
Traditional resource sharing doesn't work well
for jobs that hold resources for days.
Challenge 4: All-or-Nothing Execution
Many GPU workloads need the entire GPU:
Traditional Workload:
- "I need 100MB memory" → Get exactly that
GPU Workload:
- "I need 79GB VRAM" (H100 has 80GB)
- Can't easily share remaining 1GB
GPU Sharing Technologies
1. Time-Slicing (Context Switching)
GPUs rapidly switch between tenants:
Timeline:
├─ Tenant A ─┤─ Tenant B ─┤─ Tenant A ─┤─ Tenant B ─┤
     10ms         10ms         10ms         10ms
Pros:
- Simple to implement
- Works on any GPU
Cons:
- Context switch overhead
- All tenants share the same memory
- No guaranteed performance
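To see why the overhead matters, here is a minimal round-robin simulation in Python. The quantum, switch cost, and job lengths are made-up numbers; on real hardware the GPU driver performs this switching, not user code.

```python
from collections import deque

# Hypothetical jobs: (tenant, milliseconds of GPU work remaining).
jobs = deque([("A", 25), ("B", 40), ("C", 15)])
QUANTUM_MS = 10        # time slice each tenant gets per turn
CONTEXT_SWITCH_MS = 1  # overhead paid on every switch

clock = 0
while jobs:
    tenant, remaining = jobs.popleft()
    ran = min(QUANTUM_MS, remaining)
    clock += ran + CONTEXT_SWITCH_MS  # switch cost is pure waste
    remaining -= ran
    print(f"t={clock:3d}ms  ran tenant {tenant} for {ran}ms, {remaining}ms left")
    if remaining > 0:
        jobs.append((tenant, remaining))  # back of the queue
```

Every pass through the queue burns one switch's worth of time per tenant, which is why fine-grained slicing costs throughput.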
2. NVIDIA MPS (Multi-Process Service)
Multiple processes share GPU simultaneously:
┌───────────────────────────────────────┐
│                  GPU                  │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐│
│ │Process 1 │ │Process 2 │ │Process 3 ││
│ │(Tenant A)│ │(Tenant B)│ │(Tenant C)││
│ └──────────┘ └──────────┘ └──────────┘│
│              MPS Server               │
└───────────────────────────────────────┘
Pros:
- Better utilization
- Lower overhead than time-slicing
Cons:
- Limited memory isolation (pre-Volta clients share one GPU address space)
- A fatal fault in one client can affect the others
- Clients must run compatible CUDA versions
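One practical knob: MPS lets operators cap each client's share of the GPU's execution resources via the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable. A launcher sketch, where worker.py and the per-tenant percentages are hypothetical:

```python
import os
import subprocess

# Hypothetical per-tenant caps on SM occupancy (must be coordinated out
# of band; MPS does not arbitrate quotas between clients on its own).
tenants = {"tenant-a": 50, "tenant-b": 30, "tenant-c": 20}

procs = []
for name, pct in tenants.items():
    env = os.environ.copy()
    # Real MPS variable: limits the fraction of SMs this client may use.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(pct)
    # worker.py stands in for the tenant's CUDA workload.
    procs.append(subprocess.Popen(["python", "worker.py", name], env=env))

for p in procs:
    p.wait()
```

Note that this caps compute occupancy only; memory over-allocation still has to be policed separately.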
3. NVIDIA MIG (Multi-Instance GPU)
Hardware-level GPU partitioning (A100, H100):
┌─────────────────────────────────────────────────┐
│                NVIDIA A100 (80GB)               │
├─────────────┬─────────────┬─────────────────────┤
│    MIG 1    │    MIG 2    │        MIG 3        │
│  (1g.10gb)  │  (2g.20gb)  │      (4g.40gb)      │
│  Tenant A   │  Tenant B   │      Tenant C       │
│             │             │                     │
│  Isolated   │  Isolated   │      Isolated       │
│  Memory     │  Memory     │      Memory         │
│  Compute    │  Compute    │      Compute        │
└─────────────┴─────────────┴─────────────────────┘
Pros:
- Hardware isolation
- Guaranteed resources
- Fault isolation
Cons:
- Only high-end GPUs (A100, H100, A30)
- Fixed partition sizes
- Reduced flexibility
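Because each MIG profile consumes a fixed number of the GPU's seven compute slices, capacity planning becomes a packing problem. A simplified sketch for the A100 layout in the diagram above (slice counts only; real placement also has position constraints this ignores):

```python
# Compute slices consumed by common A100-80GB MIG profiles.
SLICES = {"1g.10gb": 1, "2g.20gb": 2, "3g.40gb": 3, "4g.40gb": 4, "7g.80gb": 7}
TOTAL_SLICES = 7  # an A100 exposes seven compute slices

def fits(requested: list[str]) -> bool:
    """Slice-count check only; real MIG placement has extra constraints."""
    return sum(SLICES[p] for p in requested) <= TOTAL_SLICES

# The layout from the diagram above: 1 + 2 + 4 slices fill the GPU.
print(fits(["1g.10gb", "2g.20gb", "4g.40gb"]))  # True
print(fits(["4g.40gb", "4g.40gb"]))             # False: 8 slices > 7
```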
4. vGPU (Virtual GPU)
Software-based GPU virtualization:
┌─────────────────────────────────────────────────┐
│                       VMs                       │
│    ┌─────────┐    ┌─────────┐    ┌─────────┐    │
│    │  VM 1   │    │  VM 2   │    │  VM 3   │    │
│    │ vGPU A  │    │ vGPU B  │    │ vGPU C  │    │
│    └─────────┘    └─────────┘    └─────────┘    │
├─────────────────────────────────────────────────┤
│    vGPU Manager (NVIDIA vGPU, formerly GRID)    │
├─────────────────────────────────────────────────┤
│                  Physical GPU                   │
└─────────────────────────────────────────────────┘
Pros:
- Strong isolation (VM boundary)
- Works with existing VM infrastructure
- Quality of Service controls
Cons:
- Licensing costs
- 5-15% performance overhead
- Complex setup
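Capacity planning for vGPU is simpler: profiles slice the framebuffer into equal fixed-size pieces, so it reduces to integer division. A tiny illustrative sketch (the card and profile sizes are examples, not a specific SKU's limits):

```python
def vgpu_capacity(physical_fb_gb: int, profile_fb_gb: int) -> int:
    """How many vGPUs of one framebuffer profile fit on a single card."""
    return physical_fb_gb // profile_fb_gb

# Illustrative numbers: a 48 GB card sliced into equal profiles.
print(vgpu_capacity(48, 8))   # 6 VMs at 8 GB each
print(vgpu_capacity(48, 12))  # 4 VMs at 12 GB each
```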
GPU Scheduling Strategies
Job Queue Based
┌─────────────────────────────────────────────────┐
│                  GPU Job Queue                  │
│                                                 │
│   [Train-A] [Inference-B] [Train-C] [Eval-D]    │
│                        ↓                        │
│   ┌─────────────────────────────────────────┐   │
│   │                Scheduler                │   │
│   │ • Assign jobs to available GPUs         │   │
│   │ • Respect tenant quotas                 │   │
│   │ • Optimize utilization                  │   │
│   └─────────────────────────────────────────┘   │
│      ↓         ↓         ↓         ↓            │
│   [GPU 1]   [GPU 2]   [GPU 3]   [GPU 4]         │
└─────────────────────────────────────────────────┘
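A minimal sketch of that admission loop, with hypothetical jobs and quotas; production schedulers add priorities, gang scheduling, and topology-aware placement:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    tenant: str
    gpus_needed: int

# Hypothetical cluster: 4 free GPUs, per-tenant cap on concurrent GPUs.
free_gpus = 4
quota = {"A": 2, "B": 2}
in_use = {"A": 0, "B": 0}
queue = [Job("Train-A", "A", 2), Job("Inference-B", "B", 1),
         Job("Train-C", "A", 1), Job("Eval-D", "B", 1)]

scheduled, waiting = [], []
for job in queue:
    # Admit only if GPUs are free AND the tenant stays within its quota.
    within_quota = in_use[job.tenant] + job.gpus_needed <= quota[job.tenant]
    if job.gpus_needed <= free_gpus and within_quota:
        free_gpus -= job.gpus_needed
        in_use[job.tenant] += job.gpus_needed
        scheduled.append(job.name)
    else:
        waiting.append(job.name)

print("scheduled:", scheduled)  # Train-A, Inference-B, Eval-D
print("waiting:  ", waiting)    # Train-C (tenant A is at its quota)
```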
Preemption
Allow high-priority jobs to interrupt lower-priority ones:
Before Preemption:
GPU 1: [Low Priority Job - 6 hours remaining]
High Priority Job Arrives:
GPU 1: [Low Priority] → CHECKPOINT → STOP
GPU 1: [High Priority Job] → RUNNING
[Low Priority] → QUEUED (resume later)
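A sketch of the checkpoint-then-preempt handoff, assuming jobs expose a checkpoint hook (class and method names are illustrative):

```python
class Job:
    def __init__(self, name: str, priority: int):
        self.name, self.priority = name, priority
        self.checkpoint = None

    def save_checkpoint(self) -> None:
        # Placeholder: a real job would persist model/optimizer state here.
        self.checkpoint = f"{self.name}@step-12345"

def preempt_if_needed(running: Job, incoming: Job, queue: list) -> Job:
    """Return whichever job should occupy the GPU after arbitration."""
    if incoming.priority > running.priority:
        running.save_checkpoint()  # CHECKPOINT before losing the GPU
        queue.append(running)      # QUEUED: resumes from checkpoint later
        return incoming            # high-priority job is now RUNNING
    queue.append(incoming)
    return running

queue = []
on_gpu = Job("low-prio-train", priority=1)
on_gpu = preempt_if_needed(on_gpu, Job("urgent-inference", priority=10), queue)
print(on_gpu.name, "running;", [j.name for j in queue], "queued")
```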
Fair Share Scheduling
Tenant Quotas:
- Tenant A: 40% of GPU hours
- Tenant B: 35% of GPU hours
- Tenant C: 25% of GPU hours
Scheduler ensures:
- Long-term usage matches quotas
- Short-term borrowing allowed
- No tenant starves
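One simple way to implement this is deficit-based selection: always run the tenant whose observed share is furthest below its quota. A sketch using the quotas above:

```python
# Deficit-based fair share: run the tenant furthest below its quota.
quota = {"A": 0.40, "B": 0.35, "C": 0.25}   # target share of GPU hours
used_hours = {"A": 0.0, "B": 0.0, "C": 0.0}

def next_tenant() -> str:
    total = sum(used_hours.values()) or 1.0  # avoid divide-by-zero at start
    return max(quota, key=lambda t: quota[t] - used_hours[t] / total)

for _ in range(20):          # simulate twenty one-hour scheduling slots
    used_hours[next_tenant()] += 1.0

total = sum(used_hours.values())
print({t: round(h / total, 2) for t, h in used_hours.items()})
# Long-run shares converge to the 40/35/25 quotas.
```

Short-term borrowing falls out naturally: an idle tenant's slots go to whoever has the next-largest deficit, and the idle tenant catches up once it returns.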
Float16's Multi-Tenancy Approach
Three-Layer Isolation Model
┌─────────────────────────────────────────────────┐
│                   AaaS Layer                    │
│              (Highest Abstraction)              │
│  • API-only access                              │
│  • No direct GPU access                         │
│  • Request-based billing                        │
│  • Full infrastructure abstraction              │
├─────────────────────────────────────────────────┤
│                   PaaS Layer                    │
│              (Container Isolation)              │
│  • Tenant containers with GPU access            │
│  • Managed Kubernetes                           │
│  • Resource quotas enforced                     │
│  • Shared cluster, isolated namespaces          │
├─────────────────────────────────────────────────┤
│                   IaaS Layer                    │
│                  (VM Isolation)                 │
│  • Dedicated VMs with GPU passthrough           │
│  • Full SSH access                              │
│  • Tenant controls environment                  │
│  • Strongest isolation                          │
└─────────────────────────────────────────────────┘
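The mapping from workload to layer can be expressed as a simple policy lookup. The sketch below just encodes the guidance from this course; the workload labels are illustrative:

```python
# Encodes the layer guidance from this course as a lookup (illustrative).
def pick_layer(workload: str) -> str:
    return {
        "standard-inference": "AaaS",  # API-only, no direct GPU access
        "custom-model":       "PaaS",  # tenant container on managed k8s
        "sensitive-training": "IaaS",  # dedicated VM with GPU passthrough
    }.get(workload, "PaaS")

print(pick_layer("sensitive-training"))  # IaaS
```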
Security Boundaries
┌─────────────────────────────────────────────────┐
│                    Tenant A                     │
│   ┌─────────────────────────────────────┐       │
│   │          Network Namespace          │       │
│   │ • Isolated network                  │       │
│   │ • Private IP range                  │       │
│   └─────────────────────────────────────┘       │
│   ┌─────────────────────────────────────┐       │
│   │          Storage Namespace          │       │
│   │ • Encrypted volumes                 │       │
│   │ • Tenant-specific keys              │       │
│   └─────────────────────────────────────┘       │
│   ┌─────────────────────────────────────┐       │
│   │          Compute Resources          │       │
│   │ • Dedicated or shared GPU           │       │
│   │ • Resource limits enforced          │       │
│   └─────────────────────────────────────┘       │
└─────────────────────────────────────────────────┘
GPU Resource Management
```yaml
# Tenant Resource Configuration
tenant_config:
  name: "acme-corp"
  tier: "pro"
  gpu_quota:
    max_gpus: 8
    max_gpu_hours_per_month: 1000
    allowed_gpu_types:
      - "A100-40GB"
      - "A100-80GB"
      - "H100"
  compute_quota:
    max_vcpus: 64
    max_memory_gb: 256
  storage_quota:
    max_storage_gb: 1000
    max_snapshots: 10
  network:
    private_subnet: "10.100.0.0/24"
    egress_limit_gbps: 10
```
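A sketch of how such a config might be enforced at admission time; the request shape and helper are hypothetical, with field names mirroring the YAML above:

```python
# Field values mirror the YAML above; the request shape is hypothetical.
config = {
    "max_gpus": 8,
    "max_gpu_hours_per_month": 1000,
    "allowed_gpu_types": {"A100-40GB", "A100-80GB", "H100"},
}

def validate(request: dict, used_gpu_hours: float) -> list:
    errors = []
    if request["gpu_type"] not in config["allowed_gpu_types"]:
        errors.append(f"GPU type {request['gpu_type']} not in tenant's tier")
    if request["gpus"] > config["max_gpus"]:
        errors.append(f"{request['gpus']} GPUs requested, quota is {config['max_gpus']}")
    projected = used_gpu_hours + request["gpus"] * request["hours"]
    if projected > config["max_gpu_hours_per_month"]:
        errors.append(f"monthly GPU-hour quota exceeded ({projected:.0f}/1000)")
    return errors

# 700 hours already used; this request would push the tenant to 1,100.
print(validate({"gpu_type": "H100", "gpus": 4, "hours": 100}, used_gpu_hours=700))
```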
Billing and Metering
┌─────────────────────────────────────────────────┐
│            Per-Tenant Usage Tracking            │
├──────────────────────┬─────────────┬────────────┤
│ Metric               │ Unit        │ This Month │
├──────────────────────┼─────────────┼────────────┤
│ GPU Hours (A100)     │ hours       │        450 │
│ GPU Hours (H100)     │ hours       │        120 │
│ Inference Requests   │ thousands   │     15,000 │
│ Storage              │ GB-months   │        500 │
│ Network Egress       │ GB          │        250 │
│ Snapshots            │ count       │          8 │
└──────────────────────┴─────────────┴────────────┘
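Turning metered usage into a bill is an aggregation over usage events. A sketch with made-up rates (not Float16 pricing):

```python
# Aggregate raw usage events into a bill. Rates are made-up placeholders,
# not Float16 pricing.
RATES = {"gpu_hours_a100": 2.00, "gpu_hours_h100": 5.00,
         "storage_gb_month": 0.05, "egress_gb": 0.09}

usage_events = [("gpu_hours_a100", 450), ("gpu_hours_h100", 120),
                ("storage_gb_month", 500), ("egress_gb", 250)]

bill = {}
for metric, qty in usage_events:
    bill[metric] = bill.get(metric, 0.0) + qty * RATES[metric]

for metric, cost in bill.items():
    print(f"{metric:>18}: ${cost:,.2f}")
print(f"{'total':>18}: ${sum(bill.values()):,.2f}")
```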
Best Practices for GPU Multi-Tenancy
For Platform Providers
1. Implement defense in depth
   - Container isolation + network isolation + storage encryption
2. Use hardware isolation when possible
   - MIG for high-security tenants
   - GPU passthrough for dedicated instances
3. Monitor and alert on anomalies
   - Unusual GPU memory access patterns
   - Cross-tenant network traffic
   - Resource quota violations
For Tenants
1. Right-size your GPU needs
   - Don't request an H100 for light inference workloads
   - Use serverless for variable loads
2. Implement checkpointing (see the sketch after this list)
   - Save model state regularly
   - Enable preemption tolerance
3. Use the right isolation level
   - API access (AaaS) for standard inference
   - Containers (PaaS) for custom models
   - VMs (IaaS) for sensitive training
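For the checkpointing practice above, here is a minimal PyTorch sketch (model, optimizer, path, and interval are placeholders). Saving optimizer state alongside the weights is what makes preempt-and-resume lossless:

```python
import torch

def save_checkpoint(model, optimizer, step, path="ckpt.pt"):
    # Persist everything needed to resume exactly where training stopped.
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, path)

def load_checkpoint(model, optimizer, path="ckpt.pt"):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]  # resume the training loop from this step

# Inside a training loop, checkpoint every N steps so a preempted
# job loses at most N steps of work:
#   if step % 500 == 0:
#       save_checkpoint(model, optimizer, step)
```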
Summary
GPU multi-tenancy is more complex than traditional cloud multi-tenancy:
| Aspect | Traditional Cloud | GPU Cloud |
|---|---|---|
| Resource Granularity | Fine-grained | Coarse-grained |
| Sharing Overhead | Low | Higher |
| Isolation Options | Mature | Evolving |
| Cost Efficiency | High | Lower |
| Workload Duration | Seconds-minutes | Hours-days |
Float16's layered approach (AaaS/PaaS/IaaS) provides the right balance of isolation, performance, and cost for different use cases.
Congratulations!
You've completed the Multi-Tenancy course! You now understand:
- How multi-tenancy works in cloud computing
- Different isolation models and their trade-offs
- Security and resource management challenges
- How GPU clouds implement multi-tenancy
Ready to explore more? Check out our Serverless GPU course to learn about modern GPU deployment options.