Chapter 4: Resource Management

Resource Management in Multi-Tenant Systems

Understanding how to fairly allocate resources, prevent noisy neighbors, and ensure quality of service in multi-tenant environments.

When multiple tenants share the same underlying hardware, how do you ensure fairness and keep one tenant from degrading service for the rest? This chapter covers the core strategies: limits and quotas, fair scheduling, resource isolation, and cost allocation.

The Noisy Neighbor Problem

┌─────────────────────────────────────────────┐
│              Shared Server                  │
│                                             │
│  Tenant A: Running normal workload (20%)    │
│  Tenant B: Running normal workload (20%)    │
│  Tenant C: HEAVY BATCH JOB (95%) ← Problem! │
│                                             │
│  Result: Tenants A and B experience         │
│  degraded performance                       │
└─────────────────────────────────────────────┘

This is the "noisy neighbor" problem: one tenant's excessive resource consumption degrades performance for everyone else on the shared hardware.

Resources to Manage

┌───────────┬───────────────────────────────────┬───────────────────────────┐
│  Resource │ Challenge                         │ Impact                    │
├───────────┼───────────────────────────────────┼───────────────────────────┤
│  CPU      │ Compute-intensive tasks           │ Slow response times       │
│  Memory   │ Large data processing             │ Out-of-memory errors      │
│  Disk I/O │ Database queries, file operations │ High latency              │
│  Network  │ API calls, data transfer          │ Timeouts                  │
│  GPU      │ AI/ML workloads                   │ Training/inference delays │
│  Storage  │ Data growth                       │ Quota exceeded            │
└───────────┴───────────────────────────────────┴───────────────────────────┘

Resource Limits and Quotas

Hard Limits

Hard limits are ceilings a tenant cannot exceed:

Tenant Configuration:
  cpu_limit: 4 cores
  memory_limit: 16GB
  storage_limit: 100GB
  gpu_limit: 1 GPU
  api_rate_limit: 1000 req/min

When a hard limit is reached:

  • Requests may be rejected (429 Too Many Requests)
  • Jobs may be queued
  • Resources may be throttled
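
A minimal admission check captures the idea; this is a sketch, not a production quota system. The limit values mirror the configuration above, and the names are illustrative:

HARD_LIMITS = {"cpu_cores": 4, "memory_gb": 16, "storage_gb": 100,
               "gpus": 1, "api_req_per_min": 1000}

def within_quota(usage, resource, requested):
    """Return True only if the request still fits under the hard limit."""
    return usage.get(resource, 0) + requested <= HARD_LIMITS[resource]

# A tenant already using 3 cores cannot get 2 more (3 + 2 > 4)
print(within_quota({"cpu_cores": 3}, "cpu_cores", 2))  # False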

Soft Limits (Burst Capacity)

A tenant may temporarily exceed its guarantee when spare capacity is available:

┌──────────────────────────────────────┐
│                                      │
│  ████████████████ Guaranteed (2 CPU) │
│  ░░░░░░░░░░░░░░░░ Burst cap (4 CPU)  │
│                                      │
│  Normal: Uses 2 CPU (guaranteed)     │
│  Burst:  Can use up to 4 CPU when    │
│          other tenants are idle      │
└──────────────────────────────────────┘
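
A sketch of the burst decision, assuming the scheduler can see how much capacity is currently idle (the function and parameter names are illustrative):

def grant_cpu(requested, guaranteed, burst_cap, idle_capacity):
    """Always grant up to the guarantee; go beyond it only from idle capacity."""
    grant = min(requested, guaranteed)
    spare = min(idle_capacity, burst_cap - grant)
    if requested > grant:
        grant += min(requested - grant, max(spare, 0))
    return grant

# Mirrors the diagram: 2 CPU guaranteed, burst cap of 4
print(grant_cpu(requested=4, guaranteed=2, burst_cap=4, idle_capacity=3))  # 4
print(grant_cpu(requested=4, guaranteed=2, burst_cap=4, idle_capacity=0))  # 2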

CPU Management

CPU Shares (Relative Priority)

Tenant A: 1000 shares
Tenant B: 500 shares
Tenant C: 500 shares

When competing for CPU:
- Tenant A gets 50% (1000/2000)
- Tenant B gets 25% (500/2000)
- Tenant C gets 25% (500/2000)

When not competing:
- Any tenant can use available CPU
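
The proportional math is simple enough to sketch; the share values mirror the example above:

def effective_share(shares):
    """Each tenant's CPU fraction when everyone is competing."""
    total = sum(shares.values())
    return {tenant: s / total for tenant, s in shares.items()}

# Mirrors the example: A gets 50%, B and C get 25% each
print(effective_share({"A": 1000, "B": 500, "C": 500}))
# {'A': 0.5, 'B': 0.25, 'C': 0.25}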

CPU Throttling

Tenant A demands 8 CPU cores' worth of work,
but is allocated only 4 cores.

Result:
┌─────────────────────────────────────┐
│  Time    │ Tenant A's Experience    │
├──────────┼──────────────────────────┤
│ 0-50ms   │ Running (uses 4 cores)   │
│ 50-100ms │ Throttled (waiting)      │
│100-150ms │ Running (uses 4 cores)   │
│150-200ms │ Throttled (waiting)      │
└──────────┴──────────────────────────┘

The net effect: execution appears about 50% slower, because the job spends half of each period throttled.
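
A back-of-the-envelope model of that slowdown, assuming a fully parallel, CPU-bound job (the function is illustrative, not a kernel interface):

def throttled_duration(base_ms, demanded_cores, quota_cores):
    """Rough wall-clock estimate when CFS caps a parallel job below its demand."""
    slowdown = max(1.0, demanded_cores / quota_cores)
    return base_ms * slowdown

# The example above: demands 8 cores but is capped at 4 -> takes twice as long
print(throttled_duration(100, demanded_cores=8, quota_cores=4))  # 200.0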

Memory Management

Memory Limits

# Kubernetes Pod example
resources:
  requests:
    memory: "4Gi"    # Guaranteed minimum
  limits:
    memory: "8Gi"    # Maximum allowed

Out of Memory Handling

When the memory limit is exceeded, there are three common responses:

Option 1: OOM Kill
- Process killed immediately
- Pod restarted

Option 2: Swap
- Data moved to disk
- Performance degrades significantly

Option 3: Reject
- New memory allocations fail
- Application handles gracefully
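
A sketch of Option 3 in Python: the application catches a failed allocation and sheds work instead of crashing. One caveat: under a hard cgroup limit, the kernel's OOM killer typically terminates the process before the runtime ever raises MemoryError, so this pattern mainly helps with self-imposed, in-process limits.

import logging

def load_dataset(path):
    """Load a file into memory, shedding the work if allocation fails."""
    try:
        with open(path, "rb") as f:
            return f.read().splitlines()
    except MemoryError:
        # Degrade gracefully instead of letting the whole process die
        logging.warning("memory limit reached; deferring %s", path)
        return None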

Rate Limiting

Preventing API abuse:

Token Bucket Algorithm

┌──────────────────────────────────────┐
│          Token Bucket                │
│                                      │
│  Capacity: 100 tokens                │
│  Refill Rate: 10 tokens/second       │
│                                      │
│  Each request consumes 1 token       │
│  If bucket empty → 429 Rate Limited  │
└──────────────────────────────────────┘
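
A minimal in-process version of the algorithm, using the same numbers as the diagram. In a real service, the bucket state usually lives in a shared store such as Redis, with one bucket per tenant:

import time

class TokenBucket:
    """Token bucket: refills at a steady rate; each request spends tokens."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity          # maximum tokens (burst size)
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity            # start full
        self.last_refill = time.monotonic()

    def allow(self, cost=1.0):
        """Spend `cost` tokens if available; otherwise signal a 429."""
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False   # bucket empty -> caller returns 429

# Matches the diagram: capacity 100, refilling at 10 tokens/second
bucket = TokenBucket(capacity=100, refill_rate=10)
if not bucket.allow():
    print("429 Too Many Requests")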

Tiered Rate Limits

┌─────────────────────────────────────────────┐
│  Plan      │ Requests/min │ Requests/day   │
├────────────┼──────────────┼────────────────┤
│  Free      │     60       │    1,000       │
│  Pro       │    600       │   100,000      │
│  Enterprise│   6,000      │ Unlimited      │
└─────────────────────────────────────────────┘

Rate Limit Response

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1699999999

{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Please try again in 30 seconds."
}
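
On the client side, a well-behaved consumer honors these headers. A sketch using the third-party requests library, falling back to exponential backoff when Retry-After is missing:

import time
import requests  # third-party: pip install requests

def get_with_backoff(url, max_retries=5):
    """Retry a GET, waiting out any 429 responses before giving up."""
    for attempt in range(max_retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Honor Retry-After when present; otherwise back off exponentially
        wait = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError(f"still rate limited after {max_retries} attempts")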

Fair Scheduling

Priority Queues

┌─────────────────────────────────────────────┐
│              Job Scheduler                  │
│                                             │
│  High Priority Queue:   [J1] [J2]           │
│  Medium Priority Queue: [J3] [J4] [J5]      │
│  Low Priority Queue:    [J6] [J7] [J8] [J9] │
│                                             │
│  Execution order: J1, J2, J3, J4, J5, ...   │
└─────────────────────────────────────────────┘
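
A sketch of strict priority scheduling with Python's heapq (job names mirror the diagram). One caveat: under sustained high-priority load, a strict scheme can starve the low-priority queue entirely, which is the problem weighted fair queuing addresses next.

import heapq
import itertools

HIGH, MEDIUM, LOW = 0, 1, 2   # lower number = served first
_seq = itertools.count()      # tie-breaker keeps FIFO order within a tier
queue = []

def submit(job, priority):
    heapq.heappush(queue, (priority, next(_seq), job))

def next_job():
    return heapq.heappop(queue)[2]

submit("J3", MEDIUM); submit("J1", HIGH); submit("J6", LOW); submit("J2", HIGH)
print(next_job(), next_job(), next_job())  # J1 J2 J3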

Weighted Fair Queuing

Enterprise Tenant: Weight 4
Pro Tenant: Weight 2
Free Tenant: Weight 1

For every 7 time slots:
- Enterprise gets 4
- Pro gets 2
- Free gets 1
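
Weighted round-robin is a simple approximation of weighted fair queuing; a sketch with the weights from the example (tenant and job names are illustrative):

from collections import deque

def weighted_round_robin(queues, weights):
    """Yield jobs so each tenant gets slots in proportion to its weight."""
    while any(queues.values()):                # run until every queue drains
        for tenant, weight in weights.items():
            for _ in range(weight):            # weight = slots per pass
                if queues[tenant]:
                    yield queues[tenant].popleft()

queues = {"enterprise": deque(["E1", "E2", "E3", "E4", "E5"]),
          "pro":        deque(["P1", "P2", "P3"]),
          "free":       deque(["F1", "F2"])}
weights = {"enterprise": 4, "pro": 2, "free": 1}

print(list(weighted_round_robin(queues, weights)))
# ['E1', 'E2', 'E3', 'E4', 'P1', 'P2', 'F1', 'E5', 'P3', 'F2']

Each full pass hands out slots in the 4:2:1 ratio; once a tenant's queue drains, its slots are simply skipped and the remaining tenants finish sooner.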

Quality of Service (QoS)

SLA Tiers

┌─────────────────────────────────────────────┐
│              SLA Definitions                │
├─────────────────────────────────────────────┤
│  Tier      │ Availability │ Latency P99     │
├────────────┼──────────────┼─────────────────┤
│  Premium   │   99.99%     │    50ms         │
│  Standard  │   99.9%      │   200ms         │
│  Basic     │   99.5%      │   500ms         │
└─────────────────────────────────────────────┘

Implementing QoS

# Route premium tenants to better resources
def route_request(tenant):
    """Pick a server pool based on the tenant's plan tier."""
    if tenant.tier == 'premium':
        return premium_server_pool    # dedicated, low-latency hosts
    elif tenant.tier == 'standard':
        return standard_server_pool   # capacity-managed shared hosts
    else:
        return shared_server_pool     # best-effort pool for free tiers

Monitoring and Alerting

Per-Tenant Metrics

Track for each tenant:

┌─────────────────────────────────────────────┐
│  Metric                │ Value    │ Limit   │
├────────────────────────┼──────────┼─────────┤
│  CPU Usage             │ 45%      │ 100%    │
│  Memory Usage          │ 12GB     │ 16GB    │
│  Storage Used          │ 78GB     │ 100GB   │
│  API Requests/min      │ 450      │ 1000    │
│  Bandwidth Used        │ 2.3TB    │ 5TB     │
└─────────────────────────────────────────────┘
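
A sketch of the percent-of-limit calculation that feeds such a dashboard (the metric names are illustrative):

def utilization(usage, limits):
    """Percent of each limit consumed, for dashboards and alert rules."""
    return {metric: 100 * usage[metric] / limits[metric] for metric in usage}

print(utilization({"storage_gb": 78, "api_per_min": 450},
                  {"storage_gb": 100, "api_per_min": 1000}))
# {'storage_gb': 78.0, 'api_per_min': 45.0}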

Alert Thresholds

alerts:
  - name: tenant_cpu_high
    condition: cpu_usage > 80%
    duration: 5m
    action: notify_tenant

  - name: tenant_storage_critical
    condition: storage_usage > 90%
    action: notify_tenant_and_ops

  - name: tenant_rate_limited
    condition: rate_limit_hit > 100/hour
    action: suggest_upgrade

Cost Allocation

Usage-Based Billing

Tenant A Monthly Bill:
┌─────────────────────────────────────────────┐
│  Resource        │ Usage    │ Cost          │
├──────────────────┼──────────┼───────────────┤
│  Compute Hours   │ 720h     │ $144.00       │
│  Storage         │ 50GB     │ $5.00         │
│  Network Egress  │ 100GB    │ $8.00         │
│  API Calls       │ 1M       │ $1.00         │
├──────────────────┼──────────┼───────────────┤
│  Total           │          │ $158.00       │
└─────────────────────────────────────────────┘
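
The arithmetic is metered usage multiplied by a unit price, summed across line items. A sketch using the unit prices implied by the table (the prices themselves are illustrative):

# Unit prices implied by the bill above (illustrative)
PRICES = {"compute_hours": 0.20, "storage_gb": 0.10,
          "egress_gb": 0.08, "api_calls_millions": 1.00}

def monthly_bill(usage):
    """Multiply metered usage by its unit price and sum the line items."""
    return sum(amount * PRICES[item] for item, amount in usage.items())

usage_a = {"compute_hours": 720, "storage_gb": 50,
           "egress_gb": 100, "api_calls_millions": 1}
print(f"${monthly_bill(usage_a):.2f}")  # $158.00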

Reserved vs On-Demand

Reserved Pricing:
- Pay upfront for guaranteed resources
- Lower per-unit cost
- Use it or lose it

On-Demand Pricing:
- Pay for what you use
- Higher per-unit cost
- Flexible scaling
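
The choice usually reduces to a break-even calculation; a sketch with illustrative prices:

def cheaper_plan(hours_used, reserved_monthly, on_demand_hourly):
    """Compare a flat reserved fee with pay-per-hour pricing."""
    on_demand_cost = hours_used * on_demand_hourly
    return "reserved" if reserved_monthly < on_demand_cost else "on-demand"

# Illustrative: $100/month reserved vs $0.25/hour on demand
print(cheaper_plan(500, reserved_monthly=100, on_demand_hourly=0.25))  # reserved
print(cheaper_plan(300, reserved_monthly=100, on_demand_hourly=0.25))  # on-demand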

Resource Isolation Techniques

Linux cgroups

Kernel-level resource isolation:

# Create a cgroup for the tenant (cgroup v1 shown; v2 uses cpu.max / memory.max)
cgcreate -g cpu,memory:tenant_123

# Cap CPU at 50% of one core (50000us quota per default 100000us period)
cgset -r cpu.cfs_quota_us=50000 tenant_123

# Cap memory at 4GB
cgset -r memory.limit_in_bytes=4G tenant_123

# Run the tenant's process inside the cgroup
cgexec -g cpu,memory:tenant_123 /app/process

Network Bandwidth Limits

# Using tc (traffic control)
# Attach a hierarchical token bucket (HTB) qdisc to the interface
tc qdisc add dev eth0 root handle 1: htb

# Create a class capped at 100 Mbit/s
tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit

# Steer traffic from the tenant's subnet into that class
tc filter add dev eth0 parent 1: protocol ip prio 1 \
   u32 match ip src 10.0.1.0/24 flowid 1:1

What's Next?

In the final chapter, we'll explore how Float16 implements multi-tenancy for GPU workloads - the unique challenges and solutions for AI/ML infrastructure.