# Resource Management in Multi-Tenant Systems
When multiple tenants share resources, how do you ensure fairness and prevent one tenant from impacting others? This chapter covers resource management strategies.
## The Noisy Neighbor Problem

```
┌─────────────────────────────────────────────┐
│                Shared Server                │
│                                             │
│ Tenant A: Running normal workload (20%)     │
│ Tenant B: Running normal workload (20%)     │
│ Tenant C: HEAVY BATCH JOB (95%) ← Problem!  │
│                                             │
│ Result: Tenants A and B experience          │
│         degraded performance                │
└─────────────────────────────────────────────┘
```
This is the "noisy neighbor" problem: one tenant's excessive resource consumption degrades performance for everyone else sharing the same hardware.
## Resources to Manage
| Resource | Challenge | Impact |
|---|---|---|
| CPU | Compute-intensive tasks | Slow response times |
| Memory | Large data processing | Out-of-memory errors |
| Disk I/O | Database queries, file operations | High latency |
| Network | API calls, data transfer | Timeouts |
| GPU | AI/ML workloads | Training/inference delays |
| Storage | Data growth | Quota exceeded |
## Resource Limits and Quotas

### Hard Limits

A tenant cannot exceed these limits:

```
Tenant Configuration:
  cpu_limit: 4 cores
  memory_limit: 16GB
  storage_limit: 100GB
  gpu_limit: 1 GPU
  api_rate_limit: 1000 req/min
```

When a limit is reached:
- Requests may be rejected (429 Too Many Requests)
- Jobs may be queued
- Resources may be throttled
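As a sketch, enforcement at the API layer can look like the following (the `TenantLimits` fields and the `check_request` helper are illustrative names mirroring the configuration above, not a real library API):

```python
from dataclasses import dataclass

@dataclass
class TenantLimits:
    # Hypothetical per-tenant hard limits, mirroring the config above
    cpu_cores: int = 4
    memory_gb: int = 16
    storage_gb: int = 100
    api_req_per_min: int = 1000

def check_request(requests_this_minute: int, limits: TenantLimits):
    """Return (allowed, status, message) for an incoming API request."""
    if requests_this_minute >= limits.api_req_per_min:
        # Hard limit reached: reject with 429 Too Many Requests
        return (False, 429, "rate limit exceeded")
    return (True, 200, "ok")
```

The same check-and-reject shape applies to the other resources; only the counter being compared changes.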
### Soft Limits (Burst Capacity)

A tenant can temporarily exceed its guarantee when spare capacity is available:

```
┌──────────────────────────────────────┐
│                                      │
│ ████████████████ Guaranteed (2 CPU)  │
│ ░░░░░░░░░░░░░░░░ Burst cap (4 CPU)   │
│                                      │
│ Normal: Uses 2 CPU (guaranteed)      │
│ Burst:  Can use up to 4 CPU if       │
│         other tenants aren't using   │
│         theirs                       │
└──────────────────────────────────────┘
```
## CPU Management

### CPU Shares (Relative Priority)

```
Tenant A: 1000 shares
Tenant B:  500 shares
Tenant C:  500 shares
```

When competing for CPU:
- Tenant A gets 50% (1000/2000)
- Tenant B gets 25% (500/2000)
- Tenant C gets 25% (500/2000)

When not competing:
- Any tenant can use available CPU
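The proportional-share arithmetic can be expressed directly. A minimal sketch (`cpu_fractions` is an illustrative helper, not a scheduler API), matching how cgroup `cpu.shares` / `cpu.weight` behave under full contention:

```python
def cpu_fractions(shares):
    """Under full contention, each tenant's CPU fraction is its share
    count divided by the total shares across all runnable tenants."""
    total = sum(shares.values())
    return {tenant: s / total for tenant, s in shares.items()}

# The example from the text: A=1000, B=500, C=500
fractions = cpu_fractions({"A": 1000, "B": 500, "C": 500})
# A → 0.50, B → 0.25, C → 0.25
```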
CPU Throttling
Tenant A requests 8 CPU cores
But only allocated 4 cores
Result:
┌─────────────────────────────────────┐
│ Time │ Tenant A's Experience │
├──────────┼──────────────────────────┤
│ 0-50ms │ Running (uses 4 cores) │
│ 50-100ms │ Throttled (waiting) │
│100-150ms │ Running (uses 4 cores) │
│150-200ms │ Throttled (waiting) │
└──────────┴──────────────────────────┘
Appears as 50% slower execution
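A rough model of this behavior, assuming Linux CFS-style bandwidth control with a 100ms period (the function name and the specific numbers are illustrative):

```python
def effective_cores(demand_cores, quota_us, period_us=100_000):
    """CFS bandwidth control: per period, the group may consume
    quota_us microseconds of CPU time (summed across all cores),
    then it is throttled until the next period begins."""
    allowed = quota_us / period_us      # cores' worth of runtime per period
    return min(demand_cores, allowed)

# The workload above wants 8 cores; the quota allows 4 cores' worth
# of time (400,000us per 100,000us period).
effective = effective_cores(8, quota_us=400_000)   # 4.0
slowdown = 1 - effective / 8                       # 0.5 → appears 50% slower
```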
## Memory Management

### Memory Limits

```yaml
# Kubernetes Pod example
resources:
  requests:
    memory: "4Gi"   # Guaranteed minimum
  limits:
    memory: "8Gi"   # Maximum allowed
```
### Out-of-Memory Handling

When the memory limit is exceeded, there are three common responses:

Option 1: OOM Kill
- The process is killed immediately
- The pod is restarted

Option 2: Swap
- Data is moved to disk
- Performance degrades significantly

Option 3: Reject
- New memory allocations fail
- The application handles the failure gracefully
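Option 3 can be sketched as a per-tenant accountant that rejects over-limit allocations instead of crashing (`MemoryBudget` is a hypothetical name for illustration, not a real library API):

```python
class MemoryBudget:
    """Tracks a tenant's memory use and rejects allocations over the limit."""

    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0

    def allocate(self, nbytes):
        if self.used + nbytes > self.limit:
            return False          # reject gracefully instead of OOM-killing
        self.used += nbytes
        return True

    def release(self, nbytes):
        self.used = max(0, self.used - nbytes)

budget = MemoryBudget(limit_bytes=4 * 1024**3)   # 4 GiB limit
ok = budget.allocate(3 * 1024**3)    # True: within budget
ok2 = budget.allocate(2 * 1024**3)   # False: would exceed the limit
```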
## Rate Limiting
Preventing API abuse:
### Token Bucket Algorithm

```
┌──────────────────────────────────────┐
│             Token Bucket             │
│                                      │
│ Capacity:    100 tokens              │
│ Refill Rate: 10 tokens/second        │
│                                      │
│ Each request consumes 1 token        │
│ If bucket empty → 429 Rate Limited   │
└──────────────────────────────────────┘
```
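A minimal token bucket in Python. This is an illustrative sketch of the algorithm above; a production implementation would also need locking and, for a fleet of API servers, shared state (e.g. in Redis):

```python
import time

class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `rate` tokens/second.
    Each request consumes one token; an empty bucket means reject (429)."""

    def __init__(self, capacity=100, rate=10.0, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.clock = clock           # injectable for testing
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should respond 429 Too Many Requests
```

Because refill is computed lazily from elapsed time, the bucket needs no background timer thread.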
### Tiered Rate Limits

```
┌────────────┬──────────────┬──────────────┐
│ Plan       │ Requests/min │ Requests/day │
├────────────┼──────────────┼──────────────┤
│ Free       │           60 │        1,000 │
│ Pro        │          600 │      100,000 │
│ Enterprise │        6,000 │    Unlimited │
└────────────┴──────────────┴──────────────┘
```
### Rate Limit Response

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1699999999

{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Please try again in 30 seconds."
}
```
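On the client side, a well-behaved caller honors `Retry-After` before retrying. A hedged sketch, where `send` stands in for whatever HTTP client returns `(status, headers, body)`:

```python
import time

def call_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call send() and retry on 429, waiting the number of seconds
    given in the Retry-After header between attempts."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Server tells us how long to back off; default to 1s
            sleep(int(headers.get("Retry-After", 1)))
    return status, body
```

Injecting `sleep` keeps the helper testable without real waiting.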
## Fair Scheduling

### Priority Queues

```
┌─────────────────────────────────────────────┐
│                Job Scheduler                │
│                                             │
│ High Priority Queue:   [J1] [J2]            │
│ Medium Priority Queue: [J3] [J4] [J5]       │
│ Low Priority Queue:    [J6] [J7] [J8] [J9]  │
│                                             │
│ Execution order: J1, J2, J3, J4, J5, ...    │
└─────────────────────────────────────────────┘
```
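This scheduler can be sketched with Python's `heapq`, using an insertion counter as a tie-breaker so jobs of equal priority run in FIFO order:

```python
import heapq
import itertools

HIGH, MEDIUM, LOW = 0, 1, 2
_counter = itertools.count()   # tie-breaker: preserves FIFO within a priority

def push(queue, priority, job):
    heapq.heappush(queue, (priority, next(_counter), job))

def pop(queue):
    return heapq.heappop(queue)[2]

queue = []
for prio, job in [(LOW, "J6"), (HIGH, "J1"), (MEDIUM, "J3"),
                  (HIGH, "J2"), (MEDIUM, "J4")]:
    push(queue, prio, job)

order = [pop(queue) for _ in range(len(queue))]
# → ["J1", "J2", "J3", "J4", "J6"]
```

Note that strict priority ordering can starve low-priority jobs indefinitely when high-priority work never stops arriving; weighted fair queuing, below, avoids this.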
### Weighted Fair Queuing

```
Enterprise Tenant: Weight 4
Pro Tenant:        Weight 2
Free Tenant:       Weight 1
```

For every 7 time slots:
- Enterprise gets 4
- Pro gets 2
- Free gets 1
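A coarse sketch of the 7-slot cycle above (`weighted_schedule` is an illustrative helper; real WFQ implementations interleave slots more evenly, typically via per-tenant virtual finish times):

```python
def weighted_schedule(weights, slots):
    """Weighted round-robin: in each 7-slot cycle (for weights 4+2+1),
    every tenant receives a number of slots equal to its weight."""
    cycle = [tenant for tenant, w in weights for _ in range(w)]
    return [cycle[i % len(cycle)] for i in range(slots)]

sched = weighted_schedule(
    [("enterprise", 4), ("pro", 2), ("free", 1)], slots=7)
# One 7-slot cycle: enterprise ×4, pro ×2, free ×1
```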
## Quality of Service (QoS)

### SLA Tiers

```
┌─────────────────────────────────────────┐
│             SLA Definitions             │
├──────────┬──────────────┬───────────────┤
│ Tier     │ Availability │ Latency (P99) │
├──────────┼──────────────┼───────────────┤
│ Premium  │ 99.99%       │ 50ms          │
│ Standard │ 99.9%        │ 200ms         │
│ Basic    │ 99.5%        │ 500ms         │
└──────────┴──────────────┴───────────────┘
```
### Implementing QoS

```python
# Route premium tenants to better resources
def route_request(tenant):
    if tenant.tier == 'premium':
        return premium_server_pool
    elif tenant.tier == 'standard':
        return standard_server_pool
    else:
        return shared_server_pool
```
## Monitoring and Alerting

### Per-Tenant Metrics

Track for each tenant:

```
┌────────────────────────┬──────────┬─────────┐
│ Metric                 │ Value    │ Limit   │
├────────────────────────┼──────────┼─────────┤
│ CPU Usage              │ 45%      │ 100%    │
│ Memory Usage           │ 12GB     │ 16GB    │
│ Storage Used           │ 78GB     │ 100GB   │
│ API Requests/min       │ 450      │ 1000    │
│ Bandwidth Used         │ 2.3TB    │ 5TB     │
└────────────────────────┴──────────┴─────────┘
```
### Alert Thresholds

```yaml
alerts:
  - name: tenant_cpu_high
    condition: cpu_usage > 80%
    duration: 5m
    action: notify_tenant
  - name: tenant_storage_critical
    condition: storage_usage > 90%
    action: notify_tenant_and_ops
  - name: tenant_rate_limited
    condition: rate_limit_hit > 100/hour
    action: suggest_upgrade
```
## Cost Allocation

### Usage-Based Billing

Tenant A's monthly bill:

```
┌──────────────────┬──────────┬───────────────┐
│ Resource         │ Usage    │ Cost          │
├──────────────────┼──────────┼───────────────┤
│ Compute Hours    │ 720h     │ $144.00       │
│ Storage          │ 50GB     │ $5.00         │
│ Network Egress   │ 100GB    │ $8.00         │
│ API Calls        │ 1M       │ $1.00         │
├──────────────────┼──────────┼───────────────┤
│ Total            │          │ $158.00       │
└──────────────────┴──────────┴───────────────┘
```
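The invoice arithmetic can be reproduced from per-unit prices. The `PRICES` rate card below is hypothetical, back-derived from the example bill ($0.20/compute-hour, $0.10/GB-month, $0.08/GB egress, $1 per million API calls):

```python
# Hypothetical unit prices; real providers publish their own rate cards
PRICES = {
    "compute_hours": 0.20,       # $ per hour
    "storage_gb": 0.10,          # $ per GB-month
    "egress_gb": 0.08,           # $ per GB
    "api_call_millions": 1.00,   # $ per million calls
}

def monthly_bill(usage):
    """Sum usage * unit price over each metered resource."""
    return round(sum(usage[k] * PRICES[k] for k in usage), 2)

bill = monthly_bill({
    "compute_hours": 720,
    "storage_gb": 50,
    "egress_gb": 100,
    "api_call_millions": 1,
})
# → 158.0, matching the invoice above
```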
### Reserved vs On-Demand
Reserved Pricing:
- Pay upfront for guaranteed resources
- Lower per-unit cost
- Use it or lose it
On-Demand Pricing:
- Pay for what you use
- Higher per-unit cost
- Flexible scaling
## Resource Isolation Techniques

### Linux cgroups

Kernel-level resource isolation:

```shell
# Create a cgroup for the tenant (cgroups v1 interface, libcgroup tools)
cgcreate -g cpu,memory:tenant_123

# Set CPU limit: 50,000us of runtime per 100,000us default period = 50% of one core
cgset -r cpu.cfs_quota_us=50000 tenant_123

# Set memory limit (4GB)
cgset -r memory.limit_in_bytes=4G tenant_123

# Run the tenant's process inside the cgroup
cgexec -g cpu,memory:tenant_123 /app/process
```
### Network Bandwidth Limits

```shell
# Using tc (traffic control): cap traffic from subnet 10.0.1.0/24 at 100 Mbit/s
tc qdisc add dev eth0 root handle 1: htb
tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit
tc filter add dev eth0 parent 1: protocol ip prio 1 \
  u32 match ip src 10.0.1.0/24 flowid 1:1
```
## What's Next?
In the final chapter, we'll explore how Float16 implements multi-tenancy for GPU workloads - the unique challenges and solutions for AI/ML infrastructure.