When to Use Serverless vs Dedicated GPU
Now that you understand both approaches, let's build a decision framework to choose the right one for your use case.
The Decision Matrix
                  Low Utilization      High Utilization
                  (< 30%)              (> 50%)
                ┌─────────────────────┬─────────────────────┐
Short Requests  │ SERVERLESS          │ DEDICATED           │
(< 1 minute)    │ Best choice         │ Consider both       │
                ├─────────────────────┼─────────────────────┤
Long Jobs       │ DEDICATED           │ DEDICATED           │
(> 10 minutes)  │ (with scheduling)   │ Best choice         │
                └─────────────────────┴─────────────────────┘
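If you prefer this matrix in code, a minimal lookup sketch is shown below. The names are illustrative (not a library API), and the thresholds simply mirror the matrix edges; workloads in the middle bands (1-10 minutes, 30-50% utilization) still call for judgment.

# Illustrative lookup mirroring the matrix above (hypothetical names, no library API)
DECISION_MATRIX = {
    ("short", "low"):  "SERVERLESS (best choice)",
    ("short", "high"): "Consider both",
    ("long", "low"):   "DEDICATED (with scheduling)",
    ("long", "high"):  "DEDICATED (best choice)",
}

def matrix_recommendation(job_minutes, utilization):
    # Buckets follow the matrix edges; the 1-10 minute and 30-50% bands remain judgment calls
    duration = "short" if job_minutes < 1 else "long"
    load = "low" if utilization < 0.30 else "high"
    return DECISION_MATRIX[(duration, load)]

print(matrix_recommendation(0.5, 0.20))  # SERVERLESS (best choice)
print(matrix_recommendation(30, 0.70))   # DEDICATED (best choice)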
Cost Analysis Framework
Break-Even Calculation
When does serverless become more expensive than dedicated?
Serverless Cost = Price per request × Number of requests
Dedicated Cost = Hourly rate × Hours running
Break-even point:
Requests = (Hourly rate × Hours) / Price per request
Example Calculation
Serverless: $0.002 per request (50ms average)
Dedicated: $4.00 per hour (A100)
Monthly hours: 720
Dedicated monthly cost: $4.00 × 720 = $2,880
Break-even requests: $2,880 / $0.002 = 1,440,000
If you have:
< 1.44M requests/month → Serverless cheaper
> 1.44M requests/month → Dedicated cheaper
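As a quick sanity check, the same arithmetic can be scripted. This is a minimal sketch; the function name is illustrative and the inputs are the example figures above, not quoted prices.

def break_even_requests(hourly_rate, hours_per_month, price_per_request):
    # Monthly request volume at which dedicated and serverless cost the same
    dedicated_monthly = hourly_rate * hours_per_month
    return dedicated_monthly / price_per_request

# Example figures from above: $4.00/hour, 720 hours/month, $0.002/request
print(break_even_requests(4.00, 720, 0.002))  # 1440000.0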
Cost Comparison Chart
Monthly Cost by Request Volume:

$5,000 ┤                                     ╱   Serverless
       │                                 ╱
$4,000 ┤                             ╱
       │                          ╱
$3,000 ┤──────────────────────╳───────────────────  Dedicated (~$2,880)
       │                  ╱   │
$2,000 ┤               ╱      │
       │           ╱          │
$1,000 ┤       ╱              │
       │    ╱                 │
    $0 ┼──────────────────────┴───────────────────
       0       500K    1M      1.5M    2M      2.5M
                     Requests/Month

               Break-even: ~1.4M requests/month
Decision Flowchart
START
  │
  ▼
┌─────────────────────────────────┐
│  Is this a training workload?   │
└────────────────┬────────────────┘
                 │
         ┌──YES──┴──NO──┐
         ▼              ▼
     DEDICATED  ┌──────────────────────┐
                │ Is latency critical? │
                │ (< 500ms required)   │
                └──────────┬───────────┘
                           │
                   ┌──YES──┴──NO──┐
                   ▼              ▼
               DEDICATED   ┌─────────────────┐
               (keep warm) │ Is traffic      │
                           │ predictable?    │
                           └────────┬────────┘
                                    │
                            ┌──YES──┴──NO──┐
                            ▼              ▼
                   ┌───────────────────┐  SERVERLESS
                   │ > 50% utilization │
                   │ expected?         │
                   └─────────┬─────────┘
                             │
                     ┌──YES──┴──NO──┐
                     ▼              ▼
                 DEDICATED      SERVERLESS
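The same flow can also be written as a short function, which is handy for documenting the policy alongside infrastructure code. This is only a sketch of the chart above; the parameter names are illustrative.

def recommend_gpu_strategy(is_training, latency_critical,
                           traffic_predictable, expected_utilization):
    # Walks the flowchart top to bottom and returns a recommendation string
    if is_training:
        return "DEDICATED"
    if latency_critical:                  # sub-500ms requirement
        return "DEDICATED (keep warm)"
    if not traffic_predictable:
        return "SERVERLESS"
    if expected_utilization > 0.50:
        return "DEDICATED"
    return "SERVERLESS"

print(recommend_gpu_strategy(False, False, True, 0.70))  # DEDICATED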
Use Case Mapping
Training & Fine-tuning
| Scenario | Recommendation | Reason |
|---|---|---|
| Large model training (days) | Dedicated + Reserved | High utilization, cost savings |
| Fine-tuning (hours) | Dedicated + Spot | Fault-tolerant, cost savings |
| Experiment iteration | Serverless | Low utilization between experiments |
| Hyperparameter search | Dedicated + Spot | Parallelizable, interruptible |
Inference
| Scenario | Recommendation | Reason |
|---|---|---|
| Real-time chatbot | Dedicated | Low latency required |
| Batch processing | Serverless or Spot | Variable load, interruptible |
| API with variable traffic | Serverless | Auto-scaling, pay-per-use |
| High-volume API | Dedicated | Cost-effective at scale |
| Development/testing | Serverless | Low utilization |
By Company Stage
STARTUP (< $10K/month GPU budget):
┌─────────────────────────────────────────────────────────┐
│ Development: Serverless                                 │
│ Testing:     Serverless                                 │
│ Production:  Serverless (start) → Dedicated (scale)     │
│                                                         │
│ Priority: Minimize upfront costs, validate product      │
└─────────────────────────────────────────────────────────┘
GROWTH ($10K-$100K/month GPU budget):
┌─────────────────────────────────────────────────────────┐
│ Development: Serverless                                 │
│ Training:    Dedicated + Spot                           │
│ Production:  Dedicated + Reserved (core capacity)       │
│              Serverless (burst capacity)                │
│                                                         │
│ Priority: Optimize cost while maintaining quality       │
└─────────────────────────────────────────────────────────┘
ENTERPRISE (> $100K/month GPU budget):
┌─────────────────────────────────────────────────────────┐
│ Development: Dedicated (shared dev cluster)             │
│ Training:    Dedicated + Reserved clusters              │
│ Production:  Dedicated + Reserved + Multi-region        │
│                                                         │
│ Priority: Reliability, compliance, predictable costs    │
└─────────────────────────────────────────────────────────┘
Hybrid Architecture
The best solution often combines both approaches:
              ┌─────────────────────────────────────┐
              │            Load Balancer            │
              └──────────────────┬──────────────────┘
                                 │
                    ┌────────────┴────────────┐
                    │                         │
             ┌──────▼──────┐           ┌──────▼──────┐
             │  DEDICATED  │           │ SERVERLESS  │
             │  Instances  │           │    Pool     │
             │             │           │             │
             │  Base load  │           │ Burst load  │
             │ 0-1000 RPS  │           │  1000+ RPS  │
             └─────────────┘           └─────────────┘

Benefits:
• Predictable cost for base load
• Auto-scale for bursts
• No over-provisioning
Implementation Example
DEDICATED_CAPACITY = 1000  # base-load RPS the dedicated fleet is sized for (see diagram above)

def route_request(request, current_load):
    # Dedicated handles the base load (cheaper per request at steady utilization)
    if current_load < DEDICATED_CAPACITY:
        return dedicated_inference(request)
    # Serverless handles the overflow (auto-scales with demand)
    return serverless_inference(request)
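One detail the sketch leaves open is where current_load comes from. A common approach is a sliding-window request counter; the single-process version below is an illustrative assumption, not part of any routing library.

import time
from collections import deque

class LoadTracker:
    # Counts requests over a sliding window to estimate requests per second
    def __init__(self, window_seconds=1.0):
        self.window = window_seconds
        self.timestamps = deque()

    def record(self):
        now = time.monotonic()
        self.timestamps.append(now)
        # Drop timestamps that have fallen out of the window
        while self.timestamps and self.timestamps[0] < now - self.window:
            self.timestamps.popleft()

    def current_rps(self):
        return len(self.timestamps) / self.window

# Usage with route_request above (illustrative):
#   tracker = LoadTracker()
#   tracker.record()
#   response = route_request(request, tracker.current_rps())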
Key Questions to Ask
About Your Workload
- How long does each task run?
  - < 1 minute → Serverless favored
  - > 10 minutes → Dedicated favored
- How predictable is your traffic?
  - Highly variable → Serverless
  - Steady/predictable → Dedicated
- What's your latency requirement?
  - Sub-second critical → Dedicated (warm)
  - Seconds acceptable → Either
- What's your expected utilization?
  - < 30% → Serverless
  - > 50% → Dedicated
About Your Team
- Do you have infrastructure expertise?
  - Limited → Serverless
  - Strong → Either
- How important is full control?
  - Must customize everything → Dedicated
  - Standard setup fine → Serverless
- What's your DevOps capacity?
  - Limited → Serverless
  - Dedicated team → Either
About Your Business
- What's your budget model?
  - Variable OK → Serverless
  - Fixed budget → Dedicated + Reserved
- What are your compliance requirements?
  - Strict isolation → Dedicated
  - Standard → Either
- What's your growth trajectory?
  - Uncertain → Serverless (flexibility)
  - Clear growth → Plan for Dedicated
Common Mistakes to Avoid
Mistake 1: Serverless for Training
❌ WRONG:
Training job on serverless → Pays cold start repeatedly
                           → Timeouts on long jobs
                           → State management issues

✓ RIGHT:
Training on dedicated → Consistent environment
                      → No cold starts
                      → Persistent checkpoints
Mistake 2: Dedicated for Sporadic Use
❌ WRONG:
Dedicated A100 for dev → $2,880/month
Actual usage           → 10 hours/month
Effective cost         → $288/hour (!)

✓ RIGHT:
Serverless for dev → Pay only for 10 hours
                   → ~$40-100/month
Mistake 3: Ignoring Cold Starts
❌ WRONG:
Production API on serverless without warm instances
→ First users wait 15+ seconds
✓ RIGHT:
Configure minimum instances OR use dedicated
→ Consistent sub-second response
Mistake 4: Over-Provisioning
❌ WRONG:
"We might need 100 GPUs" → Rent 100 dedicated
Actual usage             → 10 GPUs average
Waste                    → 90% of budget

✓ RIGHT:
10 dedicated (base) + serverless (bursts)
  → Pay for what you use
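To put numbers on the difference, the sketch below reuses the example prices from the cost section; the GPU counts and the 2M burst requests are illustrative assumptions.

HOURLY_RATE = 4.00         # example dedicated rate from above, $/hour
HOURS_PER_MONTH = 720
PRICE_PER_REQUEST = 0.002  # example serverless rate from above

def over_provisioned_cost(gpus_for_peak):
    # Renting dedicated capacity for the hypothetical peak, around the clock
    return gpus_for_peak * HOURLY_RATE * HOURS_PER_MONTH

def hybrid_cost(base_gpus, burst_requests_per_month):
    # Dedicated for the base load plus serverless for the overflow
    dedicated = base_gpus * HOURLY_RATE * HOURS_PER_MONTH
    serverless = burst_requests_per_month * PRICE_PER_REQUEST
    return dedicated + serverless

print(over_provisioned_cost(100))   # 288000.0 per month, mostly idle
print(hybrid_cost(10, 2_000_000))   # 32800.0 per month (28,800 base + 4,000 burst)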
Summary: Quick Reference
| Factor | Serverless | Dedicated |
|---|---|---|
| Traffic Pattern | Variable | Steady |
| Job Duration | Short | Long |
| Utilization | < 30% | > 50% |
| Cold Start OK? | Yes | N/A |
| Budget Type | Variable | Fixed |
| Control Needed | Low | High |
| Team Size | Small | Any |
| Compliance | Standard | Strict |
What's Next?
In the final chapter, we'll explore Float16's specific offerings for both serverless and dedicated GPU access, and how to get started with each.