Chapter 4: Decision Framework

When to Use Serverless vs Dedicated GPU

A comprehensive decision framework for choosing between serverless and dedicated GPUs: cost analysis, use case mapping, and practical guidelines.

Now that you understand both approaches, let's build a decision framework to choose the right one for your use case.

The Decision Matrix

                        Low Utilization    High Utilization
                        (<30%)             (>50%)
                    ┌─────────────────┬─────────────────┐
  Short Requests    │   SERVERLESS    │   DEDICATED     │
  (< 1 minute)      │   Best choice   │   Consider both │
                    ├─────────────────┼─────────────────┤
  Long Jobs         │   DEDICATED     │   DEDICATED     │
  (> 10 minutes)    │(with scheduling)│   Best choice   │
                    └─────────────────┴─────────────────┘

Cost Analysis Framework

Break-Even Calculation

When does serverless become more expensive than dedicated?

Serverless Cost = Price per request × Number of requests
Dedicated Cost  = Hourly rate × Hours running

Break-even point:
Requests = (Hourly rate × Hours) / Price per request

Example Calculation

Serverless: $0.002 per request (50ms average)
Dedicated: $4.00 per hour (A100)

Monthly hours: 720
Dedicated monthly cost: $4.00 × 720 = $2,880

Break-even requests: $2,880 / $0.002 = 1,440,000

If you have:
< 1.44M requests/month → Serverless cheaper
> 1.44M requests/month → Dedicated cheaper
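
This arithmetic is easy to script. A minimal sketch (the function name and keyword values are this example's own, not a provider API):

```python
def break_even_requests(hourly_rate, hours, price_per_request):
    """Requests/month at which serverless cost equals dedicated cost."""
    return (hourly_rate * hours) / price_per_request

# Values from the example above: A100 at $4.00/hour, running all month.
threshold = break_even_requests(hourly_rate=4.00, hours=720,
                                price_per_request=0.002)
print(f"Break-even: {threshold:,.0f} requests/month")  # ≈ 1,440,000
```

Plug in your own hourly rate and per-request price to find where your curves cross.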

Cost Comparison Chart

Monthly Cost by Request Volume:

$5,000 ┤                                    ╱ Serverless
       │                                  ╱
$4,000 ┤                               ╱
       │                            ╱
$3,000 ┤─────────────────────────╳──────── Dedicated
       │                       ╱│
$2,000 ┤                    ╱   │
       │                 ╱      │
$1,000 ┤              ╱         │
       │           ╱            │
$0     ┼────────╱───────────────┴──────────
       0    500K    1M    1.5M    2M    2.5M
                   Requests/Month

       Break-even: ~1.4M requests

Decision Flowchart

START
  │
  ▼
┌─────────────────────────────────┐
│ Is this a training workload?    │
└─────────────────┬───────────────┘
                  │
        ┌────YES──┴──NO────┐
        ▼                   ▼
   DEDICATED        ┌──────────────────────┐
                    │ Is latency critical? │
                    │ (< 500ms required)   │
                    └──────────┬───────────┘
                               │
                     ┌───YES───┴───NO────┐
                     ▼                    ▼
              DEDICATED          ┌───────────────────┐
              (keep warm)        │ Is traffic        │
                                │ predictable?       │
                                └────────┬──────────┘
                                         │
                              ┌───YES────┴───NO─────┐
                              ▼                      ▼
                    ┌──────────────────┐      SERVERLESS
                    │ > 50% utilization│
                    │ expected?        │
                    └────────┬─────────┘
                             │
                   ┌───YES───┴───NO────┐
                   ▼                    ▼
              DEDICATED            SERVERLESS
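
The same flowchart can be expressed as a small function. This is an illustrative sketch; the parameter names and return strings are invented for this example, not an API:

```python
def choose_gpu_strategy(is_training, latency_critical,
                        traffic_predictable, expected_utilization):
    """Walk the decision flowchart above; returns a recommendation string."""
    if is_training:
        return "dedicated"
    if latency_critical:  # sub-500ms response required
        return "dedicated (keep warm)"
    if not traffic_predictable:
        return "serverless"
    if expected_utilization > 0.5:
        return "dedicated"
    return "serverless"

# Example: a bursty, latency-tolerant inference API lands on serverless.
print(choose_gpu_strategy(False, False, False, 0.2))  # serverless
```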

Use Case Mapping

Training & Fine-tuning

Scenario                      Recommendation         Reason
Large model training (days)   Dedicated + Reserved   High utilization, cost savings
Fine-tuning (hours)           Dedicated + Spot       Fault-tolerant, cost savings
Experiment iteration          Serverless             Low utilization between experiments
Hyperparameter search         Dedicated + Spot       Parallelizable, interruptible

Inference

Scenario                    Recommendation       Reason
Real-time chatbot           Dedicated            Low latency required
Batch processing            Serverless or Spot   Variable load, interruptible
API with variable traffic   Serverless           Auto-scaling, pay-per-use
High-volume API             Dedicated            Cost-effective at scale
Development/testing         Serverless           Low utilization

By Company Stage

STARTUP (< $10K/month GPU budget):
┌─────────────────────────────────────────────────────────┐
│  Development:  Serverless                               │
│  Testing:      Serverless                               │
│  Production:   Serverless (start) → Dedicated (scale)  │
│                                                         │
│  Priority: Minimize upfront costs, validate product     │
└─────────────────────────────────────────────────────────┘

GROWTH ($10K-$100K/month GPU budget):
┌─────────────────────────────────────────────────────────┐
│  Development:  Serverless                               │
│  Training:     Dedicated + Spot                         │
│  Production:   Dedicated + Reserved (core capacity)     │
│                Serverless (burst capacity)              │
│                                                         │
│  Priority: Optimize cost while maintaining quality      │
└─────────────────────────────────────────────────────────┘

ENTERPRISE (> $100K/month GPU budget):
┌─────────────────────────────────────────────────────────┐
│  Development:  Dedicated (shared dev cluster)           │
│  Training:     Dedicated + Reserved clusters            │
│  Production:   Dedicated + Reserved + Multi-region      │
│                                                         │
│  Priority: Reliability, compliance, predictable costs   │
└─────────────────────────────────────────────────────────┘

Hybrid Architecture

The best solution often combines both approaches:

┌─────────────────────────────────────────────────────────┐
│              Hybrid GPU Architecture                     │
│                                                         │
│  ┌─────────────────────────────────────────┐           │
│  │         Load Balancer                    │           │
│  └───────────────────┬─────────────────────┘           │
│                      │                                  │
│         ┌────────────┴────────────┐                    │
│         │                         │                    │
│  ┌──────▼──────┐          ┌──────▼──────┐             │
│  │  DEDICATED  │          │ SERVERLESS  │             │
│  │  Instances  │          │   Pool      │             │
│  │             │          │             │             │
│  │ Base load   │          │ Burst load  │             │
│  │ 0-1000 RPS  │          │ 1000+ RPS   │             │
│  └─────────────┘          └─────────────┘             │
│                                                         │
│  Benefits:                                              │
│  • Predictable cost for base load                      │
│  • Auto-scale for bursts                               │
│  • No over-provisioning                                │
└─────────────────────────────────────────────────────────┘

Implementation Example

# Sketch: the capacity threshold and backend functions are placeholders
# for your own infrastructure, not a real API.
DEDICATED_CAPACITY = 1000  # requests/sec the dedicated pool can absorb

def route_request(request, current_load):
    # Dedicated handles base load (cheaper per request)
    if current_load < DEDICATED_CAPACITY:
        return dedicated_inference(request)
    # Serverless handles overflow (auto-scales)
    return serverless_inference(request)

Key Questions to Ask

About Your Workload

  1. How long does each task run?

    • < 1 minute → Serverless favored
    • > 10 minutes → Dedicated favored

  2. How predictable is your traffic?

    • Highly variable → Serverless
    • Steady/predictable → Dedicated

  3. What's your latency requirement?

    • Sub-second critical → Dedicated (warm)
    • Seconds acceptable → Either

  4. What's your expected utilization?

    • < 30% → Serverless
    • > 50% → Dedicated

About Your Team

  1. Do you have infrastructure expertise?

    • Limited → Serverless
    • Strong → Either

  2. How important is full control?

    • Must customize everything → Dedicated
    • Standard setup fine → Serverless

  3. What's your DevOps capacity?

    • Limited → Serverless
    • Dedicated team → Either

About Your Business

  1. What's your budget model?

    • Variable OK → Serverless
    • Fixed budget → Dedicated + Reserved

  2. What are compliance requirements?

    • Strict isolation → Dedicated
    • Standard → Either

  3. What's your growth trajectory?

    • Uncertain → Serverless (flexibility)
    • Clear growth → Plan for Dedicated

Common Mistakes to Avoid

Mistake 1: Serverless for Training

❌ WRONG:
Training job on serverless → Pays cold start repeatedly
                          → Timeouts on long jobs
                          → State management issues

✓ RIGHT:
Training on dedicated → Consistent environment
                     → No cold starts
                     → Persistent checkpoints

Mistake 2: Dedicated for Sporadic Use

❌ WRONG:
Dedicated A100 for dev → $2,880/month
Actual usage → 10 hours/month
Effective cost → $288/hour (!)

✓ RIGHT:
Serverless for dev → Pay only for 10 hours
                  → ~$40-100/month
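
The effective rate in this trap is worth computing explicitly. A quick sketch using the numbers above:

```python
# Always-on dedicated A100 used only sporadically for development.
monthly_cost = 4.00 * 720   # $4.00/hour x 720 hours = $2,880
hours_used = 10             # actual usage per month
effective_rate = monthly_cost / hours_used
print(f"Effective cost: ${effective_rate:.0f}/hour")  # $288/hour vs. $4/hour list price
```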

Mistake 3: Ignoring Cold Starts

❌ WRONG:
Production API on serverless without warm instances
→ First users wait 15+ seconds

✓ RIGHT:
Configure minimum instances OR use dedicated
→ Consistent sub-second response

Mistake 4: Over-Provisioning

❌ WRONG:
"We might need 100 GPUs" → Rent 100 dedicated
Actual usage → 10 GPUs average
Waste → 90% of budget

✓ RIGHT:
10 dedicated (base) + serverless (bursts)
→ Pay for what you use

Summary: Quick Reference

Factor            Serverless   Dedicated
Traffic Pattern   Variable     Steady
Job Duration      Short        Long
Utilization       < 30%        > 50%
Cold Start OK?    Yes          N/A
Budget Type       Variable     Fixed
Control Needed    Low          High
Team Size         Small        Any
Compliance        Standard     Strict

What's Next?

In the final chapter, we'll explore Float16's specific offerings for both serverless and dedicated GPU access, and how to get started with each.