Chapter 4: Decision Framework

When to Use Serverless vs Dedicated GPU

A comprehensive decision framework for choosing between serverless and dedicated GPUs: cost analysis, use case mapping, and practical guidelines.

Now that you understand both approaches, let's build a decision framework to choose the right one for your use case.

The Decision Matrix

                        Low Utilization    High Utilization
                        (<30%)             (>50%)
                    ┌─────────────────┬─────────────────┐
  Short Requests    │   SERVERLESS    │   DEDICATED     │
  (< 1 minute)      │   Best choice   │   Consider both │
                    ├─────────────────┼─────────────────┤
  Long Jobs         │   DEDICATED     │   DEDICATED     │
  (> 10 minutes)    │(with scheduling)│   Best choice   │
                    └─────────────────┴─────────────────┘

Cost Analysis Framework

Break-Even Calculation

When does serverless become more expensive than dedicated?

Serverless Cost = Price per request × Number of requests
Dedicated Cost  = Hourly rate × Hours running

Break-even point:
Requests = (Hourly rate × Hours) / Price per request

Example Calculation

Serverless: $0.002 per request (50ms average)
Dedicated: $4.00 per hour (A100)

Monthly hours: 720
Dedicated monthly cost: $4.00 × 720 = $2,880

Break-even requests: $2,880 / $0.002 = 1,440,000

If you have:
< 1.44M requests/month → Serverless cheaper
> 1.44M requests/month → Dedicated cheaper
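
This arithmetic is easy to script. A minimal sketch (the function name and keyword values are this example's own, not a provider API):

```python
def break_even_requests(hourly_rate, hours, price_per_request):
    """Requests/month at which serverless cost equals dedicated cost."""
    return (hourly_rate * hours) / price_per_request

# Values from the example above: A100 at $4.00/hour, running all month.
threshold = break_even_requests(hourly_rate=4.00, hours=720,
                                price_per_request=0.002)
print(f"Break-even: {threshold:,.0f} requests/month")  # ≈ 1,440,000
```

Plug in your own hourly rate and per-request price to find where your curves cross.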

Cost Comparison Chart

Monthly Cost by Request Volume:

$5,000 ┤                                    ╱ Serverless
       │                                  ╱
$4,000 ┤                               ╱
       │                            ╱
$3,000 ┤─────────────────────────╳──────── Dedicated
       │                       ╱│
$2,000 ┤                    ╱   │
       │                 ╱      │
$1,000 ┤              ╱         │
       │           ╱            │
$0     ┼────────╱───────────────┴──────────
       0    500K    1M    1.5M    2M    2.5M
                   Requests/Month

       Break-even: ~1.4M requests

Decision Flowchart

START
  │
  ▼
┌─────────────────────────────────┐
│ Is this a training workload?    │
└─────────────────┬───────────────┘
                  │
        ┌────YES──┴──NO────┐
        ▼                   ▼
   DEDICATED        ┌──────────────────────┐
                    │ Is latency critical? │
                    │ (< 500ms required)   │
                    └──────────┬───────────┘
                               │
                     ┌───YES───┴───NO────┐
                     ▼                    ▼
              DEDICATED          ┌───────────────────┐
              (keep warm)        │ Is traffic        │
                                │ predictable?       │
                                └────────┬──────────┘
                                         │
                              ┌───YES────┴───NO─────┐
                              ▼                      ▼
                    ┌──────────────────┐      SERVERLESS
                    │ > 50% utilization│
                    │ expected?        │
                    └────────┬─────────┘
                             │
                   ┌───YES───┴───NO────┐
                   ▼                    ▼
              DEDICATED            SERVERLESS
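
The same flowchart can be expressed as a small function. This is an illustrative sketch; the parameter names and return strings are invented for this example, not an API:

```python
def choose_gpu_strategy(is_training, latency_critical,
                        traffic_predictable, expected_utilization):
    """Walk the decision flowchart above; returns a recommendation string."""
    if is_training:
        return "dedicated"
    if latency_critical:  # sub-500ms response required
        return "dedicated (keep warm)"
    if not traffic_predictable:
        return "serverless"
    if expected_utilization > 0.5:
        return "dedicated"
    return "serverless"

# Example: a bursty, latency-tolerant inference API lands on serverless.
print(choose_gpu_strategy(False, False, False, 0.2))  # serverless
```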

Use Case Mapping

Training & Fine-tuning

Scenario                      Recommendation         Reason
Large model training (days)   Dedicated + Reserved   High utilization, cost savings
Fine-tuning (hours)           Dedicated + Spot       Fault-tolerant, cost savings
Experiment iteration          Serverless             Low utilization between experiments
Hyperparameter search         Dedicated + Spot       Parallelizable, interruptible

Inference

Scenario                    Recommendation       Reason
Real-time chatbot           Dedicated            Low latency required
Batch processing            Serverless or Spot   Variable load, interruptible
API with variable traffic   Serverless           Auto-scaling, pay-per-use
High-volume API             Dedicated            Cost-effective at scale
Development/testing         Serverless           Low utilization

By Company Stage

STARTUP (< $10K/month GPU budget):
┌─────────────────────────────────────────────────────────┐
│  Development:  Serverless                               │
│  Testing:      Serverless                               │
│  Production:   Serverless (start) → Dedicated (scale)  │
│                                                         │
│  Priority: Minimize upfront costs, validate product     │
└─────────────────────────────────────────────────────────┘

GROWTH ($10K-$100K/month GPU budget):
┌─────────────────────────────────────────────────────────┐
│  Development:  Serverless                               │
│  Training:     Dedicated + Spot                         │
│  Production:   Dedicated + Reserved (core capacity)     │
│                Serverless (burst capacity)              │
│                                                         │
│  Priority: Optimize cost while maintaining quality      │
└─────────────────────────────────────────────────────────┘

ENTERPRISE (> $100K/month GPU budget):
┌─────────────────────────────────────────────────────────┐
│  Development:  Dedicated (shared dev cluster)           │
│  Training:     Dedicated + Reserved clusters            │
│  Production:   Dedicated + Reserved + Multi-region      │
│                                                         │
│  Priority: Reliability, compliance, predictable costs   │
└─────────────────────────────────────────────────────────┘

Hybrid Architecture

The best solution often combines both approaches:

┌─────────────────────────────────────────────────────────┐
│              Hybrid GPU Architecture                     │
│                                                         │
│  ┌─────────────────────────────────────────┐           │
│  │         Load Balancer                    │           │
│  └───────────────────┬─────────────────────┘           │
│                      │                                  │
│         ┌────────────┴────────────┐                    │
│         │                         │                    │
│  ┌──────▼──────┐          ┌──────▼──────┐             │
│  │  DEDICATED  │          │ SERVERLESS  │             │
│  │  Instances  │          │   Pool      │             │
│  │             │          │             │             │
│  │ Base load   │          │ Burst load  │             │
│  │ 0-1000 RPS  │          │ 1000+ RPS   │             │
│  └─────────────┘          └─────────────┘             │
│                                                         │
│  Benefits:                                              │
│  • Predictable cost for base load                      │
│  • Auto-scale for bursts                               │
│  • No over-provisioning                                │
└─────────────────────────────────────────────────────────┘

Implementation Example

# Sketch: the capacity threshold and backend functions are placeholders
# for your own infrastructure, not a real API.
DEDICATED_CAPACITY = 1000  # requests/sec the dedicated pool can absorb

def route_request(request, current_load):
    # Dedicated handles base load (cheaper per request)
    if current_load < DEDICATED_CAPACITY:
        return dedicated_inference(request)
    # Serverless handles overflow (auto-scales)
    return serverless_inference(request)

Key Questions to Ask

About Your Workload

  1. How long does each task run?

    • < 1 minute → Serverless favored
    • > 10 minutes → Dedicated favored

  2. How predictable is your traffic?

    • Highly variable → Serverless
    • Steady/predictable → Dedicated

  3. What's your latency requirement?

    • Sub-second critical → Dedicated (warm)
    • Seconds acceptable → Either

  4. What's your expected utilization?

    • < 30% → Serverless
    • > 50% → Dedicated

About Your Team

  1. Do you have infrastructure expertise?

    • Limited → Serverless
    • Strong → Either

  2. How important is full control?

    • Must customize everything → Dedicated
    • Standard setup fine → Serverless

  3. What's your DevOps capacity?

    • Limited → Serverless
    • Dedicated team → Either

About Your Business

  1. What's your budget model?

    • Variable OK → Serverless
    • Fixed budget → Dedicated + Reserved

  2. What are compliance requirements?

    • Strict isolation → Dedicated
    • Standard → Either

  3. What's your growth trajectory?

    • Uncertain → Serverless (flexibility)
    • Clear growth → Plan for Dedicated

Common Mistakes to Avoid

Mistake 1: Serverless for Training

❌ WRONG:
Training job on serverless → Pays cold start repeatedly
                          → Timeouts on long jobs
                          → State management issues

✓ RIGHT:
Training on dedicated → Consistent environment
                     → No cold starts
                     → Persistent checkpoints

Mistake 2: Dedicated for Sporadic Use

❌ WRONG:
Dedicated A100 for dev → $2,880/month
Actual usage → 10 hours/month
Effective cost → $288/hour (!)

✓ RIGHT:
Serverless for dev → Pay only for 10 hours
                  → ~$40-100/month
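
The effective rate in this trap is worth computing explicitly. A quick sketch using the numbers above:

```python
# Always-on dedicated A100 used only sporadically for development.
monthly_cost = 4.00 * 720   # $4.00/hour x 720 hours = $2,880
hours_used = 10             # actual usage per month
effective_rate = monthly_cost / hours_used
print(f"Effective cost: ${effective_rate:.0f}/hour")  # $288/hour vs. $4/hour list price
```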

Mistake 3: Ignoring Cold Starts

❌ WRONG:
Production API on serverless without warm instances
→ First users wait 15+ seconds

✓ RIGHT:
Configure minimum instances OR use dedicated
→ Consistent sub-second response

Mistake 4: Over-Provisioning

❌ WRONG:
"We might need 100 GPUs" → Rent 100 dedicated
Actual usage → 10 GPUs average
Waste → 90% of budget

✓ RIGHT:
10 dedicated (base) + serverless (bursts)
→ Pay for what you use

Summary: Quick Reference

Factor            Serverless   Dedicated
Traffic Pattern   Variable     Steady
Job Duration      Short        Long
Utilization       < 30%        > 50%
Cold Start OK?    Yes          N/A
Budget Type       Variable     Fixed
Control Needed    Low          High
Team Size         Small        Any
Compliance        Standard     Strict

What's Next?

In the final chapter, we'll explore Float16's specific offerings for both serverless and dedicated GPU access, and how to get started with each.