Chapter 5: Float16 Options

Float16 GPU Options

Explore Float16's serverless and dedicated GPU offerings, from AI-as-a-Service to full infrastructure control, and find the right option for your workloads.


Float16 provides a full spectrum of GPU access options, from completely serverless to fully dedicated infrastructure. Let's explore each layer.

Float16's Three-Layer Approach

┌─────────────────────────────────────────────────────────┐
│                    Float16 Platform                      │
│                                                         │
│  ┌─────────────────────────────────────────────────┐   │
│  │                    AaaS                          │   │
│  │            AI-as-a-Service                       │   │
│  │     (Serverless - API Only)                     │   │
│  └─────────────────────────────────────────────────┘   │
│                         ↓                               │
│  ┌─────────────────────────────────────────────────┐   │
│  │                    PaaS                          │   │
│  │          Platform-as-a-Service                   │   │
│  │     (Managed Containers - Deploy Code)          │   │
│  └─────────────────────────────────────────────────┘   │
│                         ↓                               │
│  ┌─────────────────────────────────────────────────┐   │
│  │                    IaaS                          │   │
│  │       Infrastructure-as-a-Service               │   │
│  │     (Dedicated VMs - Full Control)              │   │
│  └─────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘

Layer 1: AaaS (AI-as-a-Service)

The fully serverless option: pure API access to AI models, with no infrastructure to manage.

What You Get

┌─────────────────────────────────────────────────────────┐
│                    AaaS Features                         │
│                                                         │
│  ✓ Pre-deployed popular models                         │
│  ✓ REST API access                                     │
│  ✓ Pay-per-request pricing                             │
│  ✓ Auto-scaling (unlimited)                            │
│  ✓ Zero infrastructure management                      │
│  ✓ Web dashboard                                       │
│                                                         │
│  Available Models:                                      │
│  • LLMs (Llama, Qwen, Typhoon, etc.)                   │
│  • Image Generation (Stable Diffusion, FLUX)           │
│  • Speech (Whisper, TTS)                               │
│  • Embeddings                                          │
└─────────────────────────────────────────────────────────┘

Usage Example

import requests

# Simple API call - no GPU management
response = requests.post(
    "https://api.float16.cloud/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "llama-3-70b",
        "messages": [
            {"role": "user", "content": "Hello!"}
        ]
    }
)
response.raise_for_status()  # surface HTTP errors instead of failing on parse

print(response.json()["choices"][0]["message"]["content"])

AaaS Pricing

┌─────────────────────────────────────────────────────────┐
│              AaaS Pricing Examples                       │
│                                                         │
│  LLM Inference:                                         │
│  • Input:  $0.50 per 1M tokens                         │
│  • Output: $1.00 per 1M tokens                         │
│                                                         │
│  Image Generation:                                      │
│  • $0.02 per image (standard)                          │
│  • $0.05 per image (high-res)                          │
│                                                         │
│  Speech-to-Text:                                        │
│  • $0.006 per minute of audio                          │
└─────────────────────────────────────────────────────────┘
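To get a feel for these rates, a quick back-of-envelope estimate using the example prices above (a sketch only; actual billing may differ):

```python
def llm_cost_usd(input_tokens: int, output_tokens: int,
                 input_rate: float = 0.50, output_rate: float = 1.00) -> float:
    """Estimate LLM inference cost from per-1M-token example rates."""
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate

# 10,000 requests/day, ~500 input and ~200 output tokens each, over 30 days
monthly = llm_cost_usd(10_000 * 500 * 30, 10_000 * 200 * 30)
print(f"${monthly:.2f}/month")  # → $135.00/month
```

At low volume this is far cheaper than keeping any GPU warm, which is why AaaS is the usual starting point.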

Best For

  • Quick prototypes
  • Variable/unpredictable traffic
  • Teams without ML infrastructure expertise
  • Applications using standard models
  • Cost-effective low-volume usage

Layer 2: PaaS (Platform-as-a-Service)

Deploy your own code on managed GPU infrastructure.

What You Get

┌─────────────────────────────────────────────────────────┐
│                    PaaS Features                         │
│                                                         │
│  ✓ Deploy custom models                                │
│  ✓ Upload your code                                    │
│  ✓ Managed containers                                  │
│  ✓ Auto-scaling                                        │
│  ✓ GPU resource allocation                             │
│  ✓ Jupyter notebooks                                   │
│  ✓ Model versioning                                    │
│                                                         │
│  You Provide:                                          │
│  • Your model files                                    │
│  • Inference code                                      │
│  • Requirements                                        │
│                                                         │
│  We Handle:                                            │
│  • Container orchestration                             │
│  • GPU allocation                                      │
│  • Scaling                                             │
│  • Health checks                                       │
└─────────────────────────────────────────────────────────┘

Deployment Example

# float16.yaml - Deployment configuration
name: my-custom-model
runtime: python3.11
gpu: A10

requirements:
  - torch==2.1.0
  - transformers==4.35.0

handler: inference.predict

scaling:
  min_instances: 1
  max_instances: 10
  target_gpu_utilization: 70%

# inference.py - Your custom inference code
from transformers import AutoModelForCausalLM, AutoTokenizer

model = None
tokenizer = None

def load_model():
    # Called once at container startup
    global model, tokenizer
    tokenizer = AutoTokenizer.from_pretrained("./my-model")
    model = AutoModelForCausalLM.from_pretrained("./my-model")

def predict(request):
    # Dict in, dict out - the handler contract named in float16.yaml
    inputs = tokenizer(request["text"], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return {"output": tokenizer.decode(output_ids[0], skip_special_tokens=True)}
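Before deploying, it helps to exercise the handler contract locally. A minimal sketch with a stubbed model, so request handling can be tested without a GPU or model weights (the stub and its echo behavior are illustrative, not part of the platform):

```python
# Stub implementing the same dict-in/dict-out contract as inference.predict.
def predict(request):
    if "text" not in request:
        return {"error": "missing 'text' field"}
    # A real handler would run model inference here; the stub just echoes.
    return {"output": f"echo: {request['text']}"}

print(predict({"text": "hello"}))  # → {'output': 'echo: hello'}
print(predict({}))                 # → {'error': "missing 'text' field"}
```

Catching malformed requests locally is much faster than debugging them through deployed container logs.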

PaaS Pricing

┌─────────────────────────────────────────────────────────┐
│              PaaS Pricing Examples                       │
│                                                         │
│  Serverless GPU (pay per second):                       │
│  • T4:  $0.0002/second ($0.72/hour)                    │
│  • A10: $0.0005/second ($1.80/hour)                    │
│  • A100: $0.0012/second ($4.32/hour)                   │
│                                                         │
│  Min instances (always-on):                             │
│  • Charged at hourly rate                              │
│  • Reduced cold starts                                 │
│                                                         │
│  Storage:                                              │
│  • $0.10/GB/month                                      │
└─────────────────────────────────────────────────────────┘
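With per-second billing, comparing bursty traffic against an always-on min instance is simple arithmetic. A rough sketch using the example A10 rates above:

```python
A10_PER_SECOND = 0.0005  # $/s from the example table ($1.80/hour)

def serverless_monthly(requests_per_day: int, seconds_per_request: float,
                       rate_per_second: float = A10_PER_SECOND) -> float:
    """Pay-per-second cost for a month (30 days) of traffic."""
    return requests_per_day * seconds_per_request * rate_per_second * 30

def always_on_monthly(hourly_rate: float = 1.80) -> float:
    """Cost of one min_instance kept warm for a 30-day month."""
    return hourly_rate * 24 * 30

# 5,000 requests/day at ~2s each on an A10
print(f"serverless: ${serverless_monthly(5_000, 2.0):.2f}")  # → serverless: $150.00
print(f"always-on:  ${always_on_monthly():.2f}")             # → always-on:  $1296.00
```

The gap narrows as traffic grows, which is when raising min_instances (or moving to reserved capacity) starts to pay off.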

Best For

  • Custom models
  • Fine-tuned models
  • Specific inference requirements
  • Development teams with ML expertise
  • Balance of control and convenience

Layer 3: IaaS (Infrastructure-as-a-Service)

Full control over dedicated GPU instances.

What You Get

┌─────────────────────────────────────────────────────────┐
│                    IaaS Features                         │
│                                                         │
│  ✓ Dedicated GPU instances                             │
│  ✓ Full SSH access                                     │
│  ✓ Root privileges                                     │
│  ✓ Install any software                                │
│  ✓ Persistent storage                                  │
│  ✓ Private networking                                  │
│  ✓ Snapshot/backup                                     │
│                                                         │
│  Instance Types:                                        │
│  • Single GPU (T4, A10, A100, H100)                   │
│  • Multi-GPU (up to 8x per instance)                  │
│  • CPU + GPU combinations                              │
│                                                         │
│  You Control:                                          │
│  • Operating system                                    │
│  • CUDA version                                        │
│  • All software                                        │
│  • Network configuration                               │
└─────────────────────────────────────────────────────────┘

Usage Example

# Create instance via CLI
float16 instance create \
  --name my-training-server \
  --gpu-type A100-80GB \
  --gpu-count 4 \
  --cpu 64 \
  --memory 512GB \
  --storage 2TB

# SSH access
float16 ssh my-training-server

# Inside the instance - full control
nvidia-smi
pip install torch
python train.py

IaaS Pricing

┌─────────────────────────────────────────────────────────┐
│              IaaS Pricing Examples                       │
│                                                         │
│  On-Demand (hourly):                                    │
│  • T4:      $0.50/hour                                 │
│  • A10:     $1.20/hour                                 │
│  • A100-40GB: $2.50/hour                               │
│  • A100-80GB: $3.50/hour                               │
│  • H100:    $5.00/hour                                 │
│                                                         │
│  Reserved (monthly commitment):                         │
│  • 20-40% discount                                     │
│                                                         │
│  Spot (interruptible):                                 │
│  • 50-70% discount                                     │
└─────────────────────────────────────────────────────────┘
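The ">50% utilization" rule of thumb falls out of these numbers: a dedicated GPU wins once busy time exceeds the ratio of the dedicated rate to the pay-per-second rate. A sketch using the example prices (A100-40GB dedicated vs. the serverless A100 rate from the PaaS table):

```python
def break_even_utilization(dedicated_hourly: float, serverless_hourly: float) -> float:
    """Fraction of each hour a GPU must be busy before dedicated is cheaper."""
    return dedicated_hourly / serverless_hourly

# $2.50/hr dedicated vs $4.32/hr serverless ($0.0012/s)
u = break_even_utilization(2.50, 4.32)
print(f"break-even at {u:.0%} utilization")  # → break-even at 58% utilization
```

Below that utilization, per-second billing is cheaper; above it, an on-demand (or reserved) instance is.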

Best For

  • Training workloads
  • Multi-GPU requirements
  • Custom environments
  • Compliance requirements
  • High utilization (>50%)
  • Full control needed

Comparison Table

Feature         AaaS          PaaS          IaaS
Control         None          Medium        Full
Setup Time      Minutes       Hours         Hours-Days
Scaling         Automatic     Automatic     Manual
Custom Models   No            Yes           Yes
Training        No            Limited       Yes
Cold Starts     Possible      Configurable  None
Min Cost        Pay-per-use   Pay-per-use   Hourly
Best For        API users     Developers    ML Engineers

Migration Path

Start simple, scale up as needed:

┌─────────────────────────────────────────────────────────┐
│              Typical Growth Journey                      │
│                                                         │
│  Stage 1: Prototype                                     │
│  └── AaaS: Test idea with API calls                    │
│                                                         │
│  Stage 2: Custom Model                                  │
│  └── PaaS: Deploy fine-tuned model                     │
│                                                         │
│  Stage 3: Scale                                         │
│  └── PaaS + Reserved: Predictable high volume          │
│                                                         │
│  Stage 4: Advanced                                      │
│  └── IaaS: Training, multi-GPU, custom infra           │
│                                                         │
│  Stage 5: Enterprise                                    │
│  └── IaaS + PaaS: Hybrid for different workloads      │
└─────────────────────────────────────────────────────────┘

Getting Started

Quick Start: AaaS

# 1. Sign up at float16.cloud
# 2. Get API key from dashboard
# 3. Make your first call

curl https://api.float16.cloud/v1/chat/completions \
  -H "Authorization: Bearer $FLOAT16_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Quick Start: PaaS

# 1. Install CLI
pip install float16-cli

# 2. Login
float16 login

# 3. Deploy
float16 deploy ./my-model --gpu A10

Quick Start: IaaS

# 1. Install CLI
pip install float16-cli

# 2. Login
float16 login

# 3. Create instance
float16 instance create --gpu A100

# 4. Connect
float16 ssh my-instance

Conclusion

┌─────────────────────────────────────────────────────────┐
│              Choose Your Path                            │
│                                                         │
│  "I just want to use AI"                               │
│  → AaaS (API access)                                   │
│                                                         │
│  "I have a custom model to deploy"                     │
│  → PaaS (managed containers)                           │
│                                                         │
│  "I need full control for training"                    │
│  → IaaS (dedicated instances)                          │
│                                                         │
│  "I need all of the above"                             │
│  → Use all three layers for different workloads        │
└─────────────────────────────────────────────────────────┘

Congratulations!

You've completed the Serverless GPU course! You now understand:

  • The difference between serverless and dedicated GPU
  • When to use each approach
  • Cost analysis and decision frameworks
  • Float16's options for every use case

Ready to get started? Visit float16.cloud to create your account and start deploying GPU workloads today.
