Float16 GPU Options
Float16 provides a full spectrum of GPU access options, from completely serverless to fully dedicated infrastructure. Let's explore each layer.
Float16's Three-Layer Approach
┌─────────────────────────────────────────────────────────┐
│ Float16 Platform │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ AaaS │ │
│ │ AI-as-a-Service │ │
│ │ (Serverless - API Only) │ │
│ └─────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ PaaS │ │
│ │ Platform-as-a-Service │ │
│ │ (Managed Containers - Deploy Code) │ │
│ └─────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ IaaS │ │
│ │ Infrastructure-as-a-Service │ │
│ │ (Dedicated VMs - Full Control) │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Layer 1: AaaS (AI-as-a-Service)
The fully serverless option: pure API access to pre-deployed AI models.
What You Get
┌─────────────────────────────────────────────────────────┐
│ AaaS Features │
│ │
│ ✓ Pre-deployed popular models │
│ ✓ REST API access │
│ ✓ Pay-per-request pricing │
│ ✓ Auto-scaling (unlimited) │
│ ✓ Zero infrastructure management │
│ ✓ Web dashboard │
│ │
│ Available Models: │
│ • LLMs (Llama, Qwen, Typhoon, etc.) │
│ • Image Generation (Stable Diffusion, FLUX) │
│ • Speech (Whisper, TTS) │
│ • Embeddings │
└─────────────────────────────────────────────────────────┘
Usage Example
import requests

# Simple API call - no GPU management
response = requests.post(
    "https://api.float16.cloud/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "llama-3-70b",
        "messages": [
            {"role": "user", "content": "Hello!"}
        ],
    },
)

print(response.json()["choices"][0]["message"]["content"])
AaaS Pricing
┌─────────────────────────────────────────────────────────┐
│ AaaS Pricing Examples │
│ │
│ LLM Inference: │
│ • Input: $0.50 per 1M tokens │
│ • Output: $1.00 per 1M tokens │
│ │
│ Image Generation: │
│ • $0.02 per image (standard) │
│ • $0.05 per image (high-res) │
│ │
│ Speech-to-Text: │
│ • $0.006 per minute of audio │
└─────────────────────────────────────────────────────────┘
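As a quick sanity check, the per-token LLM rates above translate into a monthly bill like this (the traffic volumes in the example are hypothetical):

```python
# Estimate monthly AaaS LLM cost from the per-token rates listed above.
INPUT_RATE = 0.50 / 1_000_000   # $ per input token
OUTPUT_RATE = 1.00 / 1_000_000  # $ per output token

def monthly_llm_cost(input_tokens: int, output_tokens: int) -> float:
    """Return estimated monthly cost in dollars."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical workload: 10M input + 5M output tokens per month.
print(f"${monthly_llm_cost(10_000_000, 5_000_000):.2f}")  # → $10.00
```

At this kind of volume, pay-per-request is far cheaper than keeping any GPU warm, which is why AaaS is the natural starting point.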
Best For
- Quick prototypes
- Variable/unpredictable traffic
- Teams without ML infrastructure expertise
- Applications using standard models
- Cost-effective low-volume usage
Layer 2: PaaS (Platform-as-a-Service)
Deploy your own code on managed GPU infrastructure.
What You Get
┌─────────────────────────────────────────────────────────┐
│ PaaS Features │
│ │
│ ✓ Deploy custom models │
│ ✓ Upload your code │
│ ✓ Managed containers │
│ ✓ Auto-scaling │
│ ✓ GPU resource allocation │
│ ✓ Jupyter notebooks │
│ ✓ Model versioning │
│ │
│ You Provide: │
│ • Your model files │
│ • Inference code │
│ • Requirements │
│ │
│ We Handle: │
│ • Container orchestration │
│ • GPU allocation │
│ • Scaling │
│ • Health checks │
└─────────────────────────────────────────────────────────┘
Deployment Example
# float16.yaml - Deployment configuration
name: my-custom-model
runtime: python3.11
gpu: A10
requirements:
  - torch==2.1.0
  - transformers==4.35.0
handler: inference.predict
scaling:
  min_instances: 1
  max_instances: 10
  target_gpu_utilization: 70%

# inference.py - Your custom inference code
from transformers import pipeline

model = None

def load_model():
    global model
    # A text-generation pipeline handles tokenization as well as generation;
    # a bare AutoModel has no text-in/text-out interface.
    model = pipeline("text-generation", model="./my-model")

def predict(request):
    input_text = request["text"]
    result = model(input_text)[0]["generated_text"]
    return {"output": result}
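Before deploying, you can exercise the handler contract locally. The sketch below swaps the real model for a stub so it runs without a GPU or model files; the request/response shapes mirror the example above, and `StubModel` is purely illustrative:

```python
# Minimal local smoke test for the predict() handler contract.
# StubModel stands in for the real transformers model (hypothetical).
class StubModel:
    def generate(self, text: str) -> str:
        return f"echo: {text}"

model = None

def load_model():
    global model
    model = StubModel()  # real code would load from "./my-model"

def predict(request: dict) -> dict:
    input_text = request["text"]
    return {"output": model.generate(input_text)}

load_model()
print(predict({"text": "hello"}))  # → {'output': 'echo: hello'}
```

Catching a malformed request or response shape locally is much faster than waiting for a container build to fail in the platform.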
PaaS Pricing
┌─────────────────────────────────────────────────────────┐
│ PaaS Pricing Examples │
│ │
│ Serverless GPU (pay per second): │
│ • T4: $0.0002/second ($0.72/hour) │
│ • A10: $0.0005/second ($1.80/hour) │
│ • A100: $0.0012/second ($4.32/hour) │
│ │
│ Min instances (always-on): │
│ • Charged at hourly rate │
│ • Reduced cold starts │
│ │
│ Storage: │
│ • $0.10/GB/month │
└─────────────────────────────────────────────────────────┘
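The per-second billing is what makes the scale-to-zero mode attractive for bursty traffic. A rough comparison for an A10 at the $0.0005/second rate above (the busy-hours figure is hypothetical, and cold starts and storage are ignored):

```python
# Compare scale-to-zero vs always-on monthly cost for an A10,
# using the $0.0005/second rate listed above. Simple 30-day month.
A10_PER_SECOND = 0.0005

def scale_to_zero_cost(busy_hours_per_day: float) -> float:
    """Pay only for seconds the GPU is actually busy."""
    return busy_hours_per_day * 3600 * A10_PER_SECOND * 30

def always_on_cost() -> float:
    """min_instances=1 keeps the GPU billed around the clock."""
    return 24 * 30 * 3600 * A10_PER_SECOND  # == $1.80/hour

# Hypothetical: GPU busy 2 hours/day.
print(f"${scale_to_zero_cost(2):.2f}")  # → $108.00
print(f"${always_on_cost():.2f}")       # → $1296.00
```

At low duty cycles the gap is an order of magnitude, which is the core serverless-GPU argument.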
Best For
- Custom models
- Fine-tuned models
- Specific inference requirements
- Development teams with ML expertise
- Balance of control and convenience
Layer 3: IaaS (Infrastructure-as-a-Service)
Full control over dedicated GPU instances.
What You Get
┌─────────────────────────────────────────────────────────┐
│ IaaS Features │
│ │
│ ✓ Dedicated GPU instances │
│ ✓ Full SSH access │
│ ✓ Root privileges │
│ ✓ Install any software │
│ ✓ Persistent storage │
│ ✓ Private networking │
│ ✓ Snapshot/backup │
│ │
│ Instance Types: │
│ • Single GPU (T4, A10, A100, H100) │
│ • Multi-GPU (up to 8x per instance) │
│ • CPU + GPU combinations │
│ │
│ You Control: │
│ • Operating system │
│ • CUDA version │
│ • All software │
│ • Network configuration │
└─────────────────────────────────────────────────────────┘
Usage Example
# Create instance via CLI
float16 instance create \
  --name my-training-server \
  --gpu-type A100-80GB \
  --gpu-count 4 \
  --cpu 64 \
  --memory 512GB \
  --storage 2TB

# SSH access
float16 ssh my-training-server

# Inside the instance - full control
nvidia-smi
pip install torch
python train.py
IaaS Pricing
┌─────────────────────────────────────────────────────────┐
│ IaaS Pricing Examples │
│ │
│ On-Demand (hourly): │
│ • T4: $0.50/hour │
│ • A10: $1.20/hour │
│ • A100-40GB: $2.50/hour │
│ • A100-80GB: $3.50/hour │
│ • H100: $5.00/hour │
│ │
│ Reserved (monthly commitment): │
│ • 20-40% discount │
│ │
│ Spot (interruptible): │
│ • 50-70% discount │
└─────────────────────────────────────────────────────────┘
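A back-of-the-envelope monthly comparison of the three purchase options for an A100-80GB, using the on-demand rate above. The reserved and spot figures use midpoints of the quoted discount ranges; they are illustrative, not published prices:

```python
# Monthly cost sketch for an A100-80GB (30-day month) at the rates above.
A100_80_HOURLY = 3.50
HOURS = 24 * 30

on_demand = A100_80_HOURLY * HOURS          # full hourly rate
reserved = on_demand * (1 - 0.30)           # midpoint of 20-40% discount
spot = on_demand * (1 - 0.60)               # midpoint of 50-70% discount

print(f"on-demand: ${on_demand:.0f}")  # → on-demand: $2520
print(f"reserved:  ${reserved:.0f}")   # → reserved:  $1764
print(f"spot:      ${spot:.0f}")       # → spot:      $1008
```

Spot pricing only pays off if your training jobs checkpoint frequently enough to tolerate interruption.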
Best For
- Training workloads
- Multi-GPU requirements
- Custom environments
- Compliance requirements
- High utilization (>50%)
- Full control needed
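The ">50% utilization" rule of thumb above can be made concrete. Comparing the serverless A10 rate from the PaaS table ($1.80 per busy hour) against a dedicated A10 from the IaaS table ($1.20/hour, billed around the clock), and ignoring cold starts, storage, and reserved discounts:

```python
# Rough break-even utilization: serverless vs dedicated A10,
# using the rates quoted in the PaaS and IaaS tables above.
SERVERLESS_HOURLY = 1.80  # billed only while busy
DEDICATED_HOURLY = 1.20   # billed 24/7

break_even = DEDICATED_HOURLY / SERVERLESS_HOURLY
print(f"{break_even:.0%}")  # → 67%
```

Below roughly two-thirds utilization the serverless option wins on this crude model; above it, dedicated hardware is cheaper, which is consistent with the guidance above.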
Comparison Table
| Feature | AaaS | PaaS | IaaS |
|---|---|---|---|
| Control | None | Medium | Full |
| Setup Time | Minutes | Hours | Hours-Days |
| Scaling | Automatic | Automatic | Manual |
| Custom Models | No | Yes | Yes |
| Training | No | Limited | Yes |
| Cold Starts | Possible | Configurable | None |
| Min Cost | Pay-per-use | Pay-per-use | Hourly |
| Best For | API users | Developers | ML Engineers |
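The table collapses into a simple decision rule. A toy sketch of that rule (a hypothetical helper, not part of any Float16 SDK):

```python
# Toy decision helper mirroring the comparison table above.
def choose_layer(custom_model: bool, training: bool, full_control: bool) -> str:
    """Pick the lowest-maintenance layer that satisfies the requirements."""
    if training or full_control:
        return "IaaS"   # only dedicated instances support full training
    if custom_model:
        return "PaaS"   # managed containers run your own model code
    return "AaaS"       # standard models via API need nothing more

print(choose_layer(custom_model=False, training=False, full_control=False))  # → AaaS
print(choose_layer(custom_model=True, training=False, full_control=False))   # → PaaS
print(choose_layer(custom_model=True, training=True, full_control=False))    # → IaaS
```

The ordering matters: requirements that force IaaS dominate, and AaaS is the default when nothing forces you up the stack.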
Migration Path
Start simple, scale up as needed:
┌─────────────────────────────────────────────────────────┐
│ Typical Growth Journey │
│ │
│ Stage 1: Prototype │
│ └── AaaS: Test idea with API calls │
│ │
│ Stage 2: Custom Model │
│ └── PaaS: Deploy fine-tuned model │
│ │
│ Stage 3: Scale │
│ └── PaaS + Reserved: Predictable high volume │
│ │
│ Stage 4: Advanced │
│ └── IaaS: Training, multi-GPU, custom infra │
│ │
│ Stage 5: Enterprise │
│ └── IaaS + PaaS: Hybrid for different workloads │
└─────────────────────────────────────────────────────────┘
Getting Started
Quick Start: AaaS
# 1. Sign up at float16.cloud
# 2. Get API key from dashboard
# 3. Make your first call
curl https://api.float16.cloud/v1/chat/completions \
  -H "Authorization: Bearer $FLOAT16_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
Quick Start: PaaS
# 1. Install CLI
pip install float16-cli
# 2. Login
float16 login
# 3. Deploy
float16 deploy ./my-model --gpu A10
Quick Start: IaaS
# 1. Install CLI
pip install float16-cli
# 2. Login
float16 login
# 3. Create instance
float16 instance create --name my-instance --gpu A100
# 4. Connect
float16 ssh my-instance
Conclusion
┌─────────────────────────────────────────────────────────┐
│ Choose Your Path │
│ │
│ "I just want to use AI" │
│ → AaaS (API access) │
│ │
│ "I have a custom model to deploy" │
│ → PaaS (managed containers) │
│ │
│ "I need full control for training" │
│ → IaaS (dedicated instances) │
│ │
│ "I need all of the above" │
│ → Use all three layers for different workloads │
└─────────────────────────────────────────────────────────┘
Congratulations!
You've completed the Serverless GPU course! You now understand:
- The difference between serverless and dedicated GPU
- When to use each approach
- Cost analysis and decision frameworks
- Float16's options for every use case
Ready to get started? Visit float16.cloud to create your account and start deploying GPU workloads today.