Float16 GPU Options
Float16 provides a full spectrum of GPU access options, from completely serverless to fully dedicated infrastructure. Let's explore each layer.
Float16's Three-Layer Approach
┌─────────────────────────────────────────────────────────┐
│ Float16 Platform │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ AaaS │ │
│ │ AI-as-a-Service │ │
│ │ (Serverless - API Only) │ │
│ └─────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ PaaS │ │
│ │ Platform-as-a-Service │ │
│ │ (Managed Containers - Deploy Code) │ │
│ └─────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ IaaS │ │
│ │ Infrastructure-as-a-Service │ │
│ │ (Dedicated VMs - Full Control) │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Layer 1: AaaS (AI-as-a-Service)
The fully serverless option: pure API access to pre-deployed AI models.
What You Get
┌─────────────────────────────────────────────────────────┐
│ AaaS Features │
│ │
│ ✓ Pre-deployed popular models │
│ ✓ REST API access │
│ ✓ Pay-per-request pricing │
│ ✓ Auto-scaling (unlimited) │
│ ✓ Zero infrastructure management │
│ ✓ Web dashboard │
│ │
│ Available Models: │
│ • LLMs (Llama, Qwen, Typhoon, etc.) │
│ • Image Generation (Stable Diffusion, FLUX) │
│ • Speech (Whisper, TTS) │
│ • Embeddings │
└─────────────────────────────────────────────────────────┘
Usage Example
import requests

# Simple API call - no GPU management
response = requests.post(
    "https://api.float16.cloud/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "llama-3-70b",
        "messages": [
            {"role": "user", "content": "Hello!"}
        ],
    },
)

print(response.json()["choices"][0]["message"]["content"])
AaaS Pricing
┌─────────────────────────────────────────────────────────┐
│ AaaS Pricing Examples │
│ │
│ LLM Inference: │
│ • Input: $0.50 per 1M tokens │
│ • Output: $1.00 per 1M tokens │
│ │
│ Image Generation: │
│ • $0.02 per image (standard) │
│ • $0.05 per image (high-res) │
│ │
│ Speech-to-Text: │
│ • $0.006 per minute of audio │
└─────────────────────────────────────────────────────────┘
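As a quick sanity check, the per-token LLM rates above translate into a monthly bill like this (the traffic volumes in the example are hypothetical):

```python
# Estimate monthly AaaS LLM cost from the per-token rates listed above.
INPUT_RATE = 0.50 / 1_000_000   # $ per input token
OUTPUT_RATE = 1.00 / 1_000_000  # $ per output token

def monthly_llm_cost(input_tokens: int, output_tokens: int) -> float:
    """Return estimated monthly cost in dollars."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical workload: 10M input + 5M output tokens per month.
print(f"${monthly_llm_cost(10_000_000, 5_000_000):.2f}")  # → $10.00
```

At this kind of volume, pay-per-request is far cheaper than keeping any GPU warm, which is why AaaS is the natural starting point.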
Best For
- Quick prototypes
- Variable/unpredictable traffic
- Teams without ML infrastructure expertise
- Applications using standard models
- Cost-effective low-volume usage
Layer 2: PaaS (Platform-as-a-Service)
Deploy your own code on managed GPU infrastructure.
What You Get
┌─────────────────────────────────────────────────────────┐
│ PaaS Features │
│ │
│ ✓ Deploy custom models │
│ ✓ Upload your code │
│ ✓ Managed containers │
│ ✓ Auto-scaling │
│ ✓ GPU resource allocation │
│ ✓ Jupyter notebooks │
│ ✓ Model versioning │
│ │
│ You Provide: │
│ • Your model files │
│ • Inference code │
│ • Requirements │
│ │
│ We Handle: │
│ • Container orchestration │
│ • GPU allocation │
│ • Scaling │
│ • Health checks │
└─────────────────────────────────────────────────────────┘
Deployment Example
# float16.yaml - Deployment configuration
name: my-custom-model
runtime: python3.11
gpu: A10
requirements:
  - torch==2.1.0
  - transformers==4.35.0
handler: inference.predict
scaling:
  min_instances: 1
  max_instances: 10
  target_gpu_utilization: 70%

# inference.py - Your custom inference code
from transformers import pipeline

model = None

def load_model():
    global model
    # A text-generation pipeline handles tokenization as well as generation;
    # a bare AutoModel has no text-in/text-out interface.
    model = pipeline("text-generation", model="./my-model")

def predict(request):
    input_text = request["text"]
    result = model(input_text)[0]["generated_text"]
    return {"output": result}
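Before deploying, you can exercise the handler contract locally. The sketch below swaps the real model for a stub so it runs without a GPU or model files; the request/response shapes mirror the example above, and `StubModel` is purely illustrative:

```python
# Minimal local smoke test for the predict() handler contract.
# StubModel stands in for the real transformers model (hypothetical).
class StubModel:
    def generate(self, text: str) -> str:
        return f"echo: {text}"

model = None

def load_model():
    global model
    model = StubModel()  # real code would load from "./my-model"

def predict(request: dict) -> dict:
    input_text = request["text"]
    return {"output": model.generate(input_text)}

load_model()
print(predict({"text": "hello"}))  # → {'output': 'echo: hello'}
```

Catching a malformed request or response shape locally is much faster than waiting for a container build to fail in the platform.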
PaaS Pricing
┌─────────────────────────────────────────────────────────┐
│ PaaS Pricing Examples │
│ │
│ Serverless GPU (pay per second): │
│ • T4: $0.0002/second ($0.72/hour) │
│ • A10: $0.0005/second ($1.80/hour) │
│ • A100: $0.0012/second ($4.32/hour) │
│ │
│ Min instances (always-on): │
│ • Charged at hourly rate │
│ • Reduced cold starts │
│ │
│ Storage: │
│ • $0.10/GB/month │
└─────────────────────────────────────────────────────────┘
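The per-second billing is what makes the scale-to-zero mode attractive for bursty traffic. A rough comparison for an A10 at the $0.0005/second rate above (the busy-hours figure is hypothetical, and cold starts and storage are ignored):

```python
# Compare scale-to-zero vs always-on monthly cost for an A10,
# using the $0.0005/second rate listed above. Simple 30-day month.
A10_PER_SECOND = 0.0005

def scale_to_zero_cost(busy_hours_per_day: float) -> float:
    """Pay only for seconds the GPU is actually busy."""
    return busy_hours_per_day * 3600 * A10_PER_SECOND * 30

def always_on_cost() -> float:
    """min_instances=1 keeps the GPU billed around the clock."""
    return 24 * 30 * 3600 * A10_PER_SECOND  # == $1.80/hour

# Hypothetical: GPU busy 2 hours/day.
print(f"${scale_to_zero_cost(2):.2f}")  # → $108.00
print(f"${always_on_cost():.2f}")       # → $1296.00
```

At low duty cycles the gap is an order of magnitude, which is the core serverless-GPU argument.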
Best For
- Custom models
- Fine-tuned models
- Specific inference requirements
- Development teams with ML expertise
- Balance of control and convenience
Layer 3: IaaS (Infrastructure-as-a-Service)
Full control over dedicated GPU instances.
What You Get
┌─────────────────────────────────────────────────────────┐
│ IaaS Features │
│ │
│ ✓ Dedicated GPU instances │
│ ✓ Full SSH access │
│ ✓ Root privileges │
│ ✓ Install any software │
│ ✓ Persistent storage │
│ ✓ Private networking │
│ ✓ Snapshot/backup │
│ │
│ Instance Types: │
│ • Single GPU (T4, A10, A100, H100) │
│ • Multi-GPU (up to 8x per instance) │
│ • CPU + GPU combinations │
│ │
│ You Control: │
│ • Operating system │
│ • CUDA version │
│ • All software │
│ • Network configuration │
└─────────────────────────────────────────────────────────┘
Usage Example
# Create instance via CLI
float16 instance create \
  --name my-training-server \
  --gpu-type A100-80GB \
  --gpu-count 4 \
  --cpu 64 \
  --memory 512GB \
  --storage 2TB

# SSH access
float16 ssh my-training-server

# Inside the instance - full control
nvidia-smi
pip install torch
python train.py
IaaS Pricing
┌─────────────────────────────────────────────────────────┐
│ IaaS Pricing Examples │
│ │
│ On-Demand (hourly): │
│ • T4: $0.50/hour │
│ • A10: $1.20/hour │
│ • A100-40GB: $2.50/hour │
│ • A100-80GB: $3.50/hour │
│ • H100: $5.00/hour │
│ │
│ Reserved (monthly commitment): │
│ • 20-40% discount │
│ │
│ Spot (interruptible): │
│ • 50-70% discount │
└─────────────────────────────────────────────────────────┘
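A back-of-the-envelope monthly comparison of the three purchase options for an A100-80GB, using the on-demand rate above. The reserved and spot figures use midpoints of the quoted discount ranges; they are illustrative, not published prices:

```python
# Monthly cost sketch for an A100-80GB (30-day month) at the rates above.
A100_80_HOURLY = 3.50
HOURS = 24 * 30

on_demand = A100_80_HOURLY * HOURS          # full hourly rate
reserved = on_demand * (1 - 0.30)           # midpoint of 20-40% discount
spot = on_demand * (1 - 0.60)               # midpoint of 50-70% discount

print(f"on-demand: ${on_demand:.0f}")  # → on-demand: $2520
print(f"reserved:  ${reserved:.0f}")   # → reserved:  $1764
print(f"spot:      ${spot:.0f}")       # → spot:      $1008
```

Spot pricing only pays off if your training jobs checkpoint frequently enough to tolerate interruption.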
Best For
- Training workloads
- Multi-GPU requirements
- Custom environments
- Compliance requirements
- High utilization (>50%)
- Full control needed
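The ">50% utilization" rule of thumb above can be made concrete. Comparing the serverless A10 rate from the PaaS table ($1.80 per busy hour) against a dedicated A10 from the IaaS table ($1.20/hour, billed around the clock), and ignoring cold starts, storage, and reserved discounts:

```python
# Rough break-even utilization: serverless vs dedicated A10,
# using the rates quoted in the PaaS and IaaS tables above.
SERVERLESS_HOURLY = 1.80  # billed only while busy
DEDICATED_HOURLY = 1.20   # billed 24/7

break_even = DEDICATED_HOURLY / SERVERLESS_HOURLY
print(f"{break_even:.0%}")  # → 67%
```

Below roughly two-thirds utilization the serverless option wins on this crude model; above it, dedicated hardware is cheaper, which is consistent with the guidance above.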
Comparison Table
| Feature | AaaS | PaaS | IaaS |
|---|---|---|---|
| Control | None | Medium | Full |
| Setup Time | Minutes | Hours | Hours-Days |
| Scaling | Automatic | Automatic | Manual |
| Custom Models | No | Yes | Yes |
| Training | No | Limited | Yes |
| Cold Starts | Possible | Configurable | None |
| Min Cost | Pay-per-use | Pay-per-use | Hourly |
| Best For | API users | Developers | ML Engineers |
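The table collapses into a simple decision rule. A toy sketch of that rule (a hypothetical helper, not part of any Float16 SDK):

```python
# Toy decision helper mirroring the comparison table above.
def choose_layer(custom_model: bool, training: bool, full_control: bool) -> str:
    """Pick the lowest-maintenance layer that satisfies the requirements."""
    if training or full_control:
        return "IaaS"   # only dedicated instances support full training
    if custom_model:
        return "PaaS"   # managed containers run your own model code
    return "AaaS"       # standard models via API need nothing more

print(choose_layer(custom_model=False, training=False, full_control=False))  # → AaaS
print(choose_layer(custom_model=True, training=False, full_control=False))   # → PaaS
print(choose_layer(custom_model=True, training=True, full_control=False))    # → IaaS
```

The ordering matters: requirements that force IaaS dominate, and AaaS is the default when nothing forces you up the stack.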
Migration Path
Start simple, scale up as needed:
┌─────────────────────────────────────────────────────────┐
│ Typical Growth Journey │
│ │
│ Stage 1: Prototype │
│ └── AaaS: Test idea with API calls │
│ │
│ Stage 2: Custom Model │
│ └── PaaS: Deploy fine-tuned model │
│ │
│ Stage 3: Scale │
│ └── PaaS + Reserved: Predictable high volume │
│ │
│ Stage 4: Advanced │
│ └── IaaS: Training, multi-GPU, custom infra │
│ │
│ Stage 5: Enterprise │
│ └── IaaS + PaaS: Hybrid for different workloads │
└─────────────────────────────────────────────────────────┘
Getting Started
Quick Start: AaaS
# 1. Sign up at float16.cloud
# 2. Get API key from dashboard
# 3. Make your first call
curl https://api.float16.cloud/v1/chat/completions \
  -H "Authorization: Bearer $FLOAT16_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
Quick Start: PaaS
# 1. Install CLI
pip install float16-cli
# 2. Login
float16 login
# 3. Deploy
float16 deploy ./my-model --gpu A10
Quick Start: IaaS
# 1. Install CLI
pip install float16-cli
# 2. Login
float16 login
# 3. Create instance
float16 instance create --name my-instance --gpu A100
# 4. Connect
float16 ssh my-instance
Conclusion
┌─────────────────────────────────────────────────────────┐
│ Choose Your Path │
│ │
│ "I just want to use AI" │
│ → AaaS (API access) │
│ │
│ "I have a custom model to deploy" │
│ → PaaS (managed containers) │
│ │
│ "I need full control for training" │
│ → IaaS (dedicated instances) │
│ │
│ "I need all of the above" │
│ → Use all three layers for different workloads │
└─────────────────────────────────────────────────────────┘
Congratulations!
You've completed the Serverless GPU course! You now understand:
- The difference between serverless and dedicated GPU
- When to use each approach
- Cost analysis and decision frameworks
- Float16's options for every use case
Ready to get started? Visit float16.cloud to create your account and start deploying GPU workloads today.