Module 2: Setting Up Your Environment
In this module, you'll set up everything you need to deploy LLMs on Float16.cloud. By the end, you'll have a working environment ready for deployment.
Creating Your Float16.cloud Account
Step 1: Sign Up
- Visit float16.cloud
- Click "Sign Up" in the top right
- Choose your authentication method:
  - Email/Password
  - GitHub OAuth
  - Google OAuth
Step 2: Verify Your Email
Check your inbox for a verification email and click the confirmation link.
Step 3: Complete Your Profile
Provide basic information:
- Organization name (optional)
- Use case description
- Estimated monthly usage
Installing the Float16 CLI
The Float16 CLI is your primary tool for managing deployments.
Installation
Using pip:
pip install float16-cli
Using conda:
conda install -c float16 float16-cli
Verify installation:
float16 --version
# Output: float16-cli version 1.2.3
Authentication
Log in to your account:
float16 login
This will:
- Open your browser
- Ask you to authorize the CLI
- Save credentials locally
Verify authentication:
float16 whoami
# Output: Logged in as: your-email@example.com
Choosing the Right GPU
Float16.cloud offers several GPU options. Choose based on your model size and performance requirements.
Available GPUs
| GPU Model | Memory | Use Case | Hourly Rate |
|---|---|---|---|
| RTX 4090 | 24 GB | Small models (< 7B) | $0.50 |
| A10G | 24 GB | Small-medium models | $0.80 |
| A100 (40GB) | 40 GB | Medium models (7B-13B) | $2.00 |
| A100 (80GB) | 80 GB | Large models (13B-30B) | $3.50 |
| H100 | 80 GB | Largest models, fastest | $5.00 |
GPU Selection Guide
Use this decision tree:
Model Size?
├─ < 7B parameters
│  └─ RTX 4090 or A10G
│
├─ 7B - 13B parameters
│  ├─ Need speed? → A100 40GB
│  └─ Budget conscious? → A10G (with quantization)
│
├─ 13B - 30B parameters
│  └─ A100 80GB
│
└─ > 30B parameters
   └─ Multiple A100 80GB or H100
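As a rough cross-check on the table and decision tree above, you can estimate a model's memory footprint from its parameter count. The sketch below uses a common rule of thumb (2 bytes per parameter in FP16 plus roughly 20% overhead for activations and the KV cache); the real footprint depends on context length, batch size, and framework, so treat it as an approximation rather than official Float16.cloud sizing guidance.

# vram_estimate.py - rough FP16 memory estimate (rule of thumb, not an official sizing tool)
def estimate_vram_gb(num_params_billions: float,
                     bytes_per_param: int = 2,
                     overhead: float = 0.2) -> float:
    """Estimate GPU memory (GB) for inference: weights plus ~20% for activations/KV cache."""
    weights_gb = num_params_billions * 1e9 * bytes_per_param / 1024**3
    return weights_gb * (1 + overhead)

if __name__ == "__main__":
    for size in (7, 13, 30, 70):
        print(f"{size}B params -> ~{estimate_vram_gb(size):.0f} GB in FP16")

For a 7B model this lands around 16 GB, which is why the 24 GB RTX 4090 and A10G are listed as sufficient, while a 13B model (about 29 GB) pushes you to an A100 40GB unless you quantize.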
Container Setup
Basic Docker Configuration
Create a Dockerfile for your model:
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
# Install Python and dependencies
RUN apt-get update && apt-get install -y \
python3.10 \
python3-pip \
&& rm -rf /var/lib/apt/lists/*
# Install PyTorch and transformers
RUN pip3 install torch transformers accelerate
# Copy and install the dependency list first so Docker can cache this layer
WORKDIR /app
COPY requirements.txt .
RUN pip3 install -r requirements.txt
# Copy your application
COPY app.py .
# Expose port
EXPOSE 8000
# Run the application
CMD ["python3", "app.py"]
Build and Test Locally
# Build the image
docker build -t my-llm-deployment .
# Test locally (requires GPU)
docker run --gpus all -p 8000:8000 my-llm-deployment
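Once the container is running, you can send it a quick request. The snippet below targets the hypothetical /generate endpoint from the app.py sketch above; substitute whatever route your own server exposes.

# smoke_test.py - quick request against the locally running container
# Assumes the /generate endpoint from the app.py sketch; adjust to your own API.
import json
import urllib.request

payload = json.dumps({"prompt": "Hello, my name is", "max_new_tokens": 30}).encode()
req = urllib.request.Request(
    "http://localhost:8000/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=120) as resp:
    print(json.loads(resp.read()))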
Creating Your First Project
Initialize a New Project
# Create project directory
mkdir my-llm-project
cd my-llm-project
# Initialize Float16 project
float16 init
This creates:
my-llm-project/
├── float16.yaml # Configuration file
├── Dockerfile # Container definition
├── app.py # Your application
└── requirements.txt # Python dependencies
Configure float16.yaml
Edit the configuration file:
name: my-llm-deployment
gpu: A100-40GB
replicas: 1

model:
  name: meta-llama/Llama-2-7b-hf
  framework: transformers

resources:
  memory: 32Gi
  cpu: 8

scaling:
  min_replicas: 1
  max_replicas: 5
  target_gpu_util: 80

environment:
  HF_TOKEN: ${HF_TOKEN}  # From environment variable
Testing Your Setup
Local Testing
Before deploying to Float16.cloud, test locally:
# test_local.py
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def test_model_loading():
    model_name = "meta-llama/Llama-2-7b-hf"

    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )

    print("Testing inference...")
    # Move inputs to the model's device before generating
    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=50)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Result: {result}")

    return True

if __name__ == "__main__":
    test_model_loading()
    print("✅ Setup successful!")
Run the test:
python test_local.py
Environment Variables
Set up required environment variables:
# ~/.bashrc or ~/.zshrc
export HF_TOKEN="your_huggingface_token"
export FLOAT16_API_KEY="your_float16_api_key"
export FLOAT16_PROJECT="my-llm-deployment"
Reload your shell:
source ~/.bashrc # or ~/.zshrc
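A missing variable often only surfaces as a confusing failure at deploy time, so a small pre-flight check can save a debugging round trip. This is just a convenience sketch; the variable names match the three exported above.

# check_env.py - fail fast if required environment variables are missing
import os
import sys

REQUIRED = ["HF_TOKEN", "FLOAT16_API_KEY", "FLOAT16_PROJECT"]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing environment variables: {', '.join(missing)}")
print("All required environment variables are set.")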
Troubleshooting
Common Issues
Issue: CUDA not found
# Check CUDA installation
nvidia-smi
# nvidia-smi ships with the NVIDIA driver; if the command is missing, reinstall
# the driver for your GPU. The CUDA toolkit is only needed for local compilation:
sudo apt install nvidia-cuda-toolkit
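It is also worth confirming that PyTorch itself can see the GPU, since a working nvidia-smi does not guarantee your Python environment has a CUDA-enabled PyTorch build:

# cuda_check.py - verify that PyTorch can see the GPU
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA version (PyTorch build):", torch.version.cuda)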
Issue: Out of memory
# Use smaller precision
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # or torch.bfloat16
    device_map="auto",
    low_cpu_mem_usage=True
)
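If FP16 still doesn't fit (for example a 13B model on a 24 GB A10G, the "with quantization" path in the GPU selection guide), 8-bit or 4-bit loading is a common fallback. The sketch below uses the standard transformers quantization API rather than anything Float16.cloud-specific, and it requires the bitsandbytes package to be installed.

# Quantized loading sketch - requires `pip install bitsandbytes`
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # or load_in_4bit=True
    device_map="auto",
)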
Issue: Float16 CLI not authenticated
# Re-authenticate
float16 logout
float16 login
Checklist
Before moving to the next module, ensure you have:
- ✅ Created a Float16.cloud account
- ✅ Installed and authenticated the Float16 CLI
- ✅ Selected appropriate GPU for your model
- ✅ Created and configured a project
- ✅ Tested model loading locally
- ✅ Set up environment variables
Next Steps
Now that your environment is ready, let's deploy your first model!
[Continue to Module 3: Model Deployment →]