Module 2: Setting Up Your Environment

Learn how to set up your development environment, create a Float16.cloud account, and configure your first GPU deployment.

In this module, you'll set up everything you need to deploy LLMs on Float16.cloud. By the end, you'll have a working environment ready for deployment.

Creating Your Float16.cloud Account

Step 1: Sign Up

  1. Visit float16.cloud
  2. Click "Sign Up" in the top right
  3. Choose your authentication method:
    • Email/Password
    • GitHub OAuth
    • Google OAuth

Step 2: Verify Your Email

Check your inbox for a verification email and click the confirmation link.

Step 3: Complete Your Profile

Provide basic information:

  • Organization name (optional)
  • Use case description
  • Estimated monthly usage

Installing the Float16 CLI

The Float16 CLI is your primary tool for managing deployments.

Installation

Using pip:

pip install float16-cli

Using conda:

conda install -c float16 float16-cli

Verify installation:

float16 --version
# Output: float16-cli version 1.2.3

Authentication

Log in to your account:

float16 login

This will:

  1. Open your browser
  2. Ask you to authorize the CLI
  3. Save credentials locally

Verify authentication:

float16 whoami
# Output: Logged in as: your-email@example.com

Choosing the Right GPU

Float16.cloud offers several GPU options. Choose based on your model size and performance requirements.

Available GPUs

GPU Model      Memory   Use Case                   Hourly Rate
RTX 4090       24 GB    Small models (< 7B)        $0.50
A10G           24 GB    Small-medium models        $0.80
A100 (40GB)    40 GB    Medium models (7B-13B)     $2.00
A100 (80GB)    80 GB    Large models (13B-30B)     $3.50
H100           80 GB    Largest models, fastest    $5.00

GPU Selection Guide

Use this decision tree:

Model Size?
├─ < 7B parameters
│  └─ RTX 4090 or A10G
│
├─ 7B - 13B parameters
│  ├─ Need speed? → A100 40GB
│  └─ Budget conscious? → A10G (with quantization)
│
├─ 13B - 30B parameters
│  └─ A100 80GB
│
└─ > 30B parameters
   └─ Multiple A100 80GB or H100
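
A useful rule of thumb: an fp16 model needs roughly 2 bytes per parameter for the weights alone (about 14 GB for a 7B model), plus headroom for activations and the KV cache. The helper below encodes the decision tree as Python; the thresholds and GPU names mirror the table above and are meant as a rough guide, not an official sizing tool.

# gpu_guide.py - rough sizing helper mirroring the decision tree above (illustrative only)
def fp16_weight_gb(params_billions: float) -> float:
    """Approximate weight footprint: ~2 bytes per parameter in fp16."""
    return params_billions * 2

def pick_gpu(params_billions: float, prioritize_speed: bool = False) -> str:
    """Map model size (billions of parameters) to a GPU tier from the table above."""
    if params_billions < 7:
        return "RTX 4090 or A10G"
    if params_billions <= 13:
        return "A100 (40GB)" if prioritize_speed else "A10G (with quantization)"
    if params_billions <= 30:
        return "A100 (80GB)"
    return "Multiple A100 (80GB) or H100"

if __name__ == "__main__":
    for size in (3, 7, 13, 30, 70):
        print(f"{size}B params: ~{fp16_weight_gb(size):.0f} GB of weights -> {pick_gpu(size)}")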

Container Setup

Basic Docker Configuration

Create a Dockerfile for your model:

FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Install Python and dependencies
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install PyTorch and transformers
RUN pip3 install torch transformers accelerate

# Copy your application
WORKDIR /app
COPY app.py .
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# Expose port
EXPOSE 8000

# Run the application
CMD ["python3", "app.py"]

Build and Test Locally

# Build the image
docker build -t my-llm-deployment .

# Test locally (requires GPU)
docker run --gpus all -p 8000:8000 my-llm-deployment
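
Once the container is up, you can send it a request from another terminal. The snippet below assumes the illustrative app.py sketched earlier (a /generate endpoint on port 8000) and the requests package; adjust it to match your actual application.

# smoke_test.py - send a request to the locally running container
# (assumes the illustrative /generate endpoint from the app.py sketch above)
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Hello, my name is", "max_new_tokens": 50},
    timeout=120,  # the first request may be slow while the model loads and warms up
)
resp.raise_for_status()
print(resp.json()["text"])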

Creating Your First Project

Initialize a New Project

# Create project directory
mkdir my-llm-project
cd my-llm-project

# Initialize Float16 project
float16 init

This creates:

my-llm-project/
├── float16.yaml        # Configuration file
├── Dockerfile          # Container definition
├── app.py              # Your application
└── requirements.txt    # Python dependencies

Configure float16.yaml

Edit the configuration file:

name: my-llm-deployment
gpu: A100-40GB
replicas: 1

model:
  name: meta-llama/Llama-2-7b-hf
  framework: transformers

resources:
  memory: 32Gi
  cpu: 8

scaling:
  min_replicas: 1
  max_replicas: 5
  target_gpu_util: 80

environment:
  HF_TOKEN: ${HF_TOKEN}  # From environment variable
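
Before deploying, it can be worth a quick sanity check that the file parses and contains the fields you expect. A minimal sketch using PyYAML (assuming the pyyaml package is installed; the field names mirror the example above):

# validate_config.py - quick sanity check of float16.yaml (assumes pyyaml is installed)
import yaml

with open("float16.yaml") as f:
    config = yaml.safe_load(f)

# Print a few key fields; a KeyError here means the file is missing something
print("GPU:", config["gpu"])
print("Model:", config["model"]["name"])
print("Replicas:", config["scaling"]["min_replicas"], "-", config["scaling"]["max_replicas"])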

Testing Your Setup

Local Testing

Before deploying to Float16.cloud, test locally:

# test_local.py
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def test_model_loading():
    model_name = "meta-llama/Llama-2-7b-hf"

    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )

    print("Testing inference...")
    # Move inputs to the same device the model was loaded onto
    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)

    result = tokenizer.decode(outputs[0])
    print(f"Result: {result}")

    return True

if __name__ == "__main__":
    test_model_loading()
    print("✅ Setup successful!")

Run the test:

python test_local.py

Environment Variables

Set up required environment variables:

# ~/.bashrc or ~/.zshrc
export HF_TOKEN="your_huggingface_token"
export FLOAT16_API_KEY="your_float16_api_key"
export FLOAT16_PROJECT="my-llm-deployment"

Reload your shell:

source ~/.bashrc  # or ~/.zshrc
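
Your code and scripts can then read these variables at runtime. A small sketch that fails fast when something is missing (the variable names follow the exports above):

# check_env.py - verify required environment variables before deploying
import os

REQUIRED = ["HF_TOKEN", "FLOAT16_API_KEY", "FLOAT16_PROJECT"]
missing = [name for name in REQUIRED if not os.environ.get(name)]

if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All required environment variables are set.")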

Troubleshooting

Common Issues

Issue: CUDA not found

# Check that the NVIDIA driver is installed and working
nvidia-smi

# nvidia-smi ships with the GPU driver, not the CUDA toolkit.
# If the command is missing, install or update the driver (Ubuntu example):
sudo ubuntu-drivers autoinstall

# For `docker run --gpus all`, the NVIDIA Container Toolkit must also be installed on the host.

Issue: Out of memory

# Use smaller precision
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # or torch.bfloat16
    device_map="auto",
    low_cpu_mem_usage=True
)
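
If fp16 still does not fit, for example a 13B model on a 24 GB card, quantization reduces the footprint further, which is what the GPU selection guide suggests for the A10G. A sketch using 8-bit loading, assuming the bitsandbytes package is installed:

# Alternative: load the model in 8-bit to roughly halve memory versus fp16
# (requires the bitsandbytes package: pip install bitsandbytes)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)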

Issue: Float16 CLI not authenticated

# Re-authenticate
float16 logout
float16 login

Checklist

Before moving to the next module, ensure you have:

  • ✅ Created a Float16.cloud account
  • ✅ Installed and authenticated the Float16 CLI
  • ✅ Selected appropriate GPU for your model
  • ✅ Created and configured a project
  • ✅ Tested model loading locally
  • ✅ Set up environment variables

Next Steps

Now that your environment is ready, let's deploy your first model!

[Continue to Module 3: Model Deployment →]
