LLM Deployment

Deploy and serve large language models on Float16 Cloud

Float16 provides a streamlined way to deploy and serve large language models (LLMs) in production. Deploy vLLM models instantly with One-Click Deployment on dedicated GPU instances.

Deploying an LLM

One-Click Deployment

  1. Navigate to GPU Instance > Create Instance
  2. Select the One-Click Deployment tab
  3. Enter a project name (optional)
  4. Select instance type (e.g., H100)
  5. Choose a model:
    • Preset Models: Select from the model catalog
    • Custom Model: Enter a HuggingFace model ID
  6. Configure volume size (50-10,000 GB; allow at least 100 GB for model weights)
  7. Click Create Instance

Preset Models

Choose from pre-configured models optimized for vLLM:

Model | Provider | Capabilities | Size
GPT-OSS-120B | OpenAI | text, reasoning, tools, grammar | 70 GB
GPT-OSS-20B | OpenAI | text, reasoning, tools, grammar | 20 GB
Qwen3-VL-235B-A22B-Instruct | Alibaba | text, vision, tools, grammar | 240 GB
Qwen3-VL-30B-A3B-Instruct | Alibaba | text, vision, tools, grammar | 30 GB
Qwen3-VL-32B-Instruct | Alibaba | text, vision, tools, grammar | 32 GB
Qwen3-VL-8B-Instruct | Alibaba | text, vision, tools, grammar | 8 GB
Llama 3.3 70B Instruct | Meta | text, tools, grammar | 70 GB
Typhoon-ocr1.5-2b | SCB10X | text, vision, typhoon-ocr | 6 GB
Typhoon2.5-qwen3-30b-a3b | SCB10X | text, tools, grammar | 60 GB
GLM 4.7 Flash | ZAI | text, tools, grammar | 30 GB

Model Capabilities

  • text: Text generation and chat
  • vision: Image understanding
  • reasoning: Advanced reasoning capabilities
  • tools: Function/tool calling support
  • grammar: Structured output support
  • typhoon-ocr: Thai/English document OCR

Custom Models

Deploy any compatible HuggingFace model:

  1. Select the Custom Model tab
  2. Enter the model ID in format: organization/model-name
    • Example: meta-llama/Llama-3.3-70B-Instruct
  3. Click Verify Model to check compatibility
  4. Configure volume size (ensure sufficient space for the model weights; see the sizing sketch after this list)
  5. Click Create Instance
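
Before creating the instance, you can estimate how much space the weights will need. A minimal sketch using the huggingface_hub client (the model ID is the example from step 2, gated repositories may require a HuggingFace token, and this check runs on your own machine, not on Float16):

from huggingface_hub import HfApi

# Sum the size of the weight files listed on the HuggingFace Hub to gauge
# how large the instance volume should be (rough estimate only).
api = HfApi()
info = api.model_info("meta-llama/Llama-3.3-70B-Instruct", files_metadata=True)
weight_bytes = sum(
    (f.size or 0)
    for f in info.siblings
    if f.rfilename.endswith((".safetensors", ".bin"))
)
print(f"Approximate weight size: {weight_bytes / 1e9:.0f} GB")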

Endpoint Proxy

Access your deployed model via secure proxy endpoints.

Endpoint Configuration

  • Instance ID: Your unique instance identifier
  • Port: Configurable between 3000-4000 (vLLM API default: 3900)
  • Endpoint URL: https://proxy-instance.float16.cloud/{instance_id}/{port}/{path}

Usage Examples

cURL

curl https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "messages": [{"role": "user", "content": "Hello"}]}'

Python

import requests

# Replace {instance_id} with the instance identifier shown on the dashboard.
response = requests.post(
    "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    json={
        "model": "your-model",
        "messages": [{"role": "user", "content": "Hello"}]
    }
)
print(response.json())

OpenAI SDK

from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
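
Structured Output

The grammar capability listed earlier maps to structured output. A minimal sketch, assuming the deployed vLLM server accepts the OpenAI-style response_format parameter (supported by recent vLLM releases); the model name and prompt are placeholders:

from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)

# Ask the server to constrain the reply to a JSON object. Whether this is
# honored depends on the vLLM version and the model behind the endpoint.
response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Return a city and its country as JSON."}],
    response_format={"type": "json_object"}
)

print(response.choices[0].message.content)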

Streaming Responses

The proxy supports both plain JSON responses and Server-Sent Events (SSE) streaming:

JSON Response (Default)

Returns the complete response after generation finishes.

SSE Streaming

Real-time token streaming. Add the Accept: text/event-stream header and set "stream": true in the request body.

import requests

response = requests.post(
    "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Accept": "text/event-stream"
    },
    json={
        "model": "your-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        print(line.decode('utf-8'))
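
Each streamed line follows the OpenAI-style SSE format: a data: prefix, a JSON chunk, and a final [DONE] sentinel. A minimal sketch that turns the raw lines into printed tokens, assuming that chunk format:

import json
import requests

response = requests.post(
    "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Accept": "text/event-stream"
    },
    json={
        "model": "your-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if not line:
        continue
    decoded = line.decode("utf-8")
    if not decoded.startswith("data: "):
        continue  # skip SSE comments and keep-alives
    payload = decoded[len("data: "):]
    if payload.strip() == "[DONE]":
        break  # end-of-stream sentinel
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        continue
    delta = choices[0].get("delta", {}).get("content")
    if delta:
        print(delta, end="", flush=True)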

Managing Deployments

Instance Dashboard

View and manage your deployments at GPU Instance > Instances.

Each instance includes:

Tab | Description
Overview | Instance details, status, GPU type, running time
Logs | Real-time deployment logs
Endpoint | Proxy URL, port configuration, usage examples
Playground | Interactive testing environment

Instance Lifecycle

Action | Description
Start | Launch the instance
Stop | Pause compute (only storage cost charged)
Resume | Continue from where you left off
Terminate | Permanently delete instance and resources

Instance History

Track all stop/resume sessions for your instance, including task ID, status, timestamps, and duration.

Proxy Information

  • Only ports 3000-4000 are accessible through the proxy
  • Endpoint format: https://proxy-instance.float16.cloud/${instance_id}/${port}/${path} (a small builder sketch follows this list)
  • HTTPS is handled by the proxy - your internal service can use HTTP
  • For SSE streaming, include Accept: text/event-stream header
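
As a small illustration, a hypothetical helper (not part of the platform) that assembles the endpoint URL and enforces the allowed port range:

def endpoint_url(instance_id: str, port: int, path: str = "v1/chat/completions") -> str:
    # Only ports 3000-4000 are reachable through the proxy.
    if not 3000 <= port <= 4000:
        raise ValueError("port must be between 3000 and 4000")
    return f"https://proxy-instance.float16.cloud/{instance_id}/{port}/{path.lstrip('/')}"

print(endpoint_url("my-instance-id", 3900))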

Pricing

Instance | On-Demand | Spot (Save 50%) | Storage
H100 | $4.32/hr | $2.16/hr | $1.00/GB/mo

View current pricing at GPU Instance > Pricing.
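
As a rough back-of-the-envelope estimate using the figures above (the usage pattern is an assumption; check the pricing page for current rates):

# Rough monthly cost for an H100 spot instance with a 100 GB volume.
gpu_hours_per_month = 8 * 22   # e.g. 8 hours/day, 22 working days (assumption)
spot_rate = 2.16               # $/hr, H100 spot (table above)
storage_gb = 100               # volume size in GB
storage_rate = 1.00            # $/GB/month (table above)

compute = gpu_hours_per_month * spot_rate
storage = storage_gb * storage_rate
print(f"Compute ${compute:.2f} + Storage ${storage:.2f} = ${compute + storage:.2f}/month")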


Tags: llm, deployment, inference, vllm, huggingface
Last updated: February 1, 2025