LLM Deployment

Deploy and serve large language models on Float16 Cloud

Float16 provides a streamlined way to deploy and serve large language models (LLMs) in production. Deploy vLLM models instantly with One-Click Deployment on dedicated GPU instances.

Deploying an LLM

One-Click Deployment

  1. Navigate to GPU Instance > Create Instance
  2. Select the One-Click Deployment tab
  3. Enter a project name (optional)
  4. Select instance type (e.g., H100)
  5. Choose a model:
    • Preset Models: Select from the model catalog
    • Custom Model: Enter a HuggingFace model ID
  6. Configure volume size (50-10,000 GB; allow at least 100 GB for model weights)
  7. Click Create Instance

Preset Models

Choose from pre-configured models optimized for vLLM:

Model | Provider | Capabilities | Size
GPT-OSS-120B | OpenAI | text, reasoning, tools, grammar | 70 GB
GPT-OSS-20B | OpenAI | text, reasoning, tools, grammar | 20 GB
Qwen3-VL-235B-A22B-Instruct | Alibaba | text, vision, tools, grammar | 240 GB
Qwen3-VL-30B-A3B-Instruct | Alibaba | text, vision, tools, grammar | 30 GB
Qwen3-VL-32B-Instruct | Alibaba | text, vision, tools, grammar | 32 GB
Qwen3-VL-8B-Instruct | Alibaba | text, vision, tools, grammar | 8 GB
Llama 3.3 70B Instruct | Meta | text, tools, grammar | 70 GB
Typhoon-ocr1.5-2b | SCB10X | text, vision, typhoon-ocr | 6 GB
Typhoon2.5-qwen3-30b-a3b | SCB10X | text, tools, grammar | 60 GB
GLM 4.7 Flash | ZAI | text, tools, grammar | 30 GB

Model Capabilities

  • text: Text generation and chat
  • vision: Image understanding
  • reasoning: Advanced reasoning capabilities
  • tools: Function/tool calling support
  • grammar: Structured output support
  • typhoon-ocr: Thai/English document OCR

Custom Models

Deploy any compatible HuggingFace model:

  1. Select the Custom Model tab
  2. Enter the model ID in format: organization/model-name
    • Example: meta-llama/Llama-3.3-70B-Instruct
  3. Click Verify Model to check compatibility
  4. Configure volume size (ensure sufficient space for the model weights; see the sizing sketch after this list)
  5. Click Create Instance
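
Before creating the instance, you can estimate how much space the weights will need. A minimal sketch using the huggingface_hub client (the model ID is the example from step 2, gated repositories may require a HuggingFace token, and this check runs on your own machine, not on Float16):

from huggingface_hub import HfApi

# Sum the size of the weight files listed on the HuggingFace Hub to gauge
# how large the instance volume should be (rough estimate only).
api = HfApi()
info = api.model_info("meta-llama/Llama-3.3-70B-Instruct", files_metadata=True)
weight_bytes = sum(
    (f.size or 0)
    for f in info.siblings
    if f.rfilename.endswith((".safetensors", ".bin"))
)
print(f"Approximate weight size: {weight_bytes / 1e9:.0f} GB")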

Endpoint Proxy

Access your deployed model via secure proxy endpoints.

Endpoint Configuration

  • Instance ID: Your unique instance identifier
  • Port: Configurable between 3000-4000 (vLLM API default: 3900)
  • Endpoint URL: https://proxy-instance.float16.cloud/{instance_id}/{port}/{path}

Usage Examples

cURL

curl https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "messages": [{"role": "user", "content": "Hello"}]}'

Python

import requests

# Replace {instance_id} with the instance identifier shown on the dashboard.
response = requests.post(
    "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    json={
        "model": "your-model",
        "messages": [{"role": "user", "content": "Hello"}]
    }
)
print(response.json())

OpenAI SDK

from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
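
Structured Output

The grammar capability listed earlier maps to structured output. A minimal sketch, assuming the deployed vLLM server accepts the OpenAI-style response_format parameter (supported by recent vLLM releases); the model name and prompt are placeholders:

from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)

# Ask the server to constrain the reply to a JSON object. Whether this is
# honored depends on the vLLM version and the model behind the endpoint.
response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Return a city and its country as JSON."}],
    response_format={"type": "json_object"}
)

print(response.choices[0].message.content)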

Streaming Responses

The proxy supports both plain JSON responses and Server-Sent Events (SSE) streaming:

JSON Response (Default)

Returns the complete response after generation finishes.

SSE Streaming

Real-time token streaming. Add the Accept: text/event-stream header and set "stream": true in the request body.

import requests

response = requests.post(
    "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Accept": "text/event-stream"
    },
    json={
        "model": "your-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        print(line.decode('utf-8'))
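
Each streamed line follows the OpenAI-style SSE format: a data: prefix, a JSON chunk, and a final [DONE] sentinel. A minimal sketch that turns the raw lines into printed tokens, assuming that chunk format:

import json
import requests

response = requests.post(
    "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Accept": "text/event-stream"
    },
    json={
        "model": "your-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if not line:
        continue
    decoded = line.decode("utf-8")
    if not decoded.startswith("data: "):
        continue  # skip SSE comments and keep-alives
    payload = decoded[len("data: "):]
    if payload.strip() == "[DONE]":
        break  # end-of-stream sentinel
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        continue
    delta = choices[0].get("delta", {}).get("content")
    if delta:
        print(delta, end="", flush=True)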

Managing Deployments

Instance Dashboard

View and manage your deployments at GPU Instance > Instances.

Each instance includes:

Tab | Description
Overview | Instance details, status, GPU type, running time
Logs | Real-time deployment logs
Endpoint | Proxy URL, port configuration, usage examples
Playground | Interactive testing environment

Instance Lifecycle

Action | Description
Start | Launch the instance
Stop | Pause compute (only storage cost charged)
Resume | Continue from where you left off
Terminate | Permanently delete instance and resources

Instance History

Track all stop/resume sessions for your instance, including task ID, status, timestamps, and duration.

Proxy Information

  • Only ports 3000-4000 are accessible through the proxy
  • Endpoint format: https://proxy-instance.float16.cloud/${instance_id}/${port}/${path} (a small builder sketch follows this list)
  • HTTPS is handled by the proxy - your internal service can use HTTP
  • For SSE streaming, include Accept: text/event-stream header
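
As a small illustration, a hypothetical helper (not part of the platform) that assembles the endpoint URL and enforces the allowed port range:

def endpoint_url(instance_id: str, port: int, path: str = "v1/chat/completions") -> str:
    # Only ports 3000-4000 are reachable through the proxy.
    if not 3000 <= port <= 4000:
        raise ValueError("port must be between 3000 and 4000")
    return f"https://proxy-instance.float16.cloud/{instance_id}/{port}/{path.lstrip('/')}"

print(endpoint_url("my-instance-id", 3900))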

Pricing

Instance | On-Demand | Spot (Save 50%) | Storage
H100 | $4.32/hr | $2.16/hr | $1.00/GB/mo

View current pricing at GPU Instance > Pricing.
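
As a rough back-of-the-envelope estimate using the figures above (the usage pattern is an assumption; check the pricing page for current rates):

# Rough monthly cost for an H100 spot instance with a 100 GB volume.
gpu_hours_per_month = 8 * 22   # e.g. 8 hours/day, 22 working days (assumption)
spot_rate = 2.16               # $/hr, H100 spot (table above)
storage_gb = 100               # volume size in GB
storage_rate = 1.00            # $/GB/month (table above)

compute = gpu_hours_per_month * spot_rate
storage = storage_gb * storage_rate
print(f"Compute ${compute:.2f} + Storage ${storage:.2f} = ${compute + storage:.2f}/month")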


Tags: llm, deployment, inference, vllm, huggingface
Last updated: February 1, 2025