# LLM Deployment

Float16 provides a streamlined way to deploy and serve large language models (LLMs) in production. Deploy vLLM models instantly with One-Click Deployment on dedicated GPU instances.
## Deploying an LLM

### One-Click Deployment
- Navigate to GPU Instance > Create Instance
- Select the One-Click Deployment tab
- Enter a project name (optional)
- Select an instance type (e.g., H100)
- Choose a model:
  - Preset Models: Select from the model catalog
  - Custom Model: Enter a HuggingFace model ID
- Configure the volume size (50-10,000 GB; at least 100 GB for model weights)
- Click Create Instance
### Preset Models
Choose from pre-configured models optimized for vLLM:
| Model | Provider | Capabilities | Size |
|---|---|---|---|
| GPT-OSS-120B | OpenAI | text, reasoning, tools, grammar | 70 GB |
| GPT-OSS-20B | OpenAI | text, reasoning, tools, grammar | 20 GB |
| Qwen3-VL-235B-A22B-Instruct | Alibaba | text, vision, tools, grammar | 240 GB |
| Qwen3-VL-30B-A3B-Instruct | Alibaba | text, vision, tools, grammar | 30 GB |
| Qwen3-VL-32B-Instruct | Alibaba | text, vision, tools, grammar | 32 GB |
| Qwen3-VL-8B-Instruct | Alibaba | text, vision, tools, grammar | 8 GB |
| Llama 3.3 70B Instruct | Meta | text, tools, grammar | 70 GB |
| Typhoon-ocr1.5-2b | SCB10X | text, vision, typhoon-ocr | 6 GB |
| Typhoon2.5-qwen3-30b-a3b | SCB10X | text, tools, grammar | 60 GB |
| GLM 4.7 Flash | ZAI | text, tools, grammar | 30 GB |
### Model Capabilities
- text: Text generation and chat
- vision: Image understanding
- reasoning: Advanced reasoning capabilities
- tools: Function/tool calling support
- grammar: Structured output support
- typhoon-ocr: Thai/English document OCR
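For example, the tools and grammar capabilities map to standard fields of an OpenAI-compatible chat request. The sketch below only illustrates the request shape; the `get_weather` tool and its schema are made up, and the exact structured-output options depend on the deployed vLLM version (see the Tool Calling and Structured Outputs guides).

```python
# Illustrative request bodies only; the get_weather tool is a made-up example.

# tools: the model may answer with a function call instead of plain text
tool_request = {
    "model": "your-model",
    "messages": [{"role": "user", "content": "What's the weather in Bangkok?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

# grammar: constrain the reply to structured output, e.g. valid JSON
structured_request = {
    "model": "your-model",
    "messages": [{"role": "user", "content": "List three colors as a JSON object"}],
    "response_format": {"type": "json_object"},
}
```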
### Custom Models

Deploy any compatible HuggingFace model:

- Select the Custom Model tab
- Enter the model ID in the format `organization/model-name` (for example, `meta-llama/Llama-3.3-70B-Instruct`)
- Click Verify Model to check compatibility (see the sketch below for an optional local pre-check)
- Configure volume size (ensure sufficient space for model weights)
- Click Create Instance
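Before deploying, you can optionally pre-check a custom model ID locally. A minimal sketch, assuming `huggingface_hub` is installed (gated models such as Llama also require a HuggingFace access token):

```python
from huggingface_hub import HfApi

MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"  # the custom model ID you plan to deploy

api = HfApi()
# Raises an error if the repository does not exist or is not accessible.
info = api.model_info(MODEL_ID, files_metadata=True)

# Sum the weight shards to estimate how much volume space the model needs.
weight_bytes = sum(
    f.size or 0
    for f in info.siblings
    if f.rfilename.endswith((".safetensors", ".bin"))
)
print(f"{MODEL_ID}: ~{weight_bytes / 1e9:.0f} GB of weights")
```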
## Endpoint Proxy

Access your deployed model via secure proxy endpoints.

### Endpoint Configuration

- Instance ID: Your unique instance identifier
- Port: Configurable between 3000 and 4000 (vLLM API default: 3900)
- Endpoint URL: `https://proxy-instance.float16.cloud/{instance_id}/{port}/{path}`
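For example, a small helper that assembles the proxy URL (the instance ID below is a placeholder; use the value shown on your instance's Endpoint tab):

```python
INSTANCE_ID = "your-instance-id"  # placeholder; copy from the Endpoint tab
PORT = 3900                       # vLLM API default

def endpoint(path: str) -> str:
    """Build the full proxy URL for an API path."""
    return f"https://proxy-instance.float16.cloud/{INSTANCE_ID}/{PORT}/{path.lstrip('/')}"

print(endpoint("/v1/chat/completions"))
# -> https://proxy-instance.float16.cloud/your-instance-id/3900/v1/chat/completions
```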
### Usage Examples

#### cURL

```bash
curl https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "messages": [{"role": "user", "content": "Hello"}]}'
```
#### Python

```python
import requests

response = requests.post(
    "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    json={
        "model": "your-model",
        "messages": [{"role": "user", "content": "Hello"}]
    }
)
print(response.json())
```
#### OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
### Streaming Responses

The proxy supports both JSON and Server-Sent Events (SSE) streaming:

#### JSON Response (Default)

Returns the complete response after generation finishes.

#### SSE Streaming

Real-time token streaming. Add the `Accept: text/event-stream` header and set `"stream": true` in the request body.
```python
import requests

response = requests.post(
    "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Accept": "text/event-stream"
    },
    json={
        "model": "your-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        print(line.decode('utf-8'))
```
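If you prefer not to parse SSE lines yourself, the OpenAI SDK can consume the same stream and yield parsed chunks. A minimal sketch reusing the client setup from the earlier example:

```python
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)

# stream=True makes the SDK request SSE and yield incremental chunks.
stream = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```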
## Managing Deployments

### Instance Dashboard

View and manage your deployments at GPU Instance > Instances.
Each instance includes:
| Tab | Description |
|---|---|
| Overview | Instance details, status, GPU type, running time |
| Logs | Real-time deployment logs |
| Endpoint | Proxy URL, port configuration, usage examples |
| Playground | Interactive testing environment |
### Instance Lifecycle
| Action | Description |
|---|---|
| Start | Launch the instance |
| Stop | Pause compute (only storage cost charged) |
| Resume | Continue from where you left off |
| Terminate | Permanently delete instance and resources |
### Instance History
Track all stop/resume sessions for your instance, including task ID, status, timestamps, and duration.
## Proxy Information

- Only ports 3000-4000 are accessible through the proxy
- Endpoint format: `https://proxy-instance.float16.cloud/${instance_id}/${port}/${path}`
- HTTPS is handled by the proxy; your internal service can use HTTP
- For SSE streaming, include the `Accept: text/event-stream` header
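As a quick connectivity check through the proxy, you can call the server's OpenAI-compatible model listing route (a standard vLLM endpoint; the instance ID is a placeholder):

```python
import requests

# List the models served by the vLLM instance via the proxy.
resp = requests.get(
    "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/models",
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```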
## Pricing
| Instance | On-Demand | Spot (Save 50%) | Storage |
|---|---|---|---|
| H100 | $4.32/hr | $2.16/hr | $1.00/GB/mo |
View current pricing at GPU Instance > Pricing.
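As a rough worked example using the rates above (an estimate only; the hours and volume size are assumptions, so check the console for current pricing):

```python
# Estimate one month of H100 usage with the listed rates.
HOURLY_ON_DEMAND = 4.32      # USD per hour
HOURLY_SPOT = 2.16           # USD per hour
STORAGE_PER_GB_MONTH = 1.00  # USD per GB per month

hours = 8 * 22      # e.g. 8 hours/day, 22 working days
volume_gb = 100     # volume size chosen at deployment

on_demand = hours * HOURLY_ON_DEMAND + volume_gb * STORAGE_PER_GB_MONTH
spot = hours * HOURLY_SPOT + volume_gb * STORAGE_PER_GB_MONTH
print(f"On-demand: ${on_demand:,.2f}/month, Spot: ${spot:,.2f}/month")
```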
## Next Steps
- vLLM Playground - Test your models
- Tool Calling - Use function calling
- Structured Outputs - Generate structured responses