# One-Click Deployment
Float16's One-Click Deployment lets you deploy models served with vLLM instantly. Select from preset models or add custom HuggingFace models.
## Overview
One-Click Deployment provides:
- vLLM Framework: High-throughput LLM serving with PagedAttention
- Preset Models: Optimized settings for popular models
- Custom Models: Deploy any HuggingFace model
- Endpoint Proxy: Secure access via proxy URLs
## Deploying a Model
- Navigate to GPU Instance > Create Instance
- Select the One-Click Deployment tab
- Enter a project name (optional)
- Select instance type (e.g., H100)
- Choose a model:
  - Preset Models: Select from the model catalog
  - Custom Model: Enter a HuggingFace model ID
- Configure volume size (50-10000 GB)
- Click Create Instance
## Preset Models
Choose from pre-configured models optimized for vLLM:
| Model | Provider | Capabilities | Size |
|---|---|---|---|
| GPT-OSS-120B | OpenAI | text, reasoning, tools, grammar | 70 GB |
| GPT-OSS-20B | OpenAI | text, reasoning, tools, grammar | 20 GB |
| Qwen3-VL-235B-A22B-Instruct | Alibaba | text, vision, tools, grammar | 240 GB |
| Qwen3-VL-30B-A3B-Instruct | Alibaba | text, vision, tools, grammar | 30 GB |
| Qwen3-VL-32B-Instruct | Alibaba | text, vision, tools, grammar | 32 GB |
| Qwen3-VL-8B-Instruct | Alibaba | text, vision, tools, grammar | 8 GB |
| Llama 3.3 70B Instruct | Meta | text, tools, grammar | 70 GB |
| Typhoon-ocr1.5-2b | SCB10X | text, vision, typhoon-ocr | 6 GB |
| Typhoon2.5-qwen3-30b-a3b | SCB10X | text, tools, grammar | 60 GB |
| GLM 4.7 Flash | ZAI | text, tools, grammar | 30 GB |
### Model Capabilities
- text: Text generation and chat
- vision: Image understanding
- reasoning: Advanced reasoning capabilities
- tools: Function/tool calling support (see the sketch after this list)
- grammar: Structured output support
- typhoon-ocr: Thai/English document OCR
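As a rough illustration of the tools capability, the sketch below registers a single hypothetical get_weather function through the OpenAI-compatible API; the base_url pattern and placeholder model name follow the SDK example later in this guide, and the function itself is invented for illustration:

```python
from openai import OpenAI

# Placeholder values; replace {task_id} and the model name with your deployment's details.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://proxy-instance.float16.cloud/{task_id}/3000/v1",
)

# A hypothetical tool definition the model may choose to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "What's the weather in Bangkok right now?"}],
    tools=tools,
)

# If the model decided to call the tool, its name and arguments appear here.
print(response.choices[0].message.tool_calls)
```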
## Custom Models
Deploy any compatible HuggingFace model:
- Select the Custom Model tab
- Enter the model ID in the format `organization/model-name`, for example `meta-llama/Llama-3.3-70B-Instruct` (a quick way to pre-check an ID is sketched after these steps)
- Click Verify Model to check compatibility
- Configure volume size (ensure sufficient space for model weights)
- Click Create Instance
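If you want to sanity-check a model ID before entering it, the huggingface_hub client can confirm the repository exists and report the approximate size of its files. This is only a local pre-check, assuming huggingface_hub is installed; it does not replace the Verify Model step:

```python
from huggingface_hub import HfApi

api = HfApi()

# Look up the repository; gated models (e.g. meta-llama) may require token="hf_...".
info = api.model_info("meta-llama/Llama-3.3-70B-Instruct", files_metadata=True)

# Sum the reported file sizes to estimate how much volume space the weights need.
total_bytes = sum(f.size or 0 for f in info.siblings)
print(f"{info.id}: ~{total_bytes / 1e9:.1f} GB of files")
```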
## Endpoint Proxy
Access your deployed model via secure proxy endpoints:
- Format: `https://proxy-instance.float16.cloud/{task_id}/{port}/{path}` (a quick connectivity check is sketched after this list)
- Ports: 3000-4000 supported
- Compatible with: vLLM, custom APIs, Jupyter
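As a quick connectivity check, you can request vLLM's /health route through the proxy. The task ID, port, and whether the proxy expects your API key in an Authorization header are assumptions to adapt to your instance:

```python
import requests

TASK_ID = "YOUR_TASK_ID"   # from your instance details
PORT = 3000                # the port your vLLM server listens on

# vLLM's OpenAI-compatible server answers GET /health with 200 once it is ready.
url = f"https://proxy-instance.float16.cloud/{TASK_ID}/{PORT}/health"
resp = requests.get(url, headers={"Authorization": "Bearer YOUR_API_KEY"}, timeout=10)
print(resp.status_code)
```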
### Using with OpenAI SDK
```python
from openai import OpenAI

# Point the client at your deployment's proxy endpoint (replace {task_id} with your task ID).
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://proxy-instance.float16.cloud/{task_id}/3000/v1"
)

# Standard chat completion request against the served model.
response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)
```
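The same client can also stream tokens as they are generated; this is standard OpenAI SDK behavior rather than a Float16-specific feature:

```python
# `client` is the OpenAI client configured in the block above.
stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```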
## vLLM Playground
Test your deployed models with the interactive playground:
- Tool Calling: Test function calling with example tools
- Structured Outputs: JSON Schema, Regex patterns, Choice constraints (see the sketch after this list)
- Typhoon OCR: Extract text from Thai/English documents
- View Code: Copy Python, cURL, or JSON examples
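For the Structured Outputs mode, vLLM's OpenAI-compatible server accepts guided-decoding options via extra_body; the sketch below constrains the reply to a small JSON Schema. Exact parameter names can differ between vLLM versions, so treat this as an assumption and compare it against the playground's View Code output:

```python
import json
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://proxy-instance.float16.cloud/{task_id}/3000/v1",
)

# A small JSON Schema the model's output must conform to.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
}

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Give the largest city in Thailand as JSON."}],
    extra_body={"guided_json": schema},  # vLLM guided-decoding option (assumed; verify with View Code)
)

print(json.loads(response.choices[0].message.content))
```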
## Pricing
| Instance | On-Demand | Spot (Save 50%) | Storage |
|---|---|---|---|
| H100 | $4.32/hr | $2.16/hr | $1.00/GB/mo |
View current pricing at GPU Instance > Pricing.
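As a rough worked example: an H100 running around the clock for 30 days is about 720 h × $4.32/hr ≈ $3,110 on-demand, or roughly $1,555 on spot, plus $1.00 per GB per month for the volume size you configured.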
## Instance Lifecycle
Manage your deployment:
| Action | Description |
|---|---|
| Start | Launch the instance |
| Stop | Pause compute (only storage cost charged) |
| Resume | Continue from where you left off |
| Terminate | Permanently delete instance and resources |
View and manage instances at GPU Instance > Instances.
## Next Steps
- GPU Platform Overview - Learn about Base VM instances
- Volumes & Storage - Manage persistent storage