One-Click Deployment

Deploy vLLM models instantly with pre-configured settings

Float16's One-Click Deployment lets you deploy vLLM models instantly. Select from preset models or add custom HuggingFace models.

Overview

One-Click Deployment provides:

  • vLLM Framework: High-throughput LLM serving with PagedAttention
  • Preset Models: Optimized settings for popular models
  • Custom Models: Deploy any HuggingFace model
  • Endpoint Proxy: Secure access via proxy URLs

Deploying a Model

  1. Navigate to GPU Instance > Create Instance
  2. Select the One-Click Deployment tab
  3. Enter a project name (optional)
  4. Select instance type (e.g., H100)
  5. Choose a model:
    • Preset Models: Select from the model catalog
    • Custom Model: Enter a HuggingFace model ID
  6. Configure volume size (50-10000 GB)
  7. Click Create Instance

Preset Models

Choose from pre-configured models optimized for vLLM:

Model                         Provider   Capabilities                      Size
GPT-OSS-120B                  OpenAI     text, reasoning, tools, grammar   70 GB
GPT-OSS-20B                   OpenAI     text, reasoning, tools, grammar   20 GB
Qwen3-VL-235B-A22B-Instruct   Alibaba    text, vision, tools, grammar      240 GB
Qwen3-VL-30B-A3B-Instruct     Alibaba    text, vision, tools, grammar      30 GB
Qwen3-VL-32B-Instruct         Alibaba    text, vision, tools, grammar      32 GB
Qwen3-VL-8B-Instruct          Alibaba    text, vision, tools, grammar      8 GB
Llama 3.3 70B Instruct        Meta       text, tools, grammar              70 GB
Typhoon-ocr1.5-2b             SCB10X     text, vision, typhoon-ocr         6 GB
Typhoon2.5-qwen3-30b-a3b      SCB10X     text, tools, grammar              60 GB
GLM 4.7 Flash                 ZAI        text, tools, grammar              30 GB

Model Capabilities

  • text: Text generation and chat
  • vision: Image understanding
  • reasoning: Advanced reasoning capabilities
  • tools: Function/tool calling support (see the request sketch after this list)
  • grammar: Structured output support
  • typhoon-ocr: Thai/English document OCR
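
The tools capability maps to vLLM's OpenAI-compatible function-calling API. A minimal sketch against a deployed endpoint; the get_weather tool, model name, and {task_id} placeholder are illustrative, not fixed values:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://proxy-instance.float16.cloud/{task_id}/3000/v1",
)

# Hypothetical tool definition in the standard OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "What's the weather in Bangkok?"}],
    tools=tools,
)

# Models tagged "tools" can return tool calls instead of plain text.
print(response.choices[0].message.tool_calls)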

Custom Models

Deploy any compatible HuggingFace model:

  1. Select the Custom Model tab
  2. Enter the model ID in format: organization/model-name
    • Example: meta-llama/Llama-3.3-70B-Instruct
  3. Click Verify Model to check compatibility
  4. Configure volume size (ensure sufficient space for model weights; see the sizing sketch after this list)
  5. Click Create Instance
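
Before creating the instance, you can estimate the required volume by summing the weight files on HuggingFace. A minimal sizing sketch using the huggingface_hub client (the model ID is an example; gated models also require a HuggingFace token):

from huggingface_hub import HfApi

api = HfApi()
# files_metadata=True populates per-file sizes for the repo.
info = api.model_info("meta-llama/Llama-3.3-70B-Instruct", files_metadata=True)

total_bytes = sum(f.size or 0 for f in info.siblings)
print(f"Weights: {total_bytes / 1e9:.0f} GB")  # leave headroom when picking volume size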

Endpoint Proxy

Access your deployed model via secure proxy endpoints:

  • Format: https://proxy-instance.float16.cloud/{task_id}/{port}/{path}
  • Ports: 3000-4000 supported
  • Compatible with: vLLM, custom APIs, Jupyter
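
As a quick connectivity check, you can build the proxy URL and hit the server's health route. A sketch assuming port 3000 runs vLLM, whose OpenAI-compatible server exposes a standard /health endpoint; task_id is a placeholder for your instance's task ID:

import requests

task_id = "your-task-id"  # shown on the instance detail page
port = 3000               # any port in the supported 3000-4000 range

base = f"https://proxy-instance.float16.cloud/{task_id}/{port}"
resp = requests.get(f"{base}/health", timeout=10)
print(resp.status_code)  # 200 means the server behind the proxy is up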

Using with OpenAI SDK

from openai import OpenAI

# Point the OpenAI client at your deployment's proxy endpoint.
# Replace {task_id} with your instance's task ID; port 3000 serves vLLM.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://proxy-instance.float16.cloud/{task_id}/3000/v1",
)

# Standard chat completion against the deployed model.
response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
)

print(response.choices[0].message.content)

vLLM Playground

Test your deployed models with the interactive playground:

  • Tool Calling: Test function calling with example tools
  • Structured Outputs: JSON Schema, Regex patterns, Choice constraints (see the sketch after this list)
  • Typhoon OCR: Extract text from Thai/English documents
  • View Code: Copy Python, cURL, or JSON examples
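
Outside the playground, structured outputs go through the same OpenAI-compatible API using vLLM's guided decoding extensions. A sketch with guided_json (the schema and field names are illustrative; newer vLLM releases may prefer response_format):

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://proxy-instance.float16.cloud/{task_id}/3000/v1",
)

# JSON Schema the output must conform to; vLLM enforces it during decoding.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["name", "year"],
}

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "When was the transistor invented?"}],
    extra_body={"guided_json": schema},  # vLLM-specific request extension
)

print(response.choices[0].message.content)  # valid JSON matching the schema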

Pricing

Instance   On-Demand   Spot (Save 50%)   Storage
H100       $4.32/hr    $2.16/hr          $1.00/GB/mo

View current pricing at GPU Instance > Pricing.
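
For a back-of-the-envelope estimate, one week on a spot H100 with a 100 GB volume at the rates above (assuming storage is prorated hourly; check the pricing page for actual billing granularity):

hours = 24 * 7                      # one week of runtime
compute = hours * 2.16              # H100 spot, $/hr
storage = 100 * 1.00 / 730 * hours  # 100 GB at $1.00/GB/mo, ~730 hr/month
print(f"~${compute + storage:.2f}")  # ≈ $362.88 + $23.01 ≈ $385.89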

Instance Lifecycle

Manage your deployment:

Action      Description
Start       Launch the instance
Stop        Pause compute (only storage cost charged)
Resume      Continue from where you left off
Terminate   Permanently delete instance and resources

View and manage instances at GPU Instance > Instances.

Tags: deployment, one-click, models, vllm, huggingface
Last updated: February 1, 2025 · 3 min read