One-Click Deployment

Deploy vLLM models instantly with pre-configured settings

Float16's One-Click Deployment lets you deploy vLLM models instantly. Select from preset models or add custom HuggingFace models.

Overview

One-Click Deployment provides:

  • vLLM Framework: High-throughput LLM serving with PagedAttention
  • Preset Models: Optimized settings for popular models
  • Custom Models: Deploy any HuggingFace model
  • Endpoint Proxy: Secure access via proxy URLs

Deploying a Model

  1. Navigate to GPU Instance > Create Instance
  2. Select the One-Click Deployment tab
  3. Enter a project name (optional)
  4. Select instance type (e.g., H100)
  5. Choose a model:
    • Preset Models: Select from the model catalog
    • Custom Model: Enter a HuggingFace model ID
  6. Configure volume size (50-10000 GB)
  7. Click Create Instance

Preset Models

Choose from pre-configured models optimized for vLLM:

Model                         Provider   Capabilities                      Size
GPT-OSS-120B                  OpenAI     text, reasoning, tools, grammar   70 GB
GPT-OSS-20B                   OpenAI     text, reasoning, tools, grammar   20 GB
Qwen3-VL-235B-A22B-Instruct   Alibaba    text, vision, tools, grammar      240 GB
Qwen3-VL-30B-A3B-Instruct     Alibaba    text, vision, tools, grammar      30 GB
Qwen3-VL-32B-Instruct         Alibaba    text, vision, tools, grammar      32 GB
Qwen3-VL-8B-Instruct          Alibaba    text, vision, tools, grammar      8 GB
Llama 3.3 70B Instruct        Meta       text, tools, grammar              70 GB
Typhoon-ocr1.5-2b             SCB10X     text, vision, typhoon-ocr         6 GB
Typhoon2.5-qwen3-30b-a3b      SCB10X     text, tools, grammar              60 GB
GLM 4.7 Flash                 ZAI        text, tools, grammar              30 GB

Model Capabilities

  • text: Text generation and chat
  • vision: Image understanding
  • reasoning: Advanced reasoning capabilities
  • tools: Function/tool calling support (see the request sketch after this list)
  • grammar: Structured output support
  • typhoon-ocr: Thai/English document OCR
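
The tools capability maps to vLLM's OpenAI-compatible function-calling API. A minimal sketch against a deployed endpoint; the get_weather tool, model name, and {task_id} placeholder are illustrative, not fixed values:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://proxy-instance.float16.cloud/{task_id}/3000/v1",
)

# Hypothetical tool definition in the standard OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "What's the weather in Bangkok?"}],
    tools=tools,
)

# Models tagged "tools" can return tool calls instead of plain text.
print(response.choices[0].message.tool_calls)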

Custom Models

Deploy any compatible HuggingFace model:

  1. Select the Custom Model tab
  2. Enter the model ID in format: organization/model-name
    • Example: meta-llama/Llama-3.3-70B-Instruct
  3. Click Verify Model to check compatibility
  4. Configure volume size (ensure sufficient space for model weights; see the sizing sketch after this list)
  5. Click Create Instance
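
Before creating the instance, you can estimate the required volume by summing the weight files on HuggingFace. A minimal sizing sketch using the huggingface_hub client (the model ID is an example; gated models also require a HuggingFace token):

from huggingface_hub import HfApi

api = HfApi()
# files_metadata=True populates per-file sizes for the repo.
info = api.model_info("meta-llama/Llama-3.3-70B-Instruct", files_metadata=True)

total_bytes = sum(f.size or 0 for f in info.siblings)
print(f"Weights: {total_bytes / 1e9:.0f} GB")  # leave headroom when picking volume size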

Endpoint Proxy

Access your deployed model via secure proxy endpoints:

  • Format: https://proxy-instance.float16.cloud/{task_id}/{port}/{path}
  • Ports: 3000-4000 supported
  • Compatible with: vLLM, custom APIs, Jupyter
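
As a quick connectivity check, you can build the proxy URL and hit the server's health route. A sketch assuming port 3000 runs vLLM, whose OpenAI-compatible server exposes a standard /health endpoint; task_id is a placeholder for your instance's task ID:

import requests

task_id = "your-task-id"  # shown on the instance detail page
port = 3000               # any port in the supported 3000-4000 range

base = f"https://proxy-instance.float16.cloud/{task_id}/{port}"
resp = requests.get(f"{base}/health", timeout=10)
print(resp.status_code)  # 200 means the server behind the proxy is up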

Using with OpenAI SDK

from openai import OpenAI

# Point the OpenAI client at your deployment's proxy endpoint.
# Replace {task_id} with your instance's task ID; port 3000 serves vLLM.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://proxy-instance.float16.cloud/{task_id}/3000/v1",
)

# Standard chat completion against the deployed model.
response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
)

print(response.choices[0].message.content)

vLLM Playground

Test your deployed models with the interactive playground:

  • Tool Calling: Test function calling with example tools
  • Structured Outputs: JSON Schema, Regex patterns, Choice constraints (see the sketch after this list)
  • Typhoon OCR: Extract text from Thai/English documents
  • View Code: Copy Python, cURL, or JSON examples
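
Outside the playground, structured outputs go through the same OpenAI-compatible API using vLLM's guided decoding extensions. A sketch with guided_json (the schema and field names are illustrative; newer vLLM releases may prefer response_format):

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://proxy-instance.float16.cloud/{task_id}/3000/v1",
)

# JSON Schema the output must conform to; vLLM enforces it during decoding.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["name", "year"],
}

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "When was the transistor invented?"}],
    extra_body={"guided_json": schema},  # vLLM-specific request extension
)

print(response.choices[0].message.content)  # valid JSON matching the schema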

Pricing

Instance   On-Demand   Spot (Save 50%)   Storage
H100       $4.32/hr    $2.16/hr          $1.00/GB/mo

View current pricing at GPU Instance > Pricing.
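
For a back-of-the-envelope estimate, one week on a spot H100 with a 100 GB volume at the rates above (assuming storage is prorated hourly; check the pricing page for actual billing granularity):

hours = 24 * 7                      # one week of runtime
compute = hours * 2.16              # H100 spot, $/hr
storage = 100 * 1.00 / 730 * hours  # 100 GB at $1.00/GB/mo, ~730 hr/month
print(f"~${compute + storage:.2f}")  # ≈ $362.88 + $23.01 ≈ $385.89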

Instance Lifecycle

Manage your deployment:

Action      Description
Start       Launch the instance
Stop        Pause compute (only storage cost charged)
Resume      Continue from where you left off
Terminate   Permanently delete instance and resources

View and manage instances at GPU Instance > Instances.

Tags: deployment, one-click, models, vllm, huggingface
Last updated: February 1, 2025 · 3 min read