vLLM Playground
The vLLM Playground is an interactive environment for testing your deployed models with real-time inference. It is available for GPU instances deployed through One-Click Deployment (instances tagged "vLLM").
Accessing the Playground
- Deploy a model using GPU Instance > Create Instance > One-Click Deployment
- Navigate to GPU Instance > Instances
- Click View on an instance tagged "vLLM"
- Select the Playground tab
Interface Overview
The Playground has two main sections: Settings and Chat.
Settings Panel
| Setting | Description |
|---|---|
| Server Status | Health indicator (Healthy/Unhealthy) with refresh button |
| Port | Port selector (3000-4000), default 3900 for vLLM |
| Model | Select from available models on your instance |
| Temperature | Controls randomness (0-2), default 0.7 |
| Max Tokens | Maximum response length (64-4096), default 512 |
| Streaming | Enable real-time token streaming |
| Tool Calling | Enable function calling mode |
| Structured Output | Enable structured response format |
Chat Panel
- Endpoint URL: Displays the API endpoint for your model
- View Code: Copy integration code (Python, cURL, JSON)
- Message Input: Type messages (Shift+Enter for new line)
- Clear Chat: Reset the conversation
Tool Calling
Test function calling with pre-configured example tools.
Enabling Tool Calling
- Toggle Tool Calling to ON in Settings
- Select Tool Choice:
  - Auto: Model decides when to use tools
  - Required: Model must use a tool
  - None: Disable tool usage
Available Tools
The Playground includes three example tools:
- Weather: Get weather information for a location
- Calculator: Perform mathematical calculations
- Search: Search for information
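For reference, the sketch below shows roughly how the same flow looks from your own code via the OpenAI SDK. The weather tool schema here is illustrative; the Playground's built-in tool definitions may differ, and the endpoint and model name are placeholders to replace with the values from View Code.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1",
    api_key="not-needed"  # vLLM doesn't require an API key
)

# Illustrative schema in the spirit of the Playground's Weather tool;
# the real definition may differ.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather information for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "What's the weather in Bangkok?"}],
    tools=tools,
    tool_choice="auto"  # "required" / "none" mirror the other presets
)

# When the model opts to call a tool, the call arrives here
# instead of plain text content.
print(response.choices[0].message.tool_calls)
```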
Try It
- "What's the weather in Bangkok?"
- "Calculate 25 * 4"
Structured Output
Generate responses in specific formats using JSON Schema, Regex patterns, or Choice constraints.
Enabling Structured Output
- Toggle Structured Output to ON in Settings
- Select an Output Format preset
Output Format Presets
| Format | Type | Description |
|---|---|---|
| Person Info | JSON Schema | Extract name, age, occupation, email |
| Sentiment Analysis | JSON Schema | Analyze sentiment with confidence |
| Product Review | JSON Schema | Extract product review details |
| Simple JSON | JSON Schema | Basic key-value structure |
| Email Pattern | Regex | Match email format |
| Yes/No Choice | Choice | Binary response constraint |
| Rating Choice | Choice | Rating scale constraint |
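Behind these presets, vLLM's OpenAI-compatible server accepts guided-decoding options. The sketch below assumes the common `extra_body` form (`guided_json` here; `guided_regex` and `guided_choice` work the same way) and a made-up schema in the spirit of the Person Info preset; the exact mechanism and preset definitions in the Playground may differ.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1",
    api_key="not-needed"  # vLLM doesn't require an API key
)

# Made-up schema loosely matching the Person Info preset.
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "occupation": {"type": "string"},
        "email": {"type": "string"}
    },
    "required": ["name", "age", "occupation", "email"]
}

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{
        "role": "user",
        "content": "Extract info: John Smith is a 32-year-old "
                   "software engineer at john.smith@example.com"
    }],
    # vLLM reads guided-decoding options from extra_body.
    extra_body={"guided_json": person_schema}
)

print(response.choices[0].message.content)  # constrained to valid JSON
```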
Try It
- "Extract info: John Smith is a 32-year-old software engineer at john.smith@example.com"
Learn more about Structured Outputs
View Code
Click View Code to get integration examples:
Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1",
    api_key="not-needed"  # vLLM doesn't require an API key
)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
    temperature=0.7,
    max_tokens=512,
    stream=False
)

print(response.choices[0].message.content)
```
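If you enable Streaming in Settings, the equivalent change in code is `stream=True`; the response then arrives as incremental chunks. A minimal sketch, reusing the `client` and placeholders from the snippet above:

```python
stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=0.7,
    max_tokens=512,
    stream=True
)

for chunk in stream:
    if not chunk.choices:
        continue  # some servers send a final chunk without choices
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```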
cURL
```bash
curl -X POST https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
    "max_tokens": 512
  }'
```
Requirements
- GPU instance deployed with One-Click Deployment (vLLM tag)
- Instance status must be Running
- vLLM server must be healthy (check Server Status indicator)
Troubleshooting
Server Status: Unhealthy
- Verify the instance is running
- Confirm the correct port is selected (default 3900 for vLLM)
- View instance Logs tab for errors
- Wait for vLLM server to finish loading the model
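To probe the server from outside the console, vLLM's OpenAI-compatible server exposes a `/health` route that returns HTTP 200 once the model has loaded. A minimal check, assuming the proxy forwards `/health` the same way it forwards `/v1` (substitute your real instance ID):

```python
import urllib.request

# Placeholder URL; substitute your instance ID from the Playground.
url = "https://proxy-instance.float16.cloud/{instance_id}/3900/health"

try:
    with urllib.request.urlopen(url, timeout=10) as resp:
        # vLLM returns 200 with an empty body once the model is loaded.
        print("healthy" if resp.status == 200 else f"unexpected status: {resp.status}")
except Exception as exc:
    print(f"unhealthy: {exc}")
```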
No Models Available
- The vLLM server may still be loading
- Check Logs tab for model loading progress
- Refresh the page after a few minutes
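You can also ask the server which models it is serving; an empty list usually means loading has not finished. A quick sketch with the OpenAI SDK (same placeholder endpoint as above):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1",
    api_key="not-needed"
)

# Prints nothing while the model is still loading.
for model in client.models.list():
    print(model.id)
```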
Next Steps
- LLM Deployment - Deploy models
- Tool Calling - Implement function calling
- Structured Outputs - Generate structured responses