vLLM Playground
The vLLM Playground is an interactive environment for testing your deployed models with real-time inference. It is available for GPU instances deployed through One-Click Deployment (instances tagged "vLLM").
Accessing the Playground
- Deploy a model using GPU Instance > Create Instance > One-Click Deployment
- Navigate to GPU Instance > Instances
- Click View on an instance tagged "vLLM"
- Select the Playground tab
Interface Overview
The Playground has two main sections: Settings and Chat.
Settings Panel
| Setting | Description |
|---|---|
| Server Status | Health indicator (Healthy/Unhealthy) with refresh button |
| Port | Port selector (3000-4000), default 3900 for vLLM |
| Model | Select from available models on your instance |
| Temperature | Controls randomness (0-2), default 0.7 |
| Max Tokens | Maximum response length (64-4096), default 512 |
| Streaming | Enable real-time token streaming |
| Tool Calling | Enable function calling mode |
| Structured Output | Enable structured response format |
Chat Panel
- Endpoint URL: Displays the API endpoint for your model
- View Code: Copy integration code (Python, cURL, JSON)
- Message Input: Type messages (Shift+Enter for new line)
- Clear Chat: Reset the conversation
Tool Calling
Test function calling with pre-configured example tools.
Enabling Tool Calling
- Toggle Tool Calling to ON in Settings
- Select Tool Choice:
  - Auto: Model decides when to use tools
  - Required: Model must use a tool
  - None: Disable tool usage
Available Tools
The Playground includes three example tools:
- Weather: Get weather information for a location
- Calculator: Perform mathematical calculations
- Search: Search for information
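For reference, the sketch below shows roughly how the same flow looks from your own code via the OpenAI SDK. The weather tool schema here is illustrative; the Playground's built-in tool definitions may differ, and the endpoint and model name are placeholders to replace with the values from View Code.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1",
    api_key="not-needed"  # vLLM doesn't require an API key
)

# Illustrative schema in the spirit of the Playground's Weather tool;
# the real definition may differ.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather information for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "What's the weather in Bangkok?"}],
    tools=tools,
    tool_choice="auto"  # "required" / "none" mirror the other presets
)

# When the model opts to call a tool, the call arrives here
# instead of plain text content.
print(response.choices[0].message.tool_calls)
```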
Try It
- "What's the weather in Bangkok?"
- "Calculate 25 * 4"
Structured Output
Generate responses in specific formats using JSON Schema, Regex patterns, or Choice constraints.
Enabling Structured Output
- Toggle Structured Output to ON in Settings
- Select an Output Format preset
Output Format Presets
| Format | Type | Description |
|---|---|---|
| Person Info | JSON Schema | Extract name, age, occupation, email |
| Sentiment Analysis | JSON Schema | Analyze sentiment with confidence |
| Product Review | JSON Schema | Extract product review details |
| Simple JSON | JSON Schema | Basic key-value structure |
| Email Pattern | Regex | Match email format |
| Yes/No Choice | Choice | Binary response constraint |
| Rating Choice | Choice | Rating scale constraint |
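Behind these presets, vLLM's OpenAI-compatible server accepts guided-decoding options. The sketch below assumes the common `extra_body` form (`guided_json` here; `guided_regex` and `guided_choice` work the same way) and a made-up schema in the spirit of the Person Info preset; the exact mechanism and preset definitions in the Playground may differ.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1",
    api_key="not-needed"  # vLLM doesn't require an API key
)

# Made-up schema loosely matching the Person Info preset.
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "occupation": {"type": "string"},
        "email": {"type": "string"}
    },
    "required": ["name", "age", "occupation", "email"]
}

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{
        "role": "user",
        "content": "Extract info: John Smith is a 32-year-old "
                   "software engineer at john.smith@example.com"
    }],
    # vLLM reads guided-decoding options from extra_body.
    extra_body={"guided_json": person_schema}
)

print(response.choices[0].message.content)  # constrained to valid JSON
```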
Try It
- "Extract info: John Smith is a 32-year-old software engineer at john.smith@example.com"
Learn more about Structured Outputs
View Code
Click View Code to get integration examples:
Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1",
    api_key="not-needed"  # vLLM doesn't require an API key
)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
    temperature=0.7,
    max_tokens=512,
    stream=False
)

print(response.choices[0].message.content)
```
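If you enable Streaming in Settings, the equivalent change in code is `stream=True`; the response then arrives as incremental chunks. A minimal sketch, reusing the `client` and placeholders from the snippet above:

```python
stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=0.7,
    max_tokens=512,
    stream=True
)

for chunk in stream:
    if not chunk.choices:
        continue  # some servers send a final chunk without choices
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```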
cURL
```bash
curl -X POST https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
    "max_tokens": 512
  }'
```
Requirements
- GPU instance deployed with One-Click Deployment (vLLM tag)
- Instance status must be Running
- vLLM server must be healthy (check Server Status indicator)
Troubleshooting
Server Status: Unhealthy
- Verify the instance is running
- Confirm the correct port is selected (default 3900 for vLLM)
- View instance Logs tab for errors
- Wait for vLLM server to finish loading the model
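To probe the server from outside the console, vLLM's OpenAI-compatible server exposes a `/health` route that returns HTTP 200 once the model has loaded. A minimal check, assuming the proxy forwards `/health` the same way it forwards `/v1` (substitute your real instance ID):

```python
import urllib.request

# Placeholder URL; substitute your instance ID from the Playground.
url = "https://proxy-instance.float16.cloud/{instance_id}/3900/health"

try:
    with urllib.request.urlopen(url, timeout=10) as resp:
        # vLLM returns 200 with an empty body once the model is loaded.
        print("healthy" if resp.status == 200 else f"unexpected status: {resp.status}")
except Exception as exc:
    print(f"unhealthy: {exc}")
```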
No Models Available
- The vLLM server may still be loading
- Check Logs tab for model loading progress
- Refresh the page after a few minutes
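You can also ask the server which models it is serving; an empty list usually means loading has not finished. A quick sketch with the OpenAI SDK (same placeholder endpoint as above):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1",
    api_key="not-needed"
)

# Prints nothing while the model is still loading.
for model in client.models.list():
    print(model.id)
```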
Next Steps
- LLM Deployment - Deploy models
- Tool Calling - Implement function calling
- Structured Outputs - Generate structured responses