vLLM Playground

Interactive environment for testing deployed vLLM models

The vLLM Playground is an interactive environment for testing your deployed models with real-time inference. It is available for GPU instances deployed via One-Click Deployment with the vLLM tag.

Accessing the Playground

  1. Deploy a model using GPU Instance > Create Instance > One-Click Deployment
  2. Navigate to GPU Instance > Instances
  3. Click View on a vLLM instance (instances with the "vLLM" tag)
  4. Select the Playground tab

Interface Overview

The Playground has two main sections: Settings and Chat.

Settings Panel

Setting            Description
Server Status      Health indicator (Healthy/Unhealthy) with refresh button
Port               Port selector (3000-4000), default 3900 for vLLM
Model              Select from available models on your instance
Temperature        Controls randomness (0-2), default 0.7
Max Tokens         Maximum response length (64-4096), default 512
Streaming          Enable real-time token streaming
Tool Calling       Enable function calling mode
Structured Output  Enable structured response format

Chat Panel

  • Endpoint URL: Displays the API endpoint for your model
  • View Code: Copy integration code (Python, cURL, JSON)
  • Message Input: Type messages (Shift+Enter for new line)
  • Clear Chat: Reset the conversation

Tool Calling

Test function calling with pre-configured example tools.

Enabling Tool Calling

  1. Toggle Tool Calling to ON in Settings
  2. Select Tool Choice:
    • Auto: Model decides when to use tools
    • Required: Model must use a tool
    • None: Disable tool usage

Available Tools

The Playground includes three example tools:

  • Weather: Get weather information for a location
  • Calculator: Perform mathematical calculations
  • Search: Search for information

Try It

  • "What's the weather in Bangkok?"
  • "Calculate 25 * 4"

Learn more about Tool Calling

Structured Output

Generate responses in specific formats using JSON Schema, Regex patterns, or Choice constraints.

Enabling Structured Output

  1. Toggle Structured Output to ON in Settings
  2. Select an Output Format preset

Output Format Presets

Format              Type         Description
Person Info         JSON Schema  Extract name, age, occupation, email
Sentiment Analysis  JSON Schema  Analyze sentiment with confidence
Product Review      JSON Schema  Extract product review details
Simple JSON         JSON Schema  Basic key-value structure
Email Pattern       Regex        Match email format
Yes/No Choice       Choice       Binary response constraint
Rating Choice       Choice       Rating scale constraint

Try It

  • Select Sentiment Analysis and ask: "I absolutely love this product!"
  • Select Yes/No Choice and ask: "Is Python a programming language?"
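
Over the API, one common way to reproduce these constraints is vLLM's guided-decoding parameters, passed through the OpenAI SDK's extra_body. A sketch for a Choice-style constraint; the exact parameter names (guided_choice, guided_json, guided_regex) depend on your vLLM version:

from openai import OpenAI

client = OpenAI(
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Is Python a programming language? Answer yes or no."}],
    # Constrain the output to one of the listed strings (vLLM guided decoding)
    extra_body={"guided_choice": ["yes", "no"]}
)

print(response.choices[0].message.content)  # "yes" or "no"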

Learn more about Structured Outputs

View Code

Click View Code to get integration examples:

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1",
    api_key="not-needed"  # vLLM doesn't require API key
)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
    temperature=0.7,
    max_tokens=512,
    stream=False
)

print(response.choices[0].message.content)
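
Python (Streaming)

With Streaming enabled, the same endpoint returns tokens as they are generated. A minimal sketch, reusing the client configured above:

stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=0.7,
    max_tokens=512,
    stream=True  # receive tokens incrementally
)

for chunk in stream:
    # Each chunk carries a delta with zero or more new characters
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()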

cURL

curl -X POST https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
    "max_tokens": 512
  }'

Requirements

  • GPU instance deployed with One-Click Deployment (vLLM tag)
  • Instance status must be Running
  • vLLM server must be healthy (check Server Status indicator)

Troubleshooting

Server Status: Unhealthy

  • Verify the instance is running
  • Check the correct port (default 3900 for vLLM)
  • View instance Logs tab for errors
  • Wait for vLLM server to finish loading the model
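
You can also probe the server directly. A sketch in Python, assuming the instance proxy forwards vLLM's /health route:

import requests

# Replace {instance_id}; assumes the proxy forwards vLLM's /health route
url = "https://proxy-instance.float16.cloud/{instance_id}/3900/health"
resp = requests.get(url, timeout=10)
print(resp.status_code)  # 200 indicates a healthy vLLM server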

No Models Available

  • The vLLM server may still be loading
  • Check Logs tab for model loading progress
  • Refresh the page after a few minutes
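
Listing models over the API is another quick check; an empty list or a connection error usually means the server is still loading. A sketch using the OpenAI SDK against the same assumed endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1",
    api_key="not-needed"
)

# vLLM serves the OpenAI-compatible /v1/models route
for model in client.models.list():
    print(model.id)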

Tags: vllm, playground, testing, chat, interactive
Last updated: February 1, 2025