Chat Completions API
The Chat Completions API allows you to generate text using your deployed vLLM models. It is compatible with the OpenAI Chat Completions format, so you can use existing OpenAI client libraries by pointing them at your instance's base URL.
Endpoint
POST https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions
Basic Request
Python
from openai import OpenAI
client = OpenAI(
api_key="not-needed",
base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)
response = client.chat.completions.create(
model="your-model-name",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
)
print(response.choices[0].message.content)
cURL
curl -X POST "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "your-model-name",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
}'
Request Parameters
Required
| Parameter | Type | Description |
|---|---|---|
| model | string | Model name on your instance |
| messages | array | Conversation messages |
Optional
| Parameter | Type | Default | Description |
|---|---|---|---|
| temperature | float | 1.0 | Sampling temperature (0-2) |
| top_p | float | 1.0 | Nucleus sampling threshold |
| max_tokens | integer | Model max | Maximum tokens to generate |
| stream | boolean | false | Enable streaming response |
| stop | string/array | null | Stop sequences |
| frequency_penalty | float | 0 | Penalize tokens by how often they have already appeared |
| presence_penalty | float | 0 | Penalize tokens that have already appeared at all |
| n | integer | 1 | Number of completions |
| tools | array | null | Available tools/functions |
| tool_choice | string/object | "auto" | Tool selection mode |
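For example, a request that combines several of these optional parameters might look like the following, reusing the client from the Basic Request example above; the parameter values are only illustrative:

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet."}],
    temperature=0.3,   # lower temperature for a more deterministic answer
    max_tokens=200,    # cap the response length
    stop=["\n\n"],     # stop at the first blank line
    n=1                # a single completion
)
print(response.choices[0].message.content)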
Message Object
{
"role": "user|assistant|system|tool",
"content": "Message content",
"name": "optional_name",
"tool_calls": [],
"tool_call_id": "for_tool_responses"
}
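A multi-turn conversation is expressed by sending the prior turns in the messages array on each request, for example:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is vLLM?"},
    {"role": "assistant", "content": "vLLM is a high-throughput inference engine for large language models."},
    {"role": "user", "content": "How do I send it a chat request?"}
]
response = client.chat.completions.create(
    model="your-model-name",
    messages=messages   # the model sees the full history on every call
)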
Response Format
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1706745600,
"model": "your-model-name",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 20,
"completion_tokens": 10,
"total_tokens": 30
}
}
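These fields map directly onto the SDK response object, so you can inspect the finish reason and token usage like this:

choice = response.choices[0]
print(choice.message.content)       # generated text
print(choice.finish_reason)         # "stop", "length", or "tool_calls"
print(response.usage.total_tokens)  # prompt_tokens + completion_tokens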
Streaming
Enable real-time token streaming:
from openai import OpenAI
client = OpenAI(
api_key="not-needed",
base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)
stream = client.chat.completions.create(
model="your-model-name",
messages=[{"role": "user", "content": "Write a poem"}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
Streaming Response Format
{
"id": "chatcmpl-abc123",
"object": "chat.completion.chunk",
"created": 1706745600,
"model": "your-model-name",
"choices": [
{
"index": 0,
"delta": {
"content": "Hello"
},
"finish_reason": null
}
]
}
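Each chunk carries only the new delta, so to keep the full message you accumulate the deltas yourself, for example:

stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)
full_text = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:                  # the final chunk may carry no content
        full_text += delta
        print(delta, end="", flush=True)
# full_text now holds the complete response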
cURL with Streaming
curl -X POST "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "your-model-name",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
Tool Calling
Enable function calling for models with tool support:
response = client.chat.completions.create(
model="your-model-name",
messages=[
{"role": "user", "content": "What's the weather in Bangkok?"}
],
tools=[
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name"
}
},
"required": ["location"]
}
}
}
],
tool_choice="auto"
)
# Check for tool calls
if response.choices[0].message.tool_calls:
tool_call = response.choices[0].message.tool_calls[0]
print(f"Function: {tool_call.function.name}")
print(f"Arguments: {tool_call.function.arguments}")
Structured Outputs
Generate JSON conforming to a schema using vLLM's guided generation:
JSON Schema
response = client.chat.completions.create(
model="your-model-name",
messages=[
{"role": "user", "content": "Extract: John is 30 years old"}
],
extra_body={
"guided_json": {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"}
},
"required": ["name", "age"]
}
}
)
import json
data = json.loads(response.choices[0].message.content)
# {"name": "John", "age": 30}
Regex Pattern
response = client.chat.completions.create(
model="your-model-name",
messages=[
{"role": "user", "content": "Generate an email for John"}
],
extra_body={
"guided_regex": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
}
)
Choice Constraint
response = client.chat.completions.create(
model="your-model-name",
messages=[
{"role": "user", "content": "Is this positive or negative?"}
],
extra_body={
"guided_choice": ["positive", "negative", "neutral"]
}
)
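If you describe your schemas with Pydantic, one convenient option (assuming Pydantic v2 is installed) is to generate the guided_json schema from a model class and validate the response with the same class:

from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Extract: John is 30 years old"}],
    extra_body={"guided_json": Person.model_json_schema()}  # schema generated by Pydantic
)
person = Person.model_validate_json(response.choices[0].message.content)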
Learn more about Structured Outputs
SDK Examples
Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(
api_key="not-needed",
base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)
response = client.chat.completions.create(
model="your-model-name",
messages=[
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "Hello!"}
],
temperature=0.7,
max_tokens=100
)
print(response.choices[0].message.content)
JavaScript/Node.js
import OpenAI from 'openai';
const client = new OpenAI({
apiKey: 'not-needed',
baseURL: 'https://proxy-instance.float16.cloud/{instance_id}/3900/v1'
});
const response = await client.chat.completions.create({
model: 'your-model-name',
messages: [
{ role: 'user', content: 'Hello!' }
]
});
console.log(response.choices[0].message.content);
Error Handling
Error Response
{
"error": {
"message": "Error description",
"type": "invalid_request_error",
"code": "error_code"
}
}
Common Errors
| Error | Description |
|---|---|
| model_not_found | Model not available on instance |
| context_length_exceeded | Input too long for model |
| server_error | vLLM server error |
Handling Errors
from openai import APIError
try:
response = client.chat.completions.create(
model="your-model-name",
messages=[{"role": "user", "content": "Hello"}]
)
except APIError as e:
print(f"API error: {e.message}")
Best Practices
- Use System Messages: Set context and behavior
- Manage Context: Keep conversation history within the model's context window (see the trimming sketch after this list)
- Handle Streaming: Use for long responses
- Set Temperature: Lower for factual, higher for creative
- Use Max Tokens: Control response length
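One simple way to manage context is to cap the number of past turns you send. The helper below is only a sketch; the 10-message limit is an arbitrary choice, and conversation_history stands in for your own stored history:

conversation_history = [
    {"role": "system", "content": "You are a helpful assistant."},
    # ...earlier turns accumulated over the conversation...
    {"role": "user", "content": "And what about tomorrow?"}
]

def trim_history(messages, max_messages=10):
    # Keep the system message plus the most recent turns
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]

response = client.chat.completions.create(
    model="your-model-name",
    messages=trim_history(conversation_history)
)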
Next Steps
- API Reference Overview - Endpoint proxy details
- Tool Calling - Function calling guide
- Structured Outputs - JSON schema generation
- vLLM Playground - Interactive testing