Chat Completions API
The Chat Completions API allows you to generate text using your deployed vLLM models. It is compatible with the OpenAI Chat Completions format, so you can use existing OpenAI client libraries by pointing them at your instance's base URL.
Endpoint
POST https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions
Basic Request
Python
from openai import OpenAI
client = OpenAI(
api_key="not-needed",
base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)
response = client.chat.completions.create(
model="your-model-name",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
)
print(response.choices[0].message.content)
cURL
curl -X POST "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "your-model-name",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
}'
Request Parameters
Required
| Parameter | Type | Description |
|---|---|---|
| model | string | Model name on your instance |
| messages | array | Conversation messages |
Optional
| Parameter | Type | Default | Description |
|---|---|---|---|
| temperature | float | 1.0 | Sampling temperature (0-2) |
| top_p | float | 1.0 | Nucleus sampling threshold |
| max_tokens | integer | Model max | Maximum tokens to generate |
| stream | boolean | false | Enable streaming response |
| stop | string/array | null | Stop sequences |
| frequency_penalty | float | 0 | Penalize tokens by how often they have already appeared |
| presence_penalty | float | 0 | Penalize tokens that have already appeared at all |
| n | integer | 1 | Number of completions |
| tools | array | null | Available tools/functions |
| tool_choice | string/object | "auto" | Tool selection mode |
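For example, a request that combines several of these optional parameters might look like the following, reusing the client from the Basic Request example above; the parameter values are only illustrative:

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet."}],
    temperature=0.3,   # lower temperature for a more deterministic answer
    max_tokens=200,    # cap the response length
    stop=["\n\n"],     # stop at the first blank line
    n=1                # a single completion
)
print(response.choices[0].message.content)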
Message Object
{
"role": "user|assistant|system|tool",
"content": "Message content",
"name": "optional_name",
"tool_calls": [],
"tool_call_id": "for_tool_responses"
}
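A multi-turn conversation is expressed by sending the prior turns in the messages array on each request, for example:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is vLLM?"},
    {"role": "assistant", "content": "vLLM is a high-throughput inference engine for large language models."},
    {"role": "user", "content": "How do I send it a chat request?"}
]
response = client.chat.completions.create(
    model="your-model-name",
    messages=messages   # the model sees the full history on every call
)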
Response Format
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1706745600,
"model": "your-model-name",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 20,
"completion_tokens": 10,
"total_tokens": 30
}
}
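These fields map directly onto the SDK response object, so you can inspect the finish reason and token usage like this:

choice = response.choices[0]
print(choice.message.content)       # generated text
print(choice.finish_reason)         # "stop", "length", or "tool_calls"
print(response.usage.total_tokens)  # prompt_tokens + completion_tokens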
Streaming
Enable real-time token streaming:
from openai import OpenAI
client = OpenAI(
api_key="not-needed",
base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)
stream = client.chat.completions.create(
model="your-model-name",
messages=[{"role": "user", "content": "Write a poem"}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
Streaming Response Format
{
"id": "chatcmpl-abc123",
"object": "chat.completion.chunk",
"created": 1706745600,
"model": "your-model-name",
"choices": [
{
"index": 0,
"delta": {
"content": "Hello"
},
"finish_reason": null
}
]
}
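Each chunk carries only the new delta, so to keep the full message you accumulate the deltas yourself, for example:

stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)
full_text = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:                  # the final chunk may carry no content
        full_text += delta
        print(delta, end="", flush=True)
# full_text now holds the complete response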
cURL with Streaming
curl -X POST "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "your-model-name",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
Tool Calling
Enable function calling for models with tool support:
response = client.chat.completions.create(
model="your-model-name",
messages=[
{"role": "user", "content": "What's the weather in Bangkok?"}
],
tools=[
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name"
}
},
"required": ["location"]
}
}
}
],
tool_choice="auto"
)
# Check for tool calls
if response.choices[0].message.tool_calls:
tool_call = response.choices[0].message.tool_calls[0]
print(f"Function: {tool_call.function.name}")
print(f"Arguments: {tool_call.function.arguments}")
Structured Outputs
Generate JSON conforming to a schema using vLLM's guided generation:
JSON Schema
response = client.chat.completions.create(
model="your-model-name",
messages=[
{"role": "user", "content": "Extract: John is 30 years old"}
],
extra_body={
"guided_json": {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"}
},
"required": ["name", "age"]
}
}
)
import json
data = json.loads(response.choices[0].message.content)
# {"name": "John", "age": 30}
Regex Pattern
response = client.chat.completions.create(
model="your-model-name",
messages=[
{"role": "user", "content": "Generate an email for John"}
],
extra_body={
"guided_regex": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
}
)
Choice Constraint
response = client.chat.completions.create(
model="your-model-name",
messages=[
{"role": "user", "content": "Is this positive or negative?"}
],
extra_body={
"guided_choice": ["positive", "negative", "neutral"]
}
)
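If you describe your schemas with Pydantic, one convenient option (assuming Pydantic v2 is installed) is to generate the guided_json schema from a model class and validate the response with the same class:

from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Extract: John is 30 years old"}],
    extra_body={"guided_json": Person.model_json_schema()}  # schema generated by Pydantic
)
person = Person.model_validate_json(response.choices[0].message.content)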
Learn more about Structured Outputs
SDK Examples
Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(
api_key="not-needed",
base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)
response = client.chat.completions.create(
model="your-model-name",
messages=[
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "Hello!"}
],
temperature=0.7,
max_tokens=100
)
print(response.choices[0].message.content)
JavaScript/Node.js
import OpenAI from 'openai';
const client = new OpenAI({
apiKey: 'not-needed',
baseURL: 'https://proxy-instance.float16.cloud/{instance_id}/3900/v1'
});
const response = await client.chat.completions.create({
model: 'your-model-name',
messages: [
{ role: 'user', content: 'Hello!' }
]
});
console.log(response.choices[0].message.content);
Error Handling
Error Response
{
"error": {
"message": "Error description",
"type": "invalid_request_error",
"code": "error_code"
}
}
Common Errors
| Error | Description |
|---|---|
| model_not_found | Model not available on instance |
| context_length_exceeded | Input too long for model |
| server_error | vLLM server error |
Handling Errors
from openai import APIError
try:
response = client.chat.completions.create(
model="your-model-name",
messages=[{"role": "user", "content": "Hello"}]
)
except APIError as e:
print(f"API error: {e.message}")
Best Practices
- Use System Messages: Set context and behavior
- Manage Context: Keep conversation history within the model's context window (see the trimming sketch after this list)
- Handle Streaming: Use for long responses
- Set Temperature: Lower for factual, higher for creative
- Use Max Tokens: Control response length
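One simple way to manage context is to cap the number of past turns you send. The helper below is only a sketch; the 10-message limit is an arbitrary choice, and conversation_history stands in for your own stored history:

conversation_history = [
    {"role": "system", "content": "You are a helpful assistant."},
    # ...earlier turns accumulated over the conversation...
    {"role": "user", "content": "And what about tomorrow?"}
]

def trim_history(messages, max_messages=10):
    # Keep the system message plus the most recent turns
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]

response = client.chat.completions.create(
    model="your-model-name",
    messages=trim_history(conversation_history)
)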
Next Steps
- API Reference Overview - Endpoint proxy details
- Tool Calling - Function calling guide
- Structured Outputs - JSON schema generation
- vLLM Playground - Interactive testing