Chat Completions API

Generate text with LLMs using the OpenAI-compatible Chat Completions API

The Chat Completions API allows you to generate text using your deployed vLLM models. It's fully compatible with the OpenAI API format.

Endpoint

POST https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions
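
Replace {instance_id} with the ID of your deployed instance; the same base path is used in every example below.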

Basic Request

Python

from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)

cURL

curl -X POST "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ]
  }'

Request Parameters

Required

Parameter  Type    Description
---------  ------  ---------------------------
model      string  Model name on your instance
messages   array   Conversation messages

Optional

Parameter          Type           Default    Description
-----------------  -------------  ---------  ------------------------------------------
temperature        float          1.0        Sampling temperature (0-2)
top_p              float          1.0        Nucleus sampling threshold
max_tokens         integer        Model max  Maximum tokens to generate
stream             boolean        false      Enable streaming response
stop               string/array   null       Stop sequences
frequency_penalty  float          0          Penalize tokens by how often they appear
presence_penalty   float          0          Penalize tokens that have already appeared
n                  integer        1          Number of completions to generate
tools              array          null       Available tools/functions
tool_choice        string/object  "auto"     Tool selection mode
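
For example, a request combining several sampling controls (the values here are illustrative, not recommendations):

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "List three facts about the Moon."}],
    temperature=0.2,        # low temperature favors factual, repeatable output
    top_p=0.9,              # nucleus sampling threshold
    max_tokens=200,         # cap the length of the generated reply
    frequency_penalty=0.5,  # discourage tokens the reply has already used often
    stop=["\n\n"]           # stop at the first blank line
)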

Message Object

{
  "role": "user|assistant|system|tool",
  "content": "Message content",
  "name": "optional_name",
  "tool_calls": [],
  "tool_call_id": "for_tool_responses"
}
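
Only role and content are needed for ordinary turns; name, tool_calls, and tool_call_id appear in tool-calling flows. A typical multi-turn history is simply a list of these objects:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
    {"role": "user", "content": "Now double it."}
]

response = client.chat.completions.create(
    model="your-model-name",
    messages=messages
)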

Response Format

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1706745600,
  "model": "your-model-name",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 10,
    "total_tokens": 30
  }
}
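
The SDK exposes the same fields as attributes, which is handy for checking why generation stopped and for logging token usage:

choice = response.choices[0]
print(choice.finish_reason)         # "stop", "length", or "tool_calls"
print(response.usage.total_tokens)  # prompt_tokens + completion_tokens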

Streaming

Enable real-time token streaming:

from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)

stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Streaming Response Format

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion.chunk",
  "created": 1706745600,
  "model": "your-model-name",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "Hello"
      },
      "finish_reason": null
    }
  ]
}
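
Each chunk carries only an incremental delta, so reconstructing the full reply means concatenating the deltas as they arrive. A minimal sketch, assuming stream is a fresh streaming response like the one created above:

full_text = ""
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        full_text += chunk.choices[0].delta.content
print(full_text)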

cURL with Streaming

curl -X POST "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Tool Calling

Enable function calling for models with tool support:

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "user", "content": "What's the weather in Bangkok?"}
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City name"
                        }
                    },
                    "required": ["location"]
                }
            }
        }
    ],
    tool_choice="auto"
)

# Check for tool calls
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")

Learn more about Tool Calling

Structured Outputs

Generate JSON conforming to a schema using vLLM's guided generation:

JSON Schema

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "user", "content": "Extract: John is 30 years old"}
    ],
    extra_body={
        "guided_json": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"}
            },
            "required": ["name", "age"]
        }
    }
)

import json
data = json.loads(response.choices[0].message.content)
# {"name": "John", "age": 30}

Regex Pattern

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "user", "content": "Generate an email for John"}
    ],
    extra_body={
        "guided_regex": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
    }
)

Choice Constraint

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "user", "content": "Is this positive or negative?"}
    ],
    extra_body={
        "guided_choice": ["positive", "negative", "neutral"]
    }
)
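
Decoding is constrained to the listed strings, so response.choices[0].message.content is guaranteed to be exactly one of the three labels and can be used as a classifier output without any post-processing.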

Learn more about Structured Outputs

SDK Examples

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hello!"}
    ],
    temperature=0.7,
    max_tokens=100
)

print(response.choices[0].message.content)

JavaScript/Node.js

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'not-needed',
  baseURL: 'https://proxy-instance.float16.cloud/{instance_id}/3900/v1'
});

const response = await client.chat.completions.create({
  model: 'your-model-name',
  messages: [
    { role: 'user', content: 'Hello!' }
  ]
});

console.log(response.choices[0].message.content);

Error Handling

Error Response

{
  "error": {
    "message": "Error description",
    "type": "invalid_request_error",
    "code": "error_code"
  }
}

Common Errors

Error                    Description
-----------------------  ---------------------------------------------
model_not_found          Model not available on the instance
context_length_exceeded  Input too long for the model's context window
server_error             Internal vLLM server error

Handling Errors

from openai import APIError

try:
    response = client.chat.completions.create(
        model="your-model-name",
        messages=[{"role": "user", "content": "Hello"}]
    )
except APIError as e:
    print(f"API error: {e.message}")

Best Practices

  1. Use System Messages: Set context and behavior up front
  2. Manage Context: Keep conversation history to a reasonable length (see the sketch after this list)
  3. Handle Streaming: Use streaming for long responses to improve perceived latency
  4. Set Temperature: Lower for factual tasks, higher for creative ones
  5. Use Max Tokens: Set max_tokens to control response length and latency
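
For point 2, a simple approach is to keep the system messages plus only the most recent turns. A sketch (the window of 10 messages is arbitrary):

def trim_history(messages, max_messages=10):
    # Preserve system messages; keep only the newest conversational turns
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]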

Last updated: February 1, 2025