API Reference

Access your deployed vLLM models via the OpenAI-compatible API through the Float16 endpoint proxy.

Endpoint Proxy

Float16 provides a secure proxy to access services running on your GPU instances.

Endpoint Format

https://proxy-instance.float16.cloud/{instance_id}/{port}/{path}
Component     Description
instance_id   Your GPU instance ID (UUID)
port          Port number (3000-4000; default 3900 for vLLM)
path          API path (e.g., v1/chat/completions)
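For example, with a hypothetical instance ID of 123e4567-e89b-12d3-a456-426614174000, a chat completion request to a vLLM server on the default port would go to:

https://proxy-instance.float16.cloud/123e4567-e89b-12d3-a456-426614174000/3900/v1/chat/completions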

Finding Your Endpoint

  1. Navigate to GPU Instance > Instances
  2. Click View on your vLLM instance
  3. Select the Endpoint tab
  4. Copy the endpoint URL

OpenAI-Compatible API

vLLM deployments provide an OpenAI-compatible API. Use the standard OpenAI SDK with your Float16 endpoint.

Base URL

https://proxy-instance.float16.cloud/{instance_id}/3900/v1

Authentication

vLLM on Float16 does not require API key authentication, but the OpenAI SDK expects a value, so set api_key to any placeholder:

from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)
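To confirm the connection works, you can list the models served by your instance. A minimal sketch reusing the client above (the instance ID placeholder is yours to fill in):

# List the models served by this vLLM instance to verify connectivity.
models = client.models.list()
for model in models.data:
    print(model.id)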

Available Endpoints

Endpoint              Method   Description
/v1/chat/completions  POST     Generate chat completions
/v1/models            GET      List available models
/v1/health            GET      Check server health
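The /v1/health endpoint is not wrapped by the OpenAI SDK, but a plain HTTP GET works. A minimal sketch using the requests library (the instance ID below is a placeholder):

import requests

# Replace with your GPU instance UUID.
instance_id = "your-instance-id"
base = f"https://proxy-instance.float16.cloud/{instance_id}/3900"

# A 200 response indicates the vLLM server is up and ready to serve requests.
print(requests.get(f"{base}/v1/health").status_code)

# /v1/models lists the models currently served by this instance.
print(requests.get(f"{base}/v1/models").json())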

Quick Start

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)

cURL

curl -X POST "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

JavaScript

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'not-needed',
  baseURL: 'https://proxy-instance.float16.cloud/{instance_id}/3900/v1'
});

const response = await client.chat.completions.create({
  model: 'your-model-name',
  messages: [{ role: 'user', content: 'Hello!' }]
});

console.log(response.choices[0].message.content);

Features

Streaming

Enable real-time token streaming:

stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Tool Calling

For models with tool calling support:

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {"type": "object", "properties": {...}}
        }
    }]
)
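If the model decides to call the tool, the call appears on the response message rather than as plain text. A minimal sketch of reading it, assuming the response above:

import json

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name)                   # e.g. "get_weather"
    print(json.loads(call.function.arguments))  # arguments as a dict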

Learn more about Tool Calling

Structured Outputs

For models served with guided decoding (grammar) support:

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Extract: John is 30"}],
    extra_body={
        "guided_json": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"}
            }
        }
    }
)
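Because guided_json constrains generation to the schema, the reply can be parsed directly. A minimal sketch, assuming the response above:

import json

# The content is valid JSON matching the guided_json schema.
data = json.loads(response.choices[0].message.content)
print(data["name"], data["age"])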

Learn more about Structured Outputs

Proxy Information

  • Supported Ports: 3000-4000
  • Default vLLM Port: 3900
  • Protocol: HTTPS (SSL handled by proxy)
  • Streaming: Supports SSE with Accept: text/event-stream header
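For clients that speak raw HTTP, streaming works through the proxy by sending the Accept: text/event-stream header and "stream": true. A minimal sketch using the requests library (the instance ID is a placeholder):

import json
import requests

# Replace with your GPU instance UUID.
url = "https://proxy-instance.float16.cloud/your-instance-id/3900/v1/chat/completions"

with requests.post(
    url,
    headers={"Accept": "text/event-stream"},
    json={
        "model": "your-model-name",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    },
    stream=True,
) as resp:
    # Each SSE line looks like "data: {json chunk}"; the stream ends with "data: [DONE]".
    for line in resp.iter_lines():
        if not line:
            continue
        payload = line.decode().removeprefix("data: ")
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"].get("content")
        if delta:
            print(delta, end="")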
