API Reference
Access your deployed vLLM models via the OpenAI-compatible API through the Float16 endpoint proxy.
Endpoint Proxy
Float16 provides a secure proxy to access services running on your GPU instances.
Endpoint Format
https://proxy-instance.float16.cloud/{instance_id}/{port}/{path}
| Component | Description |
|---|---|
| `instance_id` | Your GPU instance ID (UUID) |
| `port` | Port number (3000-4000, default 3900 for vLLM) |
| `path` | API path (e.g., `v1/chat/completions`) |
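For example, the full chat-completions URL can be assembled from these components. This is a minimal sketch; the instance ID below is a placeholder, so substitute your own:

# Build a Float16 proxy URL from its components.
# NOTE: the instance ID here is a placeholder -- copy yours from the console.
INSTANCE_ID = "123e4567-e89b-12d3-a456-426614174000"
PORT = 3900  # default vLLM port
PATH = "v1/chat/completions"

endpoint = f"https://proxy-instance.float16.cloud/{INSTANCE_ID}/{PORT}/{PATH}"
print(endpoint)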
Finding Your Endpoint
- Navigate to GPU Instance > Instances
- Click View on your vLLM instance
- Select the Endpoint tab
- Copy the endpoint URL
OpenAI-Compatible API
vLLM deployments provide an OpenAI-compatible API. Use the standard OpenAI SDK with your Float16 endpoint.
Base URL
https://proxy-instance.float16.cloud/{instance_id}/3900/v1
Authentication
vLLM on Float16 does not require API key authentication. Set api_key to any value:
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)
Available Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Generate chat completions |
| `/v1/models` | GET | List available models |
| `/v1/health` | GET | Check server health |
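For example, you can list the models served by your instance and check its health from Python. This is a minimal sketch based on the table above; replace the {instance_id} placeholder with your own:

import requests
from openai import OpenAI

# Replace {instance_id} with your GPU instance ID.
BASE_URL = "https://proxy-instance.float16.cloud/{instance_id}/3900"

# List the models served by this vLLM instance.
client = OpenAI(api_key="not-needed", base_url=BASE_URL + "/v1")
for model in client.models.list().data:
    print(model.id)

# Check server health (an HTTP 200 means the server is ready).
health = requests.get(BASE_URL + "/v1/health")
print(health.status_code)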
Quick Start
Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="https://proxy-instance.float16.cloud/{instance_id}/3900/v1"
)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)
cURL
curl -X POST "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
JavaScript
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'not-needed',
  baseURL: 'https://proxy-instance.float16.cloud/{instance_id}/3900/v1'
});

const response = await client.chat.completions.create({
  model: 'your-model-name',
  messages: [{ role: 'user', content: 'Hello!' }]
});

console.log(response.choices[0].message.content);
Features
Streaming
Enable real-time token streaming:
stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Tool Calling
For models with tool calling support:
response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            # Example JSON schema -- describe your own tool's parameters here.
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                }
            }
        }
    }]
)
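If the model decides to call the tool, the arguments are returned on the assistant message. A minimal sketch of reading them, assuming the standard OpenAI SDK response shape:

import json

# Tool calls, if any, are attached to the assistant message.
tool_calls = response.choices[0].message.tool_calls or []
for call in tool_calls:
    args = json.loads(call.function.arguments)
    print(call.function.name, args)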
Structured Outputs
For models with grammar support:
response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Extract: John is 30"}],
    extra_body={
        "guided_json": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"}
            }
        }
    }
)
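Because generation is constrained to the schema, the response content can be parsed directly. A small sketch:

import json

# guided_json constrains the output to valid JSON matching the schema above.
data = json.loads(response.choices[0].message.content)
print(data["name"], data["age"])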
Learn more about Structured Outputs
Proxy Information
- Supported Ports: 3000-4000
- Default vLLM Port: 3900
- Protocol: HTTPS (SSL handled by proxy)
- Streaming: Supports SSE with the Accept: text/event-stream header
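If you call the proxy directly over HTTP rather than through an SDK, include the Accept header shown above. A minimal streaming sketch using the requests library; the {instance_id} placeholder and model name are yours to fill in:

import requests

url = "https://proxy-instance.float16.cloud/{instance_id}/3900/v1/chat/completions"
payload = {
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
}

# Request server-sent events and read the stream line by line.
with requests.post(url, json=payload, headers={"Accept": "text/event-stream"}, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line:
            print(line)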
Next Steps
- Chat Completions API - Full API reference
- LLM Deployment - Deploy your models
- vLLM Playground - Test interactively