Product

We offer three products.

  1. AI as a Service [1]

  2. Custom model [2]

  3. Inference and Optimize Platform [3]

1. AI as a Service

We instruction-tune and fine-tune models for specific languages and tasks, such as LLMs for Asian languages, Text-to-SQL, and code models.

Tokenizer
Our models do not tokenize text the same way OpenAI does. Some models, such as SeaLLM, produce fewer tokens than OpenAI's tokenizer across several Asian languages. Typhoon and OpenThaiGPT likewise produce roughly three times fewer tokens than OpenAI for Thai text.
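
As a rough illustration of the difference, the sketch below compares token counts for a short Thai sentence using OpenAI's cl100k_base tokenizer (via tiktoken) and a SeaLLM tokenizer from Hugging Face. The model id and sample sentence are assumptions for this sketch, and exact counts depend on tokenizer versions.

```python
import tiktoken
from transformers import AutoTokenizer

text = "สวัสดีครับ วันนี้อากาศดีมาก"  # sample Thai sentence, chosen for illustration

openai_enc = tiktoken.get_encoding("cl100k_base")            # OpenAI tokenizer
local_tok = AutoTokenizer.from_pretrained("SeaLLMs/SeaLLM-7B-v2")  # assumed model id

print("OpenAI tokens:", len(openai_enc.encode(text)))
print("SeaLLM tokens:", len(local_tok.encode(text)))
```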
Prompt template
For the OpenAI-compatible API version, we take care of the prompt template for you. You can use the same prompt as with the OpenAI API.
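
A minimal sketch of calling the service through the official OpenAI Python client. The base URL and model name below are placeholders for illustration, not documented values; use the endpoint and model ids from your dashboard.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.float16.cloud/v1",  # assumed endpoint, for illustration only
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="seallm-7b-v2",  # assumed model name, for illustration only
    messages=[{"role": "user", "content": "Summarize this paragraph in Thai."}],
)
print(response.choices[0].message.content)
```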
Integration
We support integration with well-known frameworks, including LangChain and LlamaIndex. You can check out integration examples on our GitHub.
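
For example, a hedged sketch of pointing LangChain's ChatOpenAI wrapper at an OpenAI-compatible endpoint; the base URL and model name are again placeholders.

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.float16.cloud/v1",  # assumed endpoint, for illustration only
    api_key="YOUR_API_KEY",
    model="openthaigpt-7b",                   # assumed model name, for illustration only
)

print(llm.invoke("Translate to English: สวัสดีตอนเช้า").content)
```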

Pricing

Model             Input ($ / 1M tokens)   Output ($ / 1M tokens)
SeaLLM-7b-v2      $0.20                   $0.20
SQLCoder-7b-2     $0.60                   $0.20
OpenThaiGPT-7b    $0.20                   $0.20
OpenThaiGPT-13b   $0.30                   $0.30
OpenThaiGPT-70b   $0.90                   $0.90
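
As a worked example of how the per-million-token rates translate to a single request (the request size here is only an illustration): a call to SeaLLM-7b-v2 with 1,000 input tokens and 500 output tokens costs 1,000/1,000,000 × $0.20 + 500/1,000,000 × $0.20 = $0.0003.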

Custom SLA

Contact us.

2. Custom model

Bring us your custom model: we will automatically optimize it and provide both an OpenAI-compatible API and a custom API for you.

Features

1. Streaming API
We provide a streaming API that lets you send a request to the model and stream the result back in real time. This is useful for chatbots and other real-time applications (see the sketch after this list).
2. OpenAI-compatible API
We provide an API that is compatible with the OpenAI API, so you can use the same prompts as with OpenAI. We also provide a custom API that you can use to instruct the model to perform specific tasks.
3. Auto-merge requests into one batch
We automatically merge incoming requests into a single batch. This reduces the number of calls to the model and lowers cost.
4. Auto scheduler
We automatically schedule requests to the model. This prevents the model from being overloaded and crashing.
5. Optimization dashboard
We provide a dashboard that shows the performance of the model, so you can monitor it and optimize it.
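
A minimal sketch of consuming the streaming API through the OpenAI-compatible interface, again with a placeholder base URL and model name.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.float16.cloud/v1",  # assumed endpoint, for illustration only
    api_key="YOUR_API_KEY",
)

# stream=True yields chunks as they are generated instead of one final response.
stream = client.chat.completions.create(
    model="openthaigpt-13b",  # assumed model name, for illustration only
    messages=[{"role": "user", "content": "Write a short product description."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```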

Optimization

Optimization can reduce the VRAM usage of your model.

Typical optimization can reduce the model size by up to 50%, and some advanced techniques can reduce it to 25% of the original size. Beyond the VRAM savings, optimization can also speed up your model at small batch sizes (fewer than 32), by roughly 2-4 times.

This can enable you to load a model 2-3 times larger than your VRAM capacity would otherwise allow.

Model size and VRAM.

Model size   Original (VRAM required)   Base optimization (VRAM required)   AWQ, GPTQ (VRAM required)
7b           14 GB                      7 GB                                4 GB
13b          26 GB                      13 GB                               8 GB
34b          68 GB                      34 GB                               18 GB
70b          140 GB                     70 GB                               40 GB
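
The table roughly follows simple per-parameter arithmetic. A back-of-the-envelope sketch for the weights alone (ignoring KV cache and activation memory, and assuming 2 bytes per parameter in FP16, about 1 byte after base optimization, and about 0.5 bytes for 4-bit AWQ/GPTQ; these byte counts are assumptions, not platform specifics):

```python
# Rough VRAM needed for model weights alone, in GB.
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB

for size in (7, 13, 34, 70):
    fp16 = weight_vram_gb(size, 2.0)   # original FP16/BF16 weights
    base = weight_vram_gb(size, 1.0)   # assumed ~8-bit base optimization
    int4 = weight_vram_gb(size, 0.5)   # AWQ / GPTQ 4-bit
    print(f"{size}b: original ~{fp16:.0f} GB, base ~{base:.0f} GB, 4-bit ~{int4:.0f} GB")
```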

Pricing

Basic

GPU       VRAM    Price / hr
A10G x1   24 GB   $1.5
A10G x4   96 GB   $6.0

Experiment (invite only)

GPU       VRAM    Price / hr
A10G x1   24 GB   $1.0
A10G x4   96 GB   $4.0

High performance

GPU       VRAM     Price / hr
A100 x4   320 GB   Contact us
H100 x1   80 GB    Contact us
H100 x2   160 GB   Contact us
H100 x4   320 GB   Contact us
H100 x8   640 GB   Contact us

What about GPU resources?

Don't worry about GPU resources. We will take care of them for you.

Interested?

Subscribe to our updates.

3. Inference and Optimize Platform

How do we differ from others?

Float16 is a hybrid between AI as a Service and Infrastructure as a Service.

Float16.Cloud offers a platform that enables your internal team to achieve performance on par with AI as a Service, but with more flexibility: upload fine-tuned models as you wish, with no container configuration required.

Highlights

1. Deploy, optimize, and move to production within 1 hour.
2. Reduce VRAM footprint by up to 75%.
3. Achieve inference speeds up to 20-40 times faster.
4. Benchmarking is ready to use, including evaluation scores and speed.

Comparison

Topic                  AI as a Service   Float16.cloud
Deploy model           Weight            Weight
Model supported        Generative AI     Generative AI and Custom
Optimizable            No                Yes
GPU usage monitoring   No                Yes
Inference speed        Fast              Fast
Privacy                Platform          Private (On-premise available)

Inference speed

Table: inference speed of Mistral-7b on an A10G.

Interested?

Contact us.