Product
We offer three products.
1. AI as a Service
We instruction-tune and fine-tune models for specific languages and tasks, such as LLMs for Asian languages, Text-to-SQL, and code generation.
- Tokenizer
- Our models' tokenizers do not count tokens the same way OpenAI's does. Some models, such as SeaLLM, produce fewer tokens than OpenAI's tokenizer across several Asian languages. Typhoon and OpenThaiGPT produce roughly three times fewer tokens than OpenAI for Thai text.
- Prompt template
- For the OpenAI-compatible API, we handle the prompt template for you, so you can use the same prompts as with the OpenAI API.
- Integration
- We support integration with well-known frameworks, including LangChain and LlamaIndex. You can find integration examples on our GitHub.
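As a sketch of the OpenAI-compatible usage described above, the snippet below builds an OpenAI-style chat-completions payload. The endpoint URL in the trailing comment is a placeholder, not our actual address; substitute the values from your dashboard.

```python
import json

def build_chat_request(model: str, user_message: str) -> dict:
    """Build a chat-completions payload in the OpenAI schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

payload = build_chat_request("SeaLLM-7b-v2", "Translate 'hello' to Thai.")
print(json.dumps(payload, indent=2))

# Send it with any HTTP client, or point the official `openai` package at
# the compatible endpoint (URL below is a placeholder):
#   client = OpenAI(api_key=KEY, base_url="https://<your-endpoint>/v1")
#   client.chat.completions.create(**payload)
```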
Pricing
Model | Input ($ / 1M tokens) | Output ($ / 1M tokens) |
---|---|---|
SeaLLM-7b-v2 | $0.20 | $0.20 |
SQLCoder-7b-2 | $0.60 | $0.20 |
OpenThaiGPT-7b | $0.20 | $0.20 |
OpenThaiGPT-13b | $0.30 | $0.30 |
OpenThaiGPT-70b | $0.90 | $0.90 |
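A worked example of the per-token pricing above. Prices are USD per million tokens, taken from the table; the helper function is ours, for illustration only.

```python
# model: (input $/1M tokens, output $/1M tokens), from the pricing table
PRICES = {
    "SeaLLM-7b-v2": (0.20, 0.20),
    "SQLCoder-7b-2": (0.60, 0.20),
    "OpenThaiGPT-70b": (0.90, 0.90),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for a given token usage."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# 1M input + 1M output tokens on SQLCoder-7b-2 costs about $0.80:
print(estimate_cost("SQLCoder-7b-2", 1_000_000, 1_000_000))
```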
Custom SLA
2. Custom model
Upload your custom model; we will automatically optimize it and provide both an OpenAI-compatible API and a custom API for you.
Features
- 1. Streaming API
- We provide a streaming API that returns tokens as they are generated, so you get results back in real time. This is useful for chatbots and other latency-sensitive applications.
- 2. OpenAI-compatible API
- We provide an API that is compatible with the OpenAI API, so you can reuse your existing prompts. We also provide a custom API for instructing the model to perform specific tasks.
- 3. Auto merge request into one batch
- We automatically merge incoming requests into a single batch. This reduces the number of calls to the model and lowers cost.
- 4. Auto scheduler
- We automatically schedule requests to the model. This prevents the model from being overloaded and crashing.
- 5. Optimization dashboard
- We provide a dashboard showing model performance, which you can use to monitor and optimize your deployment.
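The streaming feature above can be sketched as follows. The chunk format mirrors OpenAI's streamed chat-completion deltas; the simulated generator here stands in for a live connection, since the real stream comes over HTTP.

```python
def fake_stream():
    """Simulated server stream: each chunk carries a text delta."""
    for piece in ["Hel", "lo, ", "world", "!"]:
        yield {"choices": [{"delta": {"content": piece}}]}

def collect(stream) -> str:
    """Consume a streamed response chunk by chunk."""
    text = ""
    for chunk in stream:
        delta = chunk["choices"][0]["delta"].get("content", "")
        # A real app would render `delta` immediately for low latency.
        text += delta
    return text

print(collect(fake_stream()))  # Hello, world!
```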
Optimization
Optimization can reduce the VRAM usage of your model.
Typical optimization can reduce the model size by up to 50%, and some advanced techniques (e.g., AWQ, GPTQ) can reduce it to 25% of the original. Beyond the VRAM savings, optimization also speeds up your model at small batch sizes (under 32), achieving a 2-4x speed increase.
This can let you load a model 2-3 times larger than your VRAM capacity would otherwise allow.
Model size and VRAM requirements:
Model size | Original (VRAM required) | Base optimization (VRAM required) | AWQ / GPTQ (VRAM required) |
---|---|---|---|
7b | 14 GB | 7 GB | 4 GB |
13b | 26 GB | 13 GB | 8 GB |
34b | 68 GB | 34 GB | 18 GB |
70b | 140 GB | 70 GB | 40 GB |
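A back-of-the-envelope check of the table above: VRAM for weights is roughly parameters × bytes per parameter (fp16 = 2 bytes, int8 = 1 byte, 4-bit AWQ/GPTQ ≈ 0.5 byte), ignoring activation and KV-cache overhead. The helper below is ours, for illustration.

```python
def vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough weight-only VRAM estimate in GB.

    1e9 params * bytes_per_param bytes, divided by 1e9 bytes/GB,
    simplifies to params_billion * bytes_per_param.
    """
    return params_billion * bytes_per_param

print(vram_gb(7, 2))    # 14.0 -> matches the "Original" (fp16) column
print(vram_gb(7, 1))    # 7.0  -> matches the base-optimization column
print(vram_gb(7, 0.5))  # 3.5  -> close to the 4 GB AWQ/GPTQ figure
```

The 4-bit column in the table runs slightly above this estimate because quantized formats also store per-group scaling factors.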
Pricing
Basic
GPU | VRAM | Price / hr. |
---|---|---|
A10Gx1 | 24 GB | $1.5 |
A10Gx4 | 96 GB | $6.0 |
Experiment (invite only)
GPU | VRAM | Price / hr. |
---|---|---|
A10Gx1 | 24 GB | $1.0 |
A10Gx4 | 96 GB | $4.0 |
High performance
GPU | VRAM | Price / hr. |
---|---|---|
A100x4 | 320 GB | Contact us |
H100x1 | 80 GB | Contact us |
H100x2 | 160 GB | Contact us |
H100x4 | 320 GB | Contact us |
H100x8 | 640 GB | Contact us |
What about GPU resources?
Don't worry about GPU resources. We will take care of them for you.
Interested?
3. Inference and Optimize Platform
How do we differ from others?
Float16.Cloud offers a platform that lets your internal team achieve performance on par with AI as a Service, with more flexibility: upload fine-tuned models as you wish, with no container configuration required.
Highlights
- 1. Deploy, optimize, and move to production within 1 hour.
- 2. Reduce VRAM footprint by up to 75%.
- 3. Achieve inference speeds up to 20-40 times faster.
- 4. Ready-to-use benchmarking, including evaluation scores and speed.
Pricing
Topic | AI as a Service | Float16.Cloud | Infrastructure as a Service |
---|---|---|---|
Deployment artifact | Weights | Weights | Container |
Model supported | Generative AI | Generative AI and Custom | Unlimited |
Optimizable | No | Yes | Yes |
GPU usage monitoring | No | Yes | No |
Inference speed | Fast | Fast | Depends |
Privacy | Platform | Private (On-premise available) | Platform |
Interested?