Self-Hosted LLMs for Enterprise #4

This is the final part of the series on deploying your own LLM. With all the necessary services and tools set up in the previous parts, we can now download the model and create an API endpoint.

If this is the first part you're reading, you can catch up on the previous parts here:

Part 1

Part 2

Part 3

Let's start Part 4!!

1. Create the Project and Download the Model

# 1. Create a folder for the project
mkdir -p llm-chat-api
cd llm-chat-api
# 2. Download the model (Llama 3.2 1B, Q8_0 quantization)
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF \
  Llama-3.2-1B-Instruct-Q8_0.gguf --local-dir model

The model file will be stored at ./model/Llama-3.2-1B-Instruct-Q8_0.gguf
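
If you'd rather download from Python instead of the CLI, the huggingface_hub library also provides hf_hub_download. A minimal sketch, assuming huggingface_hub is already installed from the earlier parts (the file name download_model.py is arbitrary):

# download_model.py: optional alternative to the huggingface-cli command above
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="bartowski/Llama-3.2-1B-Instruct-GGUF",
    filename="Llama-3.2-1B-Instruct-Q8_0.gguf",
    local_dir="model"   # save into ./model, same location the CLI command uses
)
print(model_path)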

2. Create a Python File to Run the Model

Create a main.py file:

# main.py
from llama_cpp import Llama
llm = Llama(
    model_path="model/Llama-3.2-1B-Instruct-Q8_0.gguf",
    n_gpu_layers=-1,        # -1 = offload all layers to the GPU (runs on CPU if no GPU build)
    verbose=False,
    chat_format="llama-3"   # use the Llama 3 chat template
)
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Introduce yourself"}
    ]
)
print(output["choices"][0]["message"]["content"])

Test run with:

python3 main.py
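
create_chat_completion also accepts the usual generation parameters if you want more control over the output. A quick sketch with illustrative values (not tuned for any particular use case):

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Introduce yourself"}
    ],
    max_tokens=256,     # cap the length of the reply
    temperature=0.7,    # lower values give more deterministic output
    top_p=0.9           # nucleus sampling cutoff
)
print(output["choices"][0]["message"]["content"])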

3. Expose an API with FastAPI

Install FastAPI and Uvicorn (the ASGI server):

pip install fastapi uvicorn pydantic

Update main.py to expose a REST API with a POST endpoint:

# main.py
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()

llm = Llama(
    model_path="model/Llama-3.2-1B-Instruct-Q8_0.gguf",
    n_gpu_layers=-1,        # -1 = offload all layers to the GPU
    verbose=False,
    chat_format="llama-3"
)

class PromptRequest(BaseModel):
    prompt: str

@app.post("/chat")
def chat(req: PromptRequest):
    response = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are an assistant who can help to answer general question"},
            {"role": "user", "content": req.prompt}
        ]
    )
    return {"response": response["choices"][0]["message"]["content"]}

4. Run the API Server

uvicorn main:app --host 0.0.0.0 --port 8000

The API will be available at http://localhost:8000/chat

We can test it via curl:

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Describe a sunset over the ocean."}'

Overall Summary

Across all the parts, we have built a simple self-hosted LLM endpoint for team use, whether for experimentation or as a building block for products. Finally, please keep following the upcoming articles; I promise there will be more interesting content to come.