Self-Hosted LLM for Enterprise #4

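This is the final part of the series on deploying your own LLM. With the necessary services and tools set up in the earlier parts, we can now download a model and create an API endpoint.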

If this is the first part you are reading, you can catch up with the earlier parts here:

Let's get started with Part 4!

1. Create the project and download the model

# 1. Create a folder for the project
mkdir -p llm-chat-api
cd llm-chat-api
# 2. Download the model (Llama 3.2 1B, Q8_0 quantization)
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF \
  Llama-3.2-1B-Instruct-Q8_0.gguf --local-dir model

The model file will be saved at ./model/Llama-3.2-1B-Instruct-Q8_0.gguf
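Before loading the model, it can help to confirm that the download actually produced a GGUF file and not, say, an error page. The sketch below is not part of the original walkthrough: `looks_like_gguf` is a hypothetical helper that only checks that the file exists and starts with the four magic bytes `GGUF`. It does not validate the rest of the file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # GGUF files begin with these four bytes


def looks_like_gguf(path):
    """Return True if the file exists and begins with the GGUF magic bytes."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        return handle.read(4) == GGUF_MAGIC


# Path from step 1; the helper name is an assumption for illustration.
model_file = "model/Llama-3.2-1B-Instruct-Q8_0.gguf"
status = "ok" if looks_like_gguf(model_file) else "missing or not a GGUF file"
print(f"{model_file}: {status}")
```

Run it from inside the llm-chat-api folder so the relative path resolves.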

2. Create a Python file to run the model

Create the file main.py:

# main.py
from llama_cpp import Llama

# Load the GGUF model downloaded in step 1
llm = Llama(
    model_path="model/Llama-3.2-1B-Instruct-Q8_0.gguf",
    n_gpu_layers=-1,        # offload all layers to the GPU
    verbose=False,
    chat_format="llama-3",  # apply the Llama 3 chat template
)

# Send a system prompt and a user message, then print the reply
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce yourself"},
    ]
)
print(output["choices"][0]["message"]["content"])

Test it:

python3 main.py
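The script reads the reply from `output["choices"][0]["message"]["content"]`, which follows the OpenAI-style chat completion layout. The sketch below shows that lookup in isolation, with a guard for an empty `choices` list. The `extract_reply` helper and the `sample_output` dictionary are illustrative assumptions: the sample is hand-written in the same shape and is not real model output.

```python
def extract_reply(output):
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = output.get("choices") or []
    if not choices:
        raise ValueError("completion contains no choices")
    return choices[0]["message"]["content"].strip()


# Hand-written sample in the same layout; not real model output.
sample_output = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": " Hello, I am an assistant. "},
            "finish_reason": "stop",
        }
    ]
}

print(extract_reply(sample_output))
```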

3. Expose an API with FastAPI

Install FastAPI and Uvicorn (the server):

pip install fastapi uvicorn pydantic

Update main.py to serve a REST API with a POST endpoint:

# main.py
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()

# Load the model once at startup so every request reuses it
llm = Llama(
    model_path="model/Llama-3.2-1B-Instruct-Q8_0.gguf",
    n_gpu_layers=-1,        # offload all layers to the GPU
    verbose=False,
    chat_format="llama-3",  # apply the Llama 3 chat template
)

# Request body: {"prompt": "..."}
class PromptRequest(BaseModel):
    prompt: str

@app.post("/chat")
def chat(req: PromptRequest):
    # Wrap the user's prompt in a chat with a fixed system message
    response = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are an assistant who helps answer general questions."},
            {"role": "user", "content": req.prompt},
        ]
    )
    return {"response": response["choices"][0]["message"]["content"]}
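One thing to keep in mind: FastAPI runs plain `def` endpoints in a thread pool, so two requests arriving together could call the single shared `llm` object at the same time. The article does not say whether that is safe, so the sketch below assumes it is not and serializes calls with a lock. `SerializedModel` is a hypothetical wrapper, not something provided by FastAPI or llama-cpp-python.

```python
import threading


class SerializedModel:
    """Wrap a callable so that only one thread runs it at a time."""

    def __init__(self, generate):
        self._generate = generate
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        # Other threads wait here until the current generation finishes.
        with self._lock:
            return self._generate(*args, **kwargs)


# Hypothetical usage inside main.py, where llm is the Llama instance:
#   safe_chat = SerializedModel(llm.create_chat_completion)
#   response = safe_chat(messages=[...])
```

The trade-off is that requests are answered one at a time, which may be acceptable for a small internal endpoint.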

4. Run the API server

uvicorn main:app --host 0.0.0.0 --port 8000

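The API will be available at http://localhost:8000/chat. Because the server binds to 0.0.0.0, other machines on the network can also reach it through the host's IP address.

We can test it with curl: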

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Describe a sunset over the ocean."}'
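For teammates who prefer to call the endpoint from Python, the same request can be sketched with the standard library alone. The function names `build_chat_request` and `ask`, the default base URL, and the 60-second timeout are assumptions for illustration. The request and response shapes match the `/chat` endpoint defined above.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build the POST request for /chat without sending it."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url="http://localhost:8000", timeout=60):
    """Send the prompt and return the "response" field of the JSON reply."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]


# Usage, with the server from step 4 running:
#   print(ask("Describe a sunset over the ocean."))
```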

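Summary

Across all the parts, we have built a simple LLM endpoint that a team can use, whether for testing or as the basis for products. Finally, please stay tuned for the upcoming articles. I promise there will be interesting content worth following.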

