
Self-Hosted LLMs for Enterprise #3

If you're just joining this series, you can catch up on the previous two parts here:

Part 1

Part 2

In the previous two parts, we installed the GPU driver and connected the GPU to Docker. In this part, we'll install the tools needed to download LLM models and run an inference server via an API.

Prerequisites:

  • Python 3.10 or higher
  • The git command line
  • Running inside a Docker container or on a VM/EC2 instance
  • GPU driver and NVIDIA Container Toolkit installed (from Parts 1 and 2)
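
If you want a quick sanity check before going further, here is a minimal sketch in Python (an illustration only, not a required step) that verifies the first two prerequisites:

# Minimal sketch: verify the Python version and that git is on PATH.
import shutil
import sys

assert sys.version_info >= (3, 10), "Python 3.10 or higher is required"
assert shutil.which("git") is not None, "git command line not found"
print("Basic prerequisites look OK")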

1. Install Hugging Face CLI

What is Hugging Face?

Hugging Face is like GitHub for AI models. You can:

  • Download models (like LLaMA, Mistral, Phi-2)
  • Share and find datasets for training models
  • Collaborate easily with the open-source community

Before installing the CLI, we recommend creating a Hugging Face account at https://huggingface.co/join

Install CLI

pip install "huggingface-hub[cli]==0.23.2"

If you get this warning after installation:

WARNING: The script huggingface-cli is installed in '/home/ubuntu/.local/bin' which is not on PATH

Add it to your PATH with this command:

echo 'export PATH=$PATH:/home/ubuntu/.local/bin' >> ~/.bashrc
source ~/.bashrc

Then try running:

huggingface-cli --help

Create Access Token

  • Go to Profile > Settings > Access Tokens
  • Create a new token
  • Specify a token name
  • Set the token type to Read
  • Create the token

Login with token

Since some repositories require authentication before downloading models, I recommend logging in first for convenience:

huggingface-cli login --token <token>
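
Once you're logged in, you can pull model files either with the CLI or with the huggingface_hub Python API installed above. Below is a minimal sketch using the Python API; the repo ID and GGUF filename are only examples, so substitute the model you actually want to run:

# Minimal sketch: download a single GGUF file from the Hugging Face Hub.
# repo_id and filename are example values -- replace them with your own model.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",   # example repository
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",    # example quantized file
    local_dir="./models",                                # where to save it
)
print(model_path)

GGUF is the quantized format that llama-cpp-python (installed in the next step) can load directly.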

2. Install llama-cpp-python

llama-cpp-python is an open-source library for running lightweight LLMs, with CUDA support so inference can run on the GPU.

Set the environment variable that points the build to the CUDA compiler:

export CUDACXX=/usr/local/cuda-12.9/bin/nvcc

Install llama-cpp-python

CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=75" pip install llama-cpp-python==0.3.8

CMAKE_CUDA_ARCHITECTURES=75 corresponds to compute capability 7.5, which is the value for the NVIDIA T4 (other GPUs differ, e.g. the A10G is 8.6 and the V100 is 7.0; check the correct value for your GPU at https://developer.nvidia.com/cuda-gpus). The EC2 g5g instance we use for this demo has an NVIDIA T4 GPU, so the compute capability is 7.5; drop the decimal point to get 75 for the argument.
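
After the build finishes, it's worth a quick smoke test to confirm that inference actually runs on the GPU. The sketch below assumes you downloaded the example GGUF file from step 1 into ./models (adjust the path to your own model); with verbose=True the startup log shows how many layers were offloaded to CUDA:

# Minimal sketch: load a GGUF model with llama-cpp-python and offload all
# layers to the GPU. The model path is the example file downloaded in step 1.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_gpu_layers=-1,   # -1 offloads every layer to the GPU
    n_ctx=2048,        # context window size
    verbose=True,      # startup log reports CUDA offloading
)

output = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["\n"])
print(output["choices"][0]["text"])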

Part 3 Summary

In this part, we prepared by:

  • Installing the Hugging Face CLI to download LLM models from the community
  • Installing llama-cpp-python for GPU inference

With that, all the important tools are in place. In the next part, we'll actually deploy our own LLM and learn how to serve it via an API. Stay tuned!