Self-Hosted LLMs for Enterprise #3
If you're just joining this series, you may want to go back and read the previous two parts first.
In the previous two parts, we installed the GPU driver and connected the GPU to Docker. In this part, we'll install the tools needed to download LLM models and run an inference server with an API.
Prerequisites:
- Python 3.10 or higher
- The git command-line tool
- A Docker container, VM, or EC2 instance
- GPU driver and NVIDIA Container Toolkit installed (covered in the previous parts)
1. Install Hugging Face CLI
What is Hugging Face?
Hugging Face is like GitHub for AI models. You can:
- Download models (like LLaMA, Mistral, Phi-2)
- Share and find datasets for training models
- Collaborate easily with the open-source community
Before installing the CLI, we recommend creating a Hugging Face account at https://huggingface.co/join
Install CLI
pip install "huggingface-hub[cli]==0.23.2"
If you get this warning after installation:
WARNING: The script huggingface-cli is installed in '/home/ubuntu/.local/bin' which is not on PATH
Add the directory to your PATH with these commands:
echo 'export PATH=$PATH:/home/ubuntu/.local/bin' >> ~/.bashrc
source ~/.bashrc
Then try running:
huggingface-cli --help
Create Access Token
- Go to Profile > Settings > Access Tokens
- Create new token
- Specify Token name
- Change Token type to READ
- Create Token
Login with token
Some repositories require authentication before you can download models, so I recommend logging in first for convenience:
huggingface-cli login --token <token>
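Once logged in, you can pull a model straight from the Hub with the download command. A minimal example (the repository and file name below are just illustrations; substitute whichever GGUF model you actually want):
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir ./models
The --local-dir flag saves the file into ./models instead of the default Hub cache, which makes it easier to point the inference library at it later.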
2. Install llama-cpp-python
llama-cpp-python is an open-source library (Python bindings for llama.cpp) for running lightweight LLMs, with CUDA support for GPU acceleration.
Point the build at the CUDA compiler by setting this environment variable:
export CUDACXX=/usr/local/cuda-12.9/bin/nvcc
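To confirm the compiler path is correct before building, run it directly (the CUDA 12.9 path comes from the earlier parts of this series; adjust it to your installed version):
/usr/local/cuda-12.9/bin/nvcc --version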
Install llama-cpp-python
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=75" pip install llama-cpp-python==0.3.8
CMAKE_CUDA_ARCHITECTURES=75 is the compute capability of the T4 GPU with the decimal point removed (7.5 becomes 75). Check the right value for your machine's GPU at https://developer.nvidia.com/cuda-gpus — for example, the V100 is 7.0 and the A10G is 8.6. The EC2 g5g instance we use for this demo has an NVIDIA T4G GPU, which shares the T4's compute capability of 7.5.
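After the install finishes, a quick way to confirm GPU inference works is a minimal sketch like the one below. The model path is a placeholder; point it at whatever GGUF file you downloaded with the Hugging Face CLI:

from llama_cpp import Llama

# Load the model and offload all layers to the GPU (n_gpu_layers=-1).
# model_path is a placeholder; use the GGUF file you downloaded earlier.
llm = Llama(
    model_path="./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload every layer to the GPU
    n_ctx=2048,       # context window size
)

# Run a short completion to verify inference works end to end.
output = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["\n"])
print(output["choices"][0]["text"])

If the build picked up CUDA correctly, the startup log reports layers being offloaded to the GPU; you can also watch nvidia-smi in another terminal while it runs.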
Part 3 Summary
In this part, we prepared by:
- Installing the Hugging Face CLI to download LLM models from the community
- Installing llama-cpp-python for GPU inference
With that, all the important tools are in place. Next time, we'll actually deploy our own LLM and learn how to use it via an API. Stay tuned!