Self-Hosted LLMs for Enterprise #3
If you're just joining this series, you may want to go back and read the previous two parts first.
In the previous two parts, we installed the GPU driver and connected the GPU to Docker. In this part, we'll install the tools needed to download LLM models and run an inference server with an API.
Prerequisites:
- Python 3.10 or higher
- The git command-line tool
- A Docker container, VM, or EC2 instance
- GPU driver and NVIDIA Container Toolkit installed (covered in the previous parts)
1. Install Hugging Face CLI
What is Hugging Face?
Hugging Face is like GitHub for AI models. You can:
- Download models (like LLaMA, Mistral, Phi-2)
- Share and find datasets for training models
- Collaborate easily with the open-source community
Before installing the CLI, we recommend creating a Hugging Face account at https://huggingface.co/join
Install CLI
pip install "huggingface-hub[cli]==0.23.2"
If you get this warning after installation:
WARNING: The script huggingface-cli is installed in '/home/ubuntu/.local/bin' which is not on PATH
Add the directory to your PATH with these commands:
echo 'export PATH=$PATH:/home/ubuntu/.local/bin' >> ~/.bashrc
source ~/.bashrc
Then try running:
huggingface-cli --help
Create Access Token
- Go to Profile > Settings > Access Tokens
- Create new token
- Specify Token name
- Change Token type to READ
- Create Token
Login with token
Some repositories require authentication before you can download models, so I recommend logging in first for convenience:
huggingface-cli login --token <token>
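Once logged in, you can pull a model straight from the Hub with the download command. A minimal example (the repository and file name below are just illustrations; substitute whichever GGUF model you actually want):
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir ./models
The --local-dir flag saves the file into ./models instead of the default Hub cache, which makes it easier to point the inference library at it later.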
2. Install llama-cpp-python
llama-cpp-python is an open-source library (Python bindings for llama.cpp) for running lightweight LLMs, with CUDA support for GPU acceleration.
Point the build at the CUDA compiler by setting this environment variable:
export CUDACXX=/usr/local/cuda-12.9/bin/nvcc
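To confirm the compiler path is correct before building, run it directly (the CUDA 12.9 path comes from the earlier parts of this series; adjust it to your installed version):
/usr/local/cuda-12.9/bin/nvcc --version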
Install llama-cpp-python
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=75" pip install llama-cpp-python==0.3.8
CMAKE_CUDA_ARCHITECTURES=75 is the compute capability of the T4 GPU with the decimal point removed (7.5 becomes 75). Check the right value for your machine's GPU at https://developer.nvidia.com/cuda-gpus — for example, the V100 is 7.0 and the A10G is 8.6. The EC2 g5g instance we use for this demo has an NVIDIA T4G GPU, which shares the T4's compute capability of 7.5.
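After the install finishes, a quick way to confirm GPU inference works is a minimal sketch like the one below. The model path is a placeholder; point it at whatever GGUF file you downloaded with the Hugging Face CLI:

from llama_cpp import Llama

# Load the model and offload all layers to the GPU (n_gpu_layers=-1).
# model_path is a placeholder; use the GGUF file you downloaded earlier.
llm = Llama(
    model_path="./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload every layer to the GPU
    n_ctx=2048,       # context window size
)

# Run a short completion to verify inference works end to end.
output = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["\n"])
print(output["choices"][0]["text"])

If the build picked up CUDA correctly, the startup log reports layers being offloaded to the GPU; you can also watch nvidia-smi in another terminal while it runs.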
Part 3 Summary
In this part, we prepared by:
- Installing the Hugging Face CLI to download LLM models from the community
- Installing llama-cpp-python for GPU inference
With that, all the important tools are in place. Next time, we'll actually deploy our own LLM and learn how to use it via an API. Stay tuned!