Install guide · Intermediate · 20 min
Build and run llama.cpp from source
llama.cpp is the reference C and C++ implementation behind most local LLM tools, including Ollama and LM Studio. Building it from source gives you finer control over quantization, sampling, and which backend accelerator to use. It also tends to be the fastest path on Apple Silicon.
When to reach for llama.cpp directly
Build upstream llama.cpp when you want the latest performance work (recent releases are often weeks ahead of distribution packages), when you need a quantization scheme that downstream wrappers do not expose, or when you want to script a high-throughput inference workflow without an extra daemon. Casual chat does not justify the build step; Ollama exists for that.
Step 1. Install the build toolchain
macOS
xcode-select --install
brew install cmake
Linux (Debian / Ubuntu)
sudo apt update
sudo apt install build-essential cmake git
# For NVIDIA acceleration:
sudo apt install nvidia-cuda-toolkit
Windows
Install Visual Studio Build Tools (with the C++ workload), CMake, and Git. The CUDA toolkit is optional but recommended on NVIDIA GPUs. Building from PowerShell is straightforward once all three are on the PATH.
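Before moving on, it is worth confirming the tools are visible from your shell (on Windows, check cl instead of cc from a Developer PowerShell); the nvcc line applies to CUDA builds only.
cmake --version
cc --version          # cl on Windows, clang on macOS, gcc on most Linux
git --version
nvcc --version        # CUDA builds only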
Step 2. Clone and build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Pick exactly one backend below.
# Apple Silicon (Metal):
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j
# NVIDIA (CUDA):
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# AMD (ROCm); gfx1100 is RDNA3, so replace it with your GPU's
# architecture as reported by rocminfo:
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100
cmake --build build --config Release -j
# Vulkan (cross-vendor GPU, less mature):
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
# CPU-only:
cmake -B build
cmake --build build --config Release -j
The build produces several binaries in build/bin/. The two you will use most are llama-cli (interactive chat) and llama-server (the OpenAI-compatible HTTP server).
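A quick smoke test confirms the binaries are there and that they run; if --version is not recognized in an older checkout, --help works too.
ls build/bin/ | grep llama
./build/bin/llama-cli --version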
Step 3. Choose a GGUF quantization
llama.cpp uses the GGUF format. The quantization suffix you pick is the trade-off between disk footprint, memory use, and quality. Three are worth knowing about as starting points.
Q4_K_M: the most common 4-bit variant. Good quality, small footprint, the sensible default for most desktop use.
Q5_K_M: a noticeable quality bump over Q4 with about 25% more memory. Worth it when you have headroom.
Q8_0: 8-bit quantization. Very close to the original weights in quality, useful for benchmarks or production where size is less of a constraint.
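For a rough size estimate before downloading, multiply the parameter count by the scheme's average bits per weight and divide by eight. The bpw figures below are approximations and vary a little by model architecture:
# size ≈ parameters × bits-per-weight ÷ 8
# 7B × ~4.8 bpw ÷ 8 ≈ 4.2 GB   (Q4_K_M)
# 7B × ~5.7 bpw ÷ 8 ≈ 5.0 GB   (Q5_K_M)
# 7B × ~8.5 bpw ÷ 8 ≈ 7.4 GB   (Q8_0)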
Pre-quantized models live on Hugging Face under accounts such as TheBloke, bartowski, and unsloth. Pick the GGUF file that matches your chosen quantization.
# Example: Qwen 2.5 7B Instruct, Q4_K_M
huggingface-cli download bartowski/Qwen2.5-7B-Instruct-GGUF \
Qwen2.5-7B-Instruct-Q4_K_M.gguf \
--local-dir ./models --local-dir-use-symlinks False
Step 4. First inference
./build/bin/llama-cli \
--model ./models/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
--ctx-size 8192 \
--n-gpu-layers 999 \
--prompt "Explain how PagedAttention reduces KV cache memory."The --n-gpu-layers flag offloads as many layers as fit on the GPU. Setting it to a large number is shorthand for “everything you can.” If you run out of VRAM, llama.cpp will refuse to load and tell you how many layers it managed; lower the number until it fits, or pick a smaller quantization.
Step 5. Serve an OpenAI-compatible API
./build/bin/llama-server \
--model ./models/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
--ctx-size 8192 \
--n-gpu-layers 999 \
--host 0.0.0.0 \
--port 8080 \
--parallel 4 \
--cont-batching
The server listens on http://localhost:8080 with an OpenAI-compatible chat completions endpoint at /v1/chat/completions. --parallel sets how many concurrent requests it handles, and --cont-batching turns on continuous batching for higher throughput when more than one request is in flight.
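A raw curl request verifies the endpoint. The model field is required by the OpenAI schema, but since llama-server hosts a single model the value here is effectively a label:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-7b-instruct",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'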
Step 6. Quantize a model yourself
If you want to convert and quantize a fresh Hugging Face model yourself, the flow is straightforward but multi-step: convert to a high-precision GGUF first, then quantize.
# Convert from a Hugging Face snapshot to FP16 GGUF
# (run from the llama.cpp repo root; pip install -r requirements.txt first)
python convert_hf_to_gguf.py ./snapshots/some-model \
--outfile ./models/some-model.f16.gguf \
--outtype f16
# Quantize to Q4_K_M
./build/bin/llama-quantize \
./models/some-model.f16.gguf \
./models/some-model.Q4_K_M.gguf \
Q4_K_M
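To see how much quality the quantization cost, llama.cpp ships a perplexity tool; lower is better, and running the FP16 and Q4_K_M files over the same text shows the gap. The test file here is a placeholder, so substitute any plain-text corpus.
# Compare perplexity before and after quantization (same input text)
./build/bin/llama-perplexity -m ./models/some-model.f16.gguf -f test.txt
./build/bin/llama-perplexity -m ./models/some-model.Q4_K_M.gguf -f test.txt
Troubleshooting common failures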
CUDA build fails with mismatched compiler
The CUDA toolkit is picky about which host compiler it accepts. On Ubuntu 24.04 with a recent CUDA version, you may need to install g++-12 and point CMake at it explicitly: cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_HOST_COMPILER=g++-12.
Server hangs on first request after restart
Almost always model warm-up. The first generation after a load pages the weights into memory and builds the KV cache before any tokens stream; subsequent requests skip that cost and are fast.
Tokens per second seem low for your GPU
Verify the GPU is actually being used (nvidia-smi on Linux). Check that --n-gpu-layers is high enough to keep the whole model on the GPU, and confirm flash attention is enabled in the server's startup output (pass --flash-attn if it is not).
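A minimal check, assuming an NVIDIA card and that you redirected llama-server output to a file (server.log here is hypothetical); the offload log wording varies between releases, so grep loosely.
# Poll VRAM use and GPU utilization once per second
nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv -l 1
# Find the layer-offload report printed at model load
grep -i "offloaded" server.log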
Where to go next
With llama-server running, you can plug any OpenAI-compatible client into it. Pair it with Open WebUI for a chat interface, or wire it into VS Code through Continue.dev. For multi-GPU or multi-tenant workloads, llama.cpp is not the right tool; reach for vLLM instead.
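As one sketch of the Open WebUI pairing, assuming Docker: the image name and the OPENAI_API_BASE_URL variable come from Open WebUI's documentation (verify there before relying on them), and host.docker.internal lets the container reach llama-server on the host.
# Open WebUI on http://localhost:3000, pointed at llama-server
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8080/v1 \
  ghcr.io/open-webui/open-webui:main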
