Install guide · Beginner · 10 min
Install Ollama and run your first local model
Ollama is the shortest path to a local large language model. It handles downloading, quantization, and GPU offloading, and exposes an OpenAI-compatible API on localhost:11434. This guide takes you from a clean machine to a working setup in roughly ten minutes. Unfamiliar with any of these terms? The glossary defines them in plain English.
What you need before starting
Ollama runs on macOS, Linux, and Windows. Available memory is the main constraint on which models you can realistically load. A short reference: an 8B model at 4-bit quantization needs about 5 GB of memory, a 14B around 9 GB, a 32B around 20 GB, and a 70B around 42 GB. Apple Silicon Macs benefit from unified memory, where the GPU shares the full pool of system RAM; on a PC with a discrete GPU the model needs to fit in VRAM for full speed, otherwise Ollama spills layers to system RAM at a substantial speed cost.
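If you are not sure how much memory you have, one command per platform reports it:
sysctl -n hw.memsize    # macOS: physical memory in bytes
free -h                 # Linux: human-readable summary
systeminfo | findstr /C:"Total Physical Memory"    # Windows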
Step 1. Install Ollama
On macOS and Windows, download the installer from ollama.com/download. The macOS build runs as a menu-bar app; the Windows build adds a system tray icon. Both put the ollama command on your PATH.
On Linux, the official script does the right thing on most distributions:
curl -fsSL https://ollama.com/install.sh | sh
The script registers a systemd service named ollama.service and starts the daemon. Check that everything is wired up:
ollama --version
systemctl status ollama    # Linux only
Step 2. Pull a first model
For a first run, pick a small model so you can see Ollama working before tying up bandwidth on something larger. Llama 3.1 8B or Qwen 3.5 7B are sensible defaults; both run well on 16 GB of memory and finish their downloads in a few minutes on a normal connection.
ollama pull llama3.1:8b
Ollama caches models under ~/.ollama/models on macOS and Linux, and under %USERPROFILE%\.ollama\models on Windows. If your home volume is small, set the OLLAMA_MODELS environment variable before starting the service to point at a larger disk.
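On Linux that means a systemd override, mirroring the pattern used in Step 5 below; on macOS the menu-bar app picks up variables set with launchctl. A sketch, with /data/models standing in for whatever directory you actually want:
# Linux
sudo systemctl edit ollama
# add under [Service]
Environment="OLLAMA_MODELS=/data/models"
sudo systemctl restart ollama
# macOS
launchctl setenv OLLAMA_MODELS /data/models    # then restart the Ollama app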
Step 3. Chat with the model from the terminal
ollama run llama3.1:8b
You will get an interactive prompt. Try a question and watch the tokens stream back. To exit, type /bye. To list local models, run ollama list. To remove one, run ollama rm llama3.1:8b.
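The same command also accepts a one-shot prompt as an argument, which is handy in scripts: it prints a single completion and exits instead of opening the interactive session.
ollama run llama3.1:8b "Explain quantization in one sentence."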
Step 4. Use the OpenAI-compatible API
Ollama exposes an HTTP API on http://localhost:11434. The OpenAI-compatible endpoint lives at /v1, which means most clients written for the OpenAI Python or JavaScript SDKs work with a two-line change. Set the base URL and a placeholder API key:
# Python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain MoE in two sentences."}],
)
print(response.choices[0].message.content)
The same trick works in any tool that lets you override the API base URL: VS Code AI extensions, LangChain, LlamaIndex, Open WebUI, and most desktop chat clients.
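To sanity-check the endpoint without any SDK, a raw HTTP request does the job; this assumes the model from Step 2 is still on disk:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Say hello in five words."}]}'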
Step 5. Tune for your hardware
Two environment variables matter on day one. OLLAMA_NUM_PARALLEL sets how many requests Ollama serves concurrently; the default is fine for personal use, but raise it for a shared developer server. OLLAMA_KEEP_ALIVE controls how long Ollama keeps a model in memory after the last request; with the default of five minutes, a model you use all day gets repeatedly unloaded and reloaded, so set it higher.
# Linux
sudo systemctl edit ollama
# add the following under [Service]
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_KEEP_ALIVE=24h"
sudo systemctl restart ollama
Step 6. Pick a real model for your workload
Once the plumbing works, the right model depends on what you do with it. For everyday writing and chat, Llama 3.1 8B and Qwen 3.5 14B are sensible choices. For coding, try Qwen 2.5 Coder, DeepSeek Coder, or Codestral. For long documents and retrieval, look at Llama 4 Scout if you have the hardware. The model directory covers what each family is actually good at.
Troubleshooting the most common issues
The model loads but answers are extremely slow
This is almost always a sign that the model has spilled out of VRAM into system RAM. Either pick a smaller quantization (the :q4_K_M tag is a good middle ground) or a smaller model. On Apple Silicon, check that Ollama is using the Metal backend; recent versions do this automatically.
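A quick way to confirm the diagnosis: ollama ps reports, for each loaded model, how it is split between CPU and GPU.
ollama ps    # the PROCESSOR column should read 100% GPU; a CPU/GPU split means spillover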
Ollama cannot find your GPU
On Linux with NVIDIA cards, install a current driver and the CUDA toolkit before installing Ollama. On Windows, recent NVIDIA drivers include CUDA support out of the box. AMD ROCm support is present but less smooth; the project tracks compatibility on its GitHub.
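Two checks usually narrow this down on Linux, assuming the systemd install from Step 1:
nvidia-smi                 # does the driver see the card at all?
journalctl -u ollama -e    # Ollama logs the compute backend it detected at startup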
The API returns 404 on /v1/chat/completions
You are probably on an older Ollama version. The OpenAI-compatible layer arrived in 2024 and has been stable since. Update with brew upgrade ollama on macOS, the installer on Windows, or the install script on Linux.
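The update commands, depending on how you installed:
brew upgrade ollama                              # macOS, if you installed via Homebrew
curl -fsSL https://ollama.com/install.sh | sh    # Linux: rerunning the script upgrades in place
ollama --version                                 # confirm the new version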
Where to go next
With Ollama running, there are two natural next steps. Add Open WebUI for a chat interface that other people on your network can use, or wire Ollama into a coding workflow with Continue.dev in VS Code. For maximum performance on a single machine, the llama.cpp guide walks through a from-source build with hand-tuned quantization.
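If you go the Open WebUI route, the project documents a single Docker command for the common case where Ollama runs on the host; check its README for the current image tag before copying this:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main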
