Although the expectations that came with the advent of Large Language Models (LLMs) may be largely exaggerated, they have still proven useful in many scenarios and for a very wide audience. This is probably the closest non-technical users have come to interacting with AI since its inception. ChatGPT has set the record for the fastest-growing user base.
Being an online tool, it comes with some concerns over privacy, customizability, and the constant need for an internet connection.
However, ChatGPT is not the only player in this game. Many models are available both online and for offline download; the latter is the focus of this post.
Recently, Meta released its latest model, Llama 3.1, and after hearing positive feedback about it, I wanted to give it a try on my laptop.
So, my objectives were:
- Find an easy way to set up the tooling required to download LLMs and get responses to my prompts locally.
- Have a nice user interface that I can use to give prompts, save prompt history, and customize my environment.
Two tools play very well together:
Ollama: https://ollama.com/
Think of Ollama as the npm or pip for language models. It enables you to download models, execute prompts from the CLI, list installed models, and so on. It also provides APIs that can be called from other applications, and these APIs are compatible with the OpenAI Chat Completions API (see the example call right after this list).
Open WebUI: https://openwebui.com/
A self-hosted web interface, very similar to what you get from OpenAI's ChatGPT web interface. Open WebUI can connect to Ollama; think of it as a front end for the backend that Ollama provides.
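For example, because Ollama's API follows the OpenAI Chat Completions format, a prompt can be sent with a plain curl call. This is only a sketch: it assumes Ollama is reachable on its default port 11434 (true for a native install; in the Docker setup described below, that port isn't published to the host unless you add a mapping) and that a model such as llama3.1:8b has already been downloaded:
# Send a chat prompt to Ollama's OpenAI-compatible endpoint (default port 11434)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}]
      }'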
Since I prefer to use Docker whenever possible to experiment with new tools, I opted to use it instead of installing anything locally, especially since I'm almost a complete beginner in this space.
Open WebUI provides Docker images that bundle both Open WebUI and Ollama in the same image, which makes setting up the whole stack locally super easy.
The documentation provides this example command to run the container:
docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
This command does the following:
- Starts the container from the image ghcr.io/open-webui/open-webui:ollama. The --restart always parameter makes Docker restart the container automatically if it stops, for example after a crash or a reboot.
- Maps the local port 3000 to Open WebUI port 8080.
- Maps Docker volumes to paths within the container, so data persists even if the container is deleted. This is important for chat history and for downloaded models, which are large.
I prefer to use Docker Compose. Additionally, I wanted easy visibility into the data created by Open WebUI and the models downloaded by Ollama, so I chose to bind-mount folders on my local machine instead of using named volumes. Here is what the docker-compose.yml looks like:
services:
  OpenWebUI:
    image: ghcr.io/open-webui/open-webui:ollama
    container_name: open-webui
    environment:
      # Disables the Open WebUI login screen; convenient for a local single-user setup
      - WEBUI_AUTH=False
    volumes:
      - C:\open-webui:/app/backend/data
      - C:\ollama:/root/.ollama
    ports:
      - 3000:8080
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
Starting the stack is easy:
docker compose up -d
It may take a short while before it's ready. If you check the container logs (I use Docker Desktop on Windows), you should see something similar to:
Then you can open your browser at http://localhost:3000/ and start playing.
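If you'd rather confirm from the terminal that the interface is up before opening the browser, a quick curl against the mapped port should return an HTTP 200 (just an optional check):
# Print only the HTTP status code returned by Open WebUI
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000/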
Downloading models
Remember that this Docker image does not include any models yet, so the first step is to click the plus icon beside the "Select a model" label and type a model name. You'll find the list of available models at https://ollama.com/library. Click a model there, choose a size, and the full model name to use will be shown. For example, to download the 8b (8 billion parameters) version of llama3.1, type llama3.1:8b in the Open WebUI interface.
As shown in the screenshot below, I downloaded llama3.1:8b and gemma2:2b (Google's lightweight model). Note that the larger the number of parameters, the higher the hardware requirements.
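If you prefer the terminal, the same download can also be triggered through the bundled Ollama CLI. A minimal sketch, assuming the compose service name OpenWebUI from the file above:
# Pull a model inside the running container; it then shows up in Open WebUI
docker compose exec OpenWebUI ollama pull gemma2:2b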
Testing with some prompts
After downloading models, you can try some prompts:
Note that you can interact with the ollama CLI directly: either run docker compose exec OpenWebUI bash or use the terminal in Docker Desktop:
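For example, listing the downloaded models and sending a one-off prompt from the CLI looks like this (again assuming the OpenWebUI service name from the compose file):
# List the models Ollama has downloaded
docker compose exec OpenWebUI ollama list
# Run a single prompt against a downloaded model
docker compose exec OpenWebUI ollama run gemma2:2b "Explain what a Docker volume is in one sentence."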
Now, if you want to stop the container, run docker compose down.
A note on GPUs
Ollama can run models on the GPU or on the CPU only. As you can see in the Docker Compose file, I'm specifying that the container can use all the available GPUs on my machine. If you're using Docker Desktop with WSL support, ensure that you have the latest WSL and NVIDIA drivers installed. You can use this command to test that Docker's GPU access is working fine:
docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
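If the NVIDIA container toolkit is set up correctly, nvidia-smi should also work from inside the running container, which is a quick way to confirm that Ollama can see the GPU (an extra check I use, not something from the official docs):
# Show the GPUs visible inside the Open WebUI / Ollama container
docker compose exec OpenWebUI nvidia-smi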
Closing notes:
It's very exciting to have an LLM running locally. It opens a lot of customization possibilities and keeps your data private.
The machine I tried this experiment on is relatively old, so gemma2:2b was much faster than llama3.1:8b and still performed very well.
Looking forward to experimenting more!
Note: The title of this post was recommended by gemma2 :)