written by Eric J. Ma on 2024-02-21 | tags: gpu deep learning ollama llm tailscale linux ubuntu gpu llamabot
At home, I have a relatively idle GPU tower that I bought back in 2016 to do deep learning. It has an NVIDIA GTX 1080 GPU with 8GB of memory, which is puny by today's standards. Over the years, I've used it less and less for GPU-heavy work, mostly for lack of time. But I recently found a way to give it a new lease of life: running an Ollama server on my home's private network! In this blog post, I want to share how I made that happen.
I have all my personal devices (my M1 MacBook Air, phone, tablet, a DigitalOcean server running Dokku, NAS, and my home GPU box) running on a Tailscale virtual private network. Since my home GPU box is running Ubuntu Linux, I used the official Tailscale Linux installation instructions to get Tailscale installed on my GPU box, ensuring that it was on the same VPN as my MacBook.
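For reference, the official Linux instructions boil down to a couple of commands. Here's a rough sketch of what that looked like on the Ubuntu box (the last command isn't part of the install itself, but it's a handy way to find the box's Tailscale IP address, which we'll need later):

```bash
# Install Tailscale using the official convenience script
curl -fsSL https://tailscale.com/install.sh | sh

# Bring the machine onto the tailnet (opens a browser-based login)
sudo tailscale up

# Print this machine's Tailscale IPv4 address for later use
tailscale ip -4
```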
Once that was done, I installed Ollama on my GPU box. While SSH-ed into the GPU server, I ran the command from the Ollama Linux installation page:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
To verify that Ollama was installed correctly, I ran the following command on the GPU box:
```bash
ollama run mistral
```
This pulled the Mistral model and dropped me into an interactive chat session, confirming that the install worked.
By default, the Ollama web server runs on 127.0.0.1:11434, which doesn't allow for inbound connections from other computers. To change that behaviour, we must set the OLLAMA_HOST environment variable to 0.0.0.0. I followed the instructions in Ollama's documentation. To start, we edit the systemd service:
```bash
systemctl edit ollama.service
```
Then, we add the following to the override file that gets opened:
```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
```
Finally, after saving and exiting the file, we reload systemd and restart Ollama:
```bash
systemctl daemon-reload
systemctl restart ollama
```
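As an optional sanity check (not part of Ollama's documented steps), you can confirm on the GPU box itself that the new setting took effect and that the server is listening on all interfaces rather than just localhost:

```bash
# Confirm the override is visible to systemd
systemctl show ollama | grep OLLAMA_HOST

# Check that port 11434 is bound to 0.0.0.0 rather than 127.0.0.1
ss -ltnp | grep 11434
```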
Now, Ollama will be listening on host 0.0.0.0. To verify that it was running correctly, I went back to my laptop and ran the following curl command:
```bash
curl http://<my-gpu-box-ip-address-here>:11434/api/chat -d '{
  "model": "mistral",
  "messages": [
    { "role": "user", "content": "hey there, how are you doing?" }
  ]
}'
```
I got back a long stream of JSON objects:
{"model":"mistral","created_at":"2024-02-21T01:53:12.747357134Z","message":{"role":"assistant","content":" Hello"},"done":false} {"model":"mistral","created_at":"2024-02-21T01:53:12.769246194Z","message":{"role":"assistant","content":"!"},"done":false} ... {"model":"mistral","created_at":"2024-02-21T01:53:14.054314656Z","message":{"role":"assistant","content":""},"done":true,"total_duration":2734292991,"load_duration":1320868996,"prompt_eval_count":17,"prompt_eval_duration":106030000,"eval_count":61,"eval_duration":1306913000}
I thus verified that I could connect to the Ollama server running on my GPU box!
From Ollama's behaviour, I knew that the mistral model would stay loaded in GPU memory for a little while before being unloaded. To verify that it was indeed using the GPU, I ran:
```bash
nvidia-smi
```
Which gave me:
```
ericmjl in 🌐 ubuntu-gpu in ~
❯ nvidia-smi
Wed Feb 21 05:41:50 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 27%   31C    P2    50W / 180W |   4527MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1453      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A      2282      G   /usr/bin/gnome-shell                2MiB |
|    0   N/A  N/A   3192354      C   /usr/local/bin/ollama            4502MiB |
+-----------------------------------------------------------------------------+
```
Perfect!
Taking it one step further, I decided to connect to my Ollama server using llamabot's SimpleBot class. In principle, this should be easy because SimpleBot passes additional keyword arguments through to LiteLLM, which meant I should be able to do the following:
```python
from llamabot import SimpleBot

system_prompt = "You are a funny bot!"

bot = SimpleBot(
    model_name="ollama/mistral",  # Specifying Ollama via the model_name argument is necessary when pointing to an Ollama server!
    system_prompt=system_prompt,
    stream_target="stdout",  # this is the default!
    api_base="http://<my-gpu-box-ip-address-here>:11434",
)
response = bot("Hello!")
```
And indeed, it works! I get back my usual mistral bot response:
Why, thank you! I'm here to make your day brighter with my witty and humorous remarks. So, tell me, why did the tomato turn red? Because it saw the salad dressing! Get it? *laughs manically* But seriously, how about we discuss something more important, like pizza or memes?
I can even easily swap out models (as long as they've been downloaded to my machine):
```python
bot = SimpleBot(
    model_name="ollama/llama2:13b",  # Specifying Ollama via the model_name argument is necessary when pointing to an Ollama server!
    system_prompt=system_prompt,
    stream_target="stdout",  # this is the default!
    api_base="http://<my-gpu-box-ip-address-here>:11434",
)
response = bot("Hello!")
```
This gives me:
WOOHOO! *party popper* OH MY GOSH, IT'S SO GLORIOUS TO BE A FUNNY BOT! *confetti* HELLO THERE, MY DEAR HUMAN FRIEND! *sunglasses* I'M READY TO BRING THE LAUGHS AND MAKE YOUR DAY A LITTLE BIT BRIGHTER! 😄❤️ WHAT CAN I DO FOR YOU, MY HUMAN PAL?
(Llama2 appears to have a goofier personality.)
One limitation (?) that I see right now is that Ollama needs to have downloaded a model before it can be used from SimpleBot. As an example, I don't have the Microsoft Phi2 model downloaded on my machine:
```
ericmjl in 🌐 ubuntu-gpu in ~
❯ ollama list
NAME                     ID              SIZE      MODIFIED
llama2:13b               d475bf4c50bc    7.4 GB    8 hours ago
mistral:7b-text-q5_1     05b86a2ea9de    5.4 GB    8 hours ago
mistral:latest           61e88e884507    4.1 GB    44 hours ago
```
Thus, when running SimpleBot using Phi:
```python
bot = SimpleBot(
    model_name="ollama/phi",  # phi is not on my GPU box!
    system_prompt=system_prompt,
    stream_target="stdout",  # this is the default!
    api_base="http://<my-gpu-box-ip-address-here>:11434",
)
response = bot("Hello!")
```
I get the following error:
{ "name": "ResponseNotRead", "message": "Attempted to access streaming response content, without having called `read()`.", "stack": "--------------------------------------------------------------------------- ResponseNotRead Traceback (most recent call last) Cell In[15], line 10 1 system_prompt = \"You are a funny bot!\" 3 bot = SimpleBot( 4 model_name=\"ollama/phi\", # Specifying Ollama via the model_name argument is necessary when pointing to an Ollama server! 5 system_prompt=system_prompt, 6 stream_target=\"stdout\", # this is the default! 7 api_base=f\"http://{os.getenv('OLLAMA_SERVER')}:11434\", 8 ) ---> 10 response = bot(\"Hello!\") ... File ~/anaconda/envs/llamabot/lib/python3.11/site-packages/httpx/_models.py:567, in Response.content(self) 564 @property 565 def content(self) -> bytes: 566 if not hasattr(self, \"_content\"): --> 567 raise ResponseNotRead() 568 return self._content ResponseNotRead: Attempted to access streaming response content, without having called `read()`." }
The way I solved this was by SSH-ing into my GPU box and running:
```bash
ollama pull phi
```
You can think of the Ollama server as a curated, local library of models.
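As a side note, you don't strictly need to SSH in for this: Ollama's HTTP API also exposes endpoints for listing and pulling models, so something like the following (using the same placeholder address as above) should work from the laptop too:

```bash
# List the models currently available on the remote Ollama server
curl http://<my-gpu-box-ip-address-here>:11434/api/tags

# Ask the server to pull a model it doesn't yet have
curl http://<my-gpu-box-ip-address-here>:11434/api/pull -d '{"name": "phi"}'
```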
Because of Ollama, running an LLM server on my home private network was much easier than I initially imagined. LlamaBot, with LiteLLM under the hood, let me build bots that talk to that Ollama server. This turned out to be a great way to extend the usable life of my GPU box!
```bibtex
@article{ericmjl-2024-llamabot-network,
    author = {Eric J. Ma},
    title = {LlamaBot with Ollama on my home virtual private network},
    year = {2024},
    month = {02},
    day = {21},
    howpublished = {\url{https://ericmjl.github.io}},
    journal = {Eric J. Ma's Blog},
    url = {https://ericmjl.github.io/blog/2024/2/21/llamabot-with-ollama-on-my-home-virtual-private-network},
}
```