Because of the sheer size of Large Language Models and the compute required to run inference on them, it is almost impossible to run such a model in its plain form on a personal computer. I have read many discussions, efforts, and tutorials on how to run a large language model on modest hardware, such as a MacBook Pro M2 or the kind of personal computer we normally use for gaming.
Some people mentioned that they run language models using Ollama. When Llama2 was released, its results and capabilities were stunning, and it came in several sizes: 7B, 13B, and 70B parameters. But I hadn't had the opportunity to try it on my own machine. I usually rely on the Hugging Face Hub, but that wasn't practical here because of the size of the model weights. So I went back to scrolling the LocalLLaMA subreddit, found a couple of users running the model on low-spec machines with Ollama, and today I finally have time to try it out.
Let’s get into it.
Note that I run Ollama with Llama2 7B quantized to 4-bit on Windows 11 (WSL 2, Ubuntu 20.04), with an Intel i7, 32GB RAM, and a GTX 1660 Super.
Installation
Visit the Ollama download page.
Here, I download Ollama for Linux, since I'm running it under WSL 2:
curl -fsSL https://ollama.com/install.sh | sh
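To confirm the installation worked, you can check the installed version (a quick sanity check; the version number will differ on your machine):
ollama --version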
Up and Running
You need to open two terminal windows.
In the first window, start the server:
ollama serve
In the second window, run the model using llama2:
ollama run llama2
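The first time you run this, Ollama should download the llama2 weights automatically. If you prefer to fetch them ahead of time, you can pull the model explicitly:
ollama pull llama2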
Then, you can write prompts as you usually do in ChatGPT or any other chatbot platform.
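If you just want a one-off answer instead of an interactive session, Ollama should also accept the prompt directly as a command-line argument, for example:
ollama run llama2 "Why is the sky blue?"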
Endpoint
While Ollama is running, you can also access the model through its HTTP API endpoint.
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt":"Why is the sky blue?"
}'
And it will stream back a series of JSON responses, one token chunk per line:
{"model":"llama2","created_at":"2024-03-10T14:45:25.213304356Z","response":"\n","done":false}
{"model":"llama2","created_at":"2024-03-10T14:45:25.265805241Z","response":"B","done":false}
{"model":"llama2","created_at":"2024-03-10T14:45:25.315169474Z","response":"ird","done":false}
{"model":"llama2","created_at":"2024-03-10T14:45:25.362810377Z","response":"s","done":false}
{"model":"llama2","created_at":"2024-03-10T14:45:25.411185593Z","response":" fly","done":false}
{"model":"llama2","created_at":"2024-03-10T14:45:25.460384223Z","response":" by","done":false}
{"model":"llama2","created_at":"2024-03-10T14:45:25.50764932Z","response":" using","done":false}
{"model":"llama2","created_at":"2024-03-10T14:45:25.555239322Z","response":" their","done":false}
{"model":"llama2","created_at":"2024-03-10T14:45:25.603178331Z","response":" wings","done":false}
{"model":"llama2","created_at":"2024-03-10T14:45:25.655208408Z","response":" to","done":false}
{"model":"llama2","created_at":"2024-03-10T14:45:25.707275287Z","response":" generate","done":false}
{"model":"llama2","created_at":"2024-03-10T14:45:25.757585435Z","response":" lift","done":false}
{"model":"llama2","created_at":"2024-03-10T14:45:25.805507943Z","response":",","done":false}
{"model":"llama2","created_at":"2024-03-10T14:45:25.85393056Z","response":" which","done":false}
{"model":"llama2","created_at":"2024-03-10T14:45:25.902690282Z","response":" is","done":false}
{"model":"llama2","created_at":"2024-03-10T14:45:25.951120599Z","response":" the","done":false}
...
Since the output is streamed token by token, you can flush it incrementally to a web browser, the way responses typically appear on chat platforms.
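If you want to consume the stream programmatically instead of through curl, here is a minimal Python sketch that reads the newline-delimited JSON chunks and reassembles the full answer. It assumes the requests package is installed and that Ollama is serving on the default port 11434:

import json
import requests  # assumes the requests package is installed (pip install requests)

url = "http://localhost:11434/api/generate"  # default Ollama endpoint
payload = {"model": "llama2", "prompt": "Why is the sky blue?"}

tokens = []
with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # each streamed line is a JSON object; "response" holds the next token(s)
        tokens.append(chunk.get("response", ""))
        if chunk.get("done"):
            break

print("".join(tokens))

The API also supports disabling streaming by adding "stream": false to the request body, in which case you get back a single JSON object containing the full response.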