A key ingredient in running LLMs locally (read: without a high-end GPU, let alone several) is quantization. But what do you do when even the 4-bit quantized model is still too big for your machine? That's what happened to me when I tried to run Mixtral-8x7B with Ollama (check out this previous blog post on what Ollama is). The model requires 26GB of RAM, while my laptop only has 16GB. I'll walk through the workaround a bit at a time (pun intended).
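To get a rough sense of where that 26GB comes from: Mixtral-8x7B has around 47B parameters in total (the experts share the attention layers, so it's less than 8 × 7B), and at 4 bits per weight that is already over 23GB before accounting for quantization overhead. Here's a back-of-the-envelope sketch; the parameter count and the ~10% overhead factor are rough assumptions on my part, not exact figures from Ollama:

```python
# Rough memory estimate for a 4-bit quantized Mixtral-8x7B.
# The parameter count (~46.7B) and the ~10% overhead for quantization
# scales/metadata are approximations, not exact figures.

total_params = 46.7e9    # Mixtral-8x7B total parameters (experts share attention)
bits_per_weight = 4      # 4-bit quantization
overhead = 1.10          # scales, non-quantized layers, runtime buffers, etc.

weights_gb = total_params * bits_per_weight / 8 / 1e9
estimated_gb = weights_gb * overhead

print(f"Raw 4-bit weights: ~{weights_gb:.1f} GB")
print(f"With overhead:     ~{estimated_gb:.1f} GB")  # ~26 GB, well over a 16GB laptop
```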
Key questions I'll address are:
- What is quantization?
- What is offloading?
- How can you run Mixtral-8x7B for free?