A key ingredient in running LLMs locally (read: without multiple high-end GPUs) is quantization. But what do you do when the 4-bit quantized model is still too big for your machine? That's what happened to me when I was trying to run Mixtral-8x7B with Ollama (check out this previous blog post on what Ollama is). The model requires 26GB of RAM, while my laptop only has 16GB. I'll try to walk through the workaround a bit at a time (pun intended).
- What is quantization?
- What is offloading?
- How to run Mixtral-8x7B for free?
What is quantization?
Quantization, in general, is a process that converts continuous values to a discrete set of values. A tangible analogy is how we tell time: time is continuous, and we use hours, minutes, and seconds to "quantize" it. Sometimes "around 10am" is good enough, and sometimes we want to be precise to the millisecond. In the context of deep learning, it is a technique to reduce the computational and memory costs of running a model by using lower-precision numerical types to represent its weights and activations. In simpler terms, we are being less precise with the numbers that make up the model weights so that they take up less memory and operations run faster. This StackOverflow blog has great visualizations of how quantization works. Without repeating the main content, the key takeaway for me was this image: the fewer bits you use per pixel, the less memory is needed, but the image quality may also decrease.
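To make that concrete, here is a minimal sketch in plain NumPy (not any particular library's implementation) of rounding 32-bit weights to 8-bit integers with a single scale and zero point, then mapping them back:

```python
# Minimal sketch of affine quantization: float32 -> int8 -> float32.
# The values here are random stand-ins for real model weights.
import numpy as np

weights = np.random.randn(8).astype(np.float32)  # pretend these are model weights

# Map the observed float range onto the int8 range [-128, 127]
scale = (weights.max() - weights.min()) / 255.0
zero_point = np.round(-128 - weights.min() / scale)

quantized = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
dequantized = (quantized.astype(np.float32) - zero_point) * scale

print(weights)      # original 32-bit values
print(quantized)    # 8-bit representation (a quarter of the memory)
print(dequantized)  # close to the original, but not exact
```

The int8 copy uses a quarter of the memory; the round trip shows the precision you give up in exchange.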
Weights are normally stored as 32-bit floating-point numbers, and are often quantized to float16 or int8. Ollama uses 4-bit quantization, which means each value is stored with 4 bits instead of 32. Theoretically, that is an 8x saving in memory. Quantization can be broadly grouped into two main methods:
- Post Training Quantization (PTQ): Done after a model is trained. A "calibration dataset" is used to capture the distribution of activations and compute the quantization parameters (scale, zero point) used for all inputs. No re-training is needed (see the calibration sketch after this list).
- Quantization Aware Training (QAT): Models are quantized during re-training/finetuning, where low-precision behaviour is simulated in the forward pass (the backward pass remains the same). QAT often preserves accuracy better than PTQ, but the re-training cost can be prohibitive for LLMs.
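To illustrate the PTQ side, here is a hedged sketch of the calibration step: run a handful of batches through the model, record the range of an activation, and derive the scale and zero point from it. The `calibrate` helper and the random "activations" are stand-ins for illustration, not a real quantization toolkit:

```python
# Sketch of PTQ calibration: derive int8 scale/zero-point from a small
# calibration set instead of re-training the model.
import numpy as np

def calibrate(activation_batches):
    """Collect min/max over calibration data to derive int8 quantization params."""
    lo = min(batch.min() for batch in activation_batches)
    hi = max(batch.max() for batch in activation_batches)
    scale = (hi - lo) / 255.0
    zero_point = int(round(-128 - lo / scale))
    return scale, zero_point

# Pretend these are activations captured from a few forward passes
calibration_batches = [np.random.randn(32, 64) for _ in range(10)]
scale, zero_point = calibrate(calibration_batches)
print(f"scale={scale:.4f}, zero_point={zero_point}")
```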
When it comes to quantizing LLM weights, methods like NF4, GPTQ, AWQ, and GGML/GGUF are among the most popular. A key observation is that not all weights are equally important (as pointed out in the AWQ paper). Keeping higher precision for the more critical layers/weights turns out to be key to balancing accuracy and resource usage. In particular, the Mixtral model from Ollama seems to be using GGML, which quantizes weights in blocks, each with its own scale, rather than using a single global parameter.
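Below is a rough sketch of that block-wise idea, assuming 4-bit signed values and blocks of 32 weights. The real GGML/GGUF formats pack the bits and come in several variants, so treat this only as an illustration of per-block scales:

```python
# Sketch of block-wise 4-bit quantization: one scale per block of 32 weights,
# so an outlier in one block does not distort the scale of the others.
import numpy as np

def quantize_q4_blocks(weights, block_size=32):
    # Assumes the weight count is divisible by block_size, for simplicity
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0 + 1e-12  # 4-bit signed: [-8, 7]
    quantized = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return quantized, scales

def dequantize_q4_blocks(quantized, scales):
    return (quantized.astype(np.float32) * scales).reshape(-1)

weights = np.random.randn(128).astype(np.float32)
q, s = quantize_q4_blocks(weights)
approx = dequantize_q4_blocks(q, s)
print(np.abs(weights - approx).max())  # small per-block rounding error
```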
A recent paper (3 weeks old!) that focuses on running Mixture-of-Experts models on consumer hardware seems to have the answer to my misfortunes. In addition to quantization, they use one more trick: offloading.
What is offloading?
Offloading means putting some parameters in a separate, cheaper memory, such as system RAM, and only loading them "just-in-time" when they are needed for computation. It is well suited to inference and training of LLMs with limited GPU memory. In the context of Mixtral, the MoE architecture contains multiple "experts" (layers) and a "gating function" that selects which experts are used for a given input. The MoE block therefore only uses a small portion of all "experts" for any single forward pass. Each expert is offloaded separately and only brought back to the GPU when needed.
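As a rough illustration (not Mixtral's or the paper's actual code, which also adds things like an expert cache), here is what just-in-time expert loading could look like in PyTorch, with toy linear-layer "experts":

```python
# Sketch of expert offloading for an MoE block: experts live in system RAM
# and only the ones the gate selects are moved to the GPU for the forward pass.
import torch
import torch.nn as nn

class OffloadedMoE(nn.Module):
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # tiny, can stay on the GPU
        # Experts are created on the CPU and stay there between uses
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, dim)
        # Gating decides which experts each token actually needs
        probs, selected = self.gate(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for idx in selected.unique().tolist():
            expert = self.experts[idx].to(x.device)      # load just-in-time
            token_ids, slot = (selected == idx).nonzero(as_tuple=True)
            out[token_ids] += probs[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
            self.experts[idx].to("cpu")                  # free GPU memory again
        return out
```

With a GPU present, you would keep only the gate (and the currently loaded expert) in GPU memory, e.g. `moe.gate.to("cuda")`, while the `ModuleList` of experts stays in system RAM.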
How to run Mixtral-8x7B for free?
So here we are. This is by no means a production-ready way to run Mixtral. Here is the colab notebook by the authors of the paper mentioned above. Even though they were targeting the hardware specs of Google Colab's free-tier instances, I might just be able to run Mixtral on my laptop after all.