
3 posts tagged with "AI"


Isaac Chung · 3 min read

How do you measure the performance of LLM serving systems? Production services are often evaluated with metrics like Requests Per Second (RPS), uptime, and latency. In computer vision, Frames Per Second (FPS) is often the main throughput metric for near-real-time detection and tracking. Does serving LLMs have something similar? Certainly. A recent conversation with my team (after they read my Ollama blog) got me thinking about additional metrics we could be tracking (a minimal example of pulling such metrics out of Ollama is sketched after the list below).

Key questions I'll address are:
  • What metrics does Ollama provide?
  • Why can't I just use tokens per second?
  • What other LLM serving metrics should I consider?
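As a first look at the kind of numbers involved, here is a minimal sketch (not code from the post) that derives tokens-per-second figures from the timing fields Ollama returns with a non-streaming generate call. It assumes Ollama is running locally on its default port (11434) and that a model such as `mistral` has already been pulled; field names follow Ollama's generate API.

```python
import requests

# Query a local Ollama server and derive throughput from the timing
# fields returned with the final (non-streaming) response.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Why is the sky blue?", "stream": False},
).json()

# Durations are reported in nanoseconds.
prompt_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
gen_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)

print(f"Prompt processing: {prompt_tps:.1f} tokens/s")
print(f"Generation:        {gen_tps:.1f} tokens/s")
print(f"End-to-end:        {resp['total_duration'] / 1e9:.2f} s total")
```

Raw throughput like this is only a starting point; the post looks at why tokens per second alone doesn't tell the whole story.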

Isaac Chung · 3 min read

A key ingredient for running LLMs locally (read: without multiple high-end GPUs) is quantization. But what do you do when even the 4-bit quantized model is still too big for your machine? That's what happened to me when I tried to run Mixtral-8x7B with Ollama (check out this previous blog post on what Ollama is). The model requires 26GB of RAM, while my laptop only has 16GB; a rough size estimate is sketched after the list below. I'll walk through the workaround a bit at a time (pun intended).

Key questions I'll address are:
  • What is quantization?
  • What is offloading?
  • How can I run Mixtral-8x7B for free?
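For context on why even the 4-bit build doesn't fit, here is a back-of-the-envelope estimate. The ~46.7B total parameter count and the ~10% overhead factor are my assumptions for illustration, not figures from the post.

```python
# Rough memory estimate for a quantized Mixtral-8x7B.
# Assumption: ~46.7B total parameters (the experts share attention
# layers, so the total is less than 8 x 7B).
TOTAL_PARAMS = 46.7e9
OVERHEAD = 1.1  # assumed ~10% for KV cache, activations, and metadata

def approx_gb(bits_per_weight: float) -> float:
    """Approximate resident size in GB for a given weight precision."""
    return TOTAL_PARAMS * bits_per_weight / 8 * OVERHEAD / 1e9

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{approx_gb(bits):.0f} GB")
# 4-bit still comes out around 26 GB -- more than a 16GB laptop has,
# which is where offloading comes in.
```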