
3 posts tagged with "AI"


Isaac Chung · 3 min read

How do you measure the performance of LLM serving systems? Production services are often evaluated with metrics like Requests Per Second (RPS), uptime, and latency. In computer vision, Frames Per Second (FPS) is often the main throughput metric for near-real-time detection and tracking. Does serving LLMs have something similar? Certainly. A recent conversation with my team (after they read my Ollama blog) got me thinking about additional metrics we could be tracking (a minimal example of pulling such metrics out of Ollama is sketched after the list below).

Key questions I'll address are:
  • What metrics does Ollama provide?
  • Why can't I just use tokens per second?
  • What other LLM serving metrics should I consider?
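As a first look at the kind of numbers involved, here is a minimal sketch (not code from the post) that derives tokens-per-second figures from the timing fields Ollama returns with a non-streaming generate call. It assumes Ollama is running locally on its default port (11434) and that a model such as `mistral` has already been pulled; field names follow Ollama's generate API.

```python
import requests

# Query a local Ollama server and derive throughput from the timing
# fields returned with the final (non-streaming) response.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Why is the sky blue?", "stream": False},
).json()

# Durations are reported in nanoseconds.
prompt_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
gen_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)

print(f"Prompt processing: {prompt_tps:.1f} tokens/s")
print(f"Generation:        {gen_tps:.1f} tokens/s")
print(f"End-to-end:        {resp['total_duration'] / 1e9:.2f} s total")
```

Raw throughput like this is only a starting point; the post looks at why tokens per second alone doesn't tell the whole story.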

Isaac Chung · 3 min read

A key ingredient for running LLMs locally (read: without multiple high-end GPUs) is quantization. But what do you do when even the 4-bit quantized model is still too big for your machine? That's what happened to me when I tried to run Mixtral-8x7B with Ollama (check out this previous blog post on what Ollama is). The model requires 26GB of RAM, while my laptop only has 16GB; a rough size estimate is sketched after the list below. I'll walk through the workaround a bit at a time (pun intended).

Key questions I'll address are:
  • What is quantization?
  • What is offloading?
  • How can I run Mixtral-8x7B for free?
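For context on why even the 4-bit build doesn't fit, here is a back-of-the-envelope estimate. The ~46.7B total parameter count and the ~10% overhead factor are my assumptions for illustration, not figures from the post.

```python
# Rough memory estimate for a quantized Mixtral-8x7B.
# Assumption: ~46.7B total parameters (the experts share attention
# layers, so the total is less than 8 x 7B).
TOTAL_PARAMS = 46.7e9
OVERHEAD = 1.1  # assumed ~10% for KV cache, activations, and metadata

def approx_gb(bits_per_weight: float) -> float:
    """Approximate resident size in GB for a given weight precision."""
    return TOTAL_PARAMS * bits_per_weight / 8 * OVERHEAD / 1e9

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{approx_gb(bits):.0f} GB")
# 4-bit still comes out around 26 GB -- more than a 16GB laptop has,
# which is where offloading comes in.
```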