5 posts tagged with "LLM"

· 3 min read
Isaac Chung

Being able to serve concurrent LLM generation requests is crucial for production LLM applications with multiple users. I recently gave a talk at PyCon Lithuania on serving quantized LLMs with llama-cpp-python, an open-source Python library that helps serve quantized models in the GGUF format. At the end, a question came from the audience about supporting multiple users and concurrent requests. I decided to take a deeper look into why the library wasn't able to support that at the time.

Key questions I'll address are:
  • What are the challenges of serving concurrent requests with LLMs?
  • How to serve concurrent requests with quantized LLMs?
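As a taste of the approach, here's a minimal sketch of serving a quantized GGUF model with llama-cpp-python while serializing concurrent calls behind a lock. The model path and generation parameters are placeholders, and the lock is just one simple workaround for sharing a single model instance, not the library's built-in answer.

```python
from threading import Lock

from llama_cpp import Llama

# Load a 4-bit quantized GGUF model (the path is a placeholder).
llm = Llama(model_path="models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048)

# A single Llama instance shouldn't be driven by several threads at once,
# so one simple (if throughput-limiting) workaround is a global lock.
_lock = Lock()

def generate(prompt: str) -> str:
    with _lock:
        result = llm.create_completion(prompt, max_tokens=128)
    return result["choices"][0]["text"]
```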

· 4 min read
Isaac Chung

We know that building a Retrieval Augmented Generation (RAG) proof of concept is easy, but making it production-ready can be hard. There is no shortage of tips and tricks out there for us to try, but at the end of the day, it all depends on our data and our application. Transitioning RAG into production follows similar principles to other production systems. Scaling up to handle more data and users, smooth error/exception handling, and getting it to play nicely with other systems are some of the main challenges to tackle. How can we really know whether our RAG system is working well, and how well? To find out, we should look at each component under the hood and evaluate the pipeline with clear metrics.

Key questions I'll address are:
  • How to look under the hood in a RAG system?
  • How to evaluate RAG systems?
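To give a flavour of component-level evaluation, here's a minimal sketch of recall@k for the retrieval step. The document IDs and relevance labels are made-up placeholders; a real evaluation would use a properly labelled set and more than one metric.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that show up in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

# Hypothetical labelled examples: retrieved doc IDs per query, plus the IDs
# a human marked as relevant for that query.
examples = [
    (["d1", "d4", "d7", "d2", "d9"], ["d1", "d2"]),
    (["d3", "d8", "d5", "d6", "d0"], ["d5", "d9"]),
]

scores = [recall_at_k(retrieved, relevant) for retrieved, relevant in examples]
print(f"mean recall@5: {sum(scores) / len(scores):.2f}")
```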

· 3 min read
Isaac Chung

How do you measure the performance of LLM serving systems? Production services in engineering are often evaluated using metrics like Requests Per Second (RPS), uptime, and latency. In computer vision systems, Frames Per Second (FPS) is often used as the main model throughput metric for use cases that involve near-real-time detection and tracking. Does serving LLMs have something similar? Certainly. A recent conversation with my team (after they read my Ollama blog) got me thinking about additional metrics that we could be tracking.

Key questions I'll address are:
  • What metrics does Ollama provide?
  • Why can't I just use tokens per second?
  • What other LLM serving metrics should I consider?
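As a starting point, here's a minimal sketch of turning the timing fields Ollama returns into a tokens-per-second figure. It assumes a local Ollama server on the default port and a model that has already been pulled; the eval_count and eval_duration fields come from Ollama's API documentation.

```python
import requests

# Ask a local Ollama server for a completion; the model name is a placeholder
# and must already be pulled (e.g. `ollama pull llama2`).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
).json()

# eval_count is the number of generated tokens and eval_duration is reported
# in nanoseconds, so decode throughput in tokens per second is:
tokens_per_second = resp["eval_count"] / resp["eval_duration"] * 1e9
print(f"{tokens_per_second:.1f} tokens/s")
```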

· 3 min read
Isaac Chung

A key ingredient to running LLMs locally (read: without multiple high-end GPUs) is quantization. What do you do when the 4-bit quantized model is still too big for your machine? That's what happened to me when I was trying to run Mixtral-8x7B with Ollama (check out this previous blog post on what Ollama is). The model requires 26GB of RAM while my laptop only has 16GB. I'll try to walk through the workaround a bit at a time (pun intended).

Key questions I'll address are:
  • What is quantization?
  • What is offloading?
  • How to run Mixtral-8x7B for free?
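To put the 26GB figure in context, here's a back-of-the-envelope sketch of how parameter count and bits per weight translate into memory. The ~47B total parameter count for Mixtral-8x7B and the 10% runtime overhead are rough assumptions, not numbers from the post.

```python
def approx_ram_gb(params_billion: float, bits_per_weight: float, overhead: float = 0.10) -> float:
    """Rough RAM estimate: weights plus a fudge factor for KV cache and buffers."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9

# Mixtral-8x7B has roughly 47B parameters in total (the experts share some layers).
for bits in (16, 8, 4, 2):
    print(f"{bits}-bit: ~{approx_ram_gb(47, bits):.0f} GB")
```

The 4-bit row lands right around the 26GB mentioned above, which is why even the quantized model doesn't fit on a 16GB laptop without further tricks like offloading.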