5 posts tagged with "LLM"

· 3 min read
Isaac Chung

Being able to serve concurrent LLM generation requests is crucial for production LLM applications with multiple users. I recently gave a talk at PyCon Lithuania on serving quantized LLMs with llama-cpp-python, an open-source Python library that helps serve quantized models in the GGUF format. At the end, a question came from the audience about supporting multiple users and concurrent requests. I decided to take a deeper look into why the library wasn't able to support that at the time.

Key questions I'll address are:
  • What are the challenges of serving concurrent requests with LLMs?
  • How to serve concurrent requests with quantized LLMs?
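As a taste of the approach, here's a minimal sketch of serving a quantized GGUF model with llama-cpp-python while serializing concurrent calls behind a lock. The model path and generation parameters are placeholders, and the lock is just one simple workaround for sharing a single model instance, not the library's built-in answer.

```python
from threading import Lock

from llama_cpp import Llama

# Load a 4-bit quantized GGUF model (the path is a placeholder).
llm = Llama(model_path="models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048)

# A single Llama instance shouldn't be driven by several threads at once,
# so one simple (if throughput-limiting) workaround is a global lock.
_lock = Lock()

def generate(prompt: str) -> str:
    with _lock:
        result = llm.create_completion(prompt, max_tokens=128)
    return result["choices"][0]["text"]
```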

· 4 min read
Isaac Chung

We know that building a Retrieval Augmented Generation (RAG) proof of concept is easy, but making it production-ready can be hard. There is no shortage of tips and tricks out there for us to try, but at the end of the day, it all depends on our data and our application. Transitioning RAG into production follows similar principles to other production systems. Scaling up to handle more data and users, smooth error/exception handling, and getting it to play nicely with other systems are some of the main challenges to tackle. How can we really know whether our RAG system is working well, and how well? To find out, we should look at each component under the hood and evaluate the pipeline with clear metrics.

Key questions I'll address are:
  • How to look under the hood in a RAG system?
  • How to evaluate RAG systems?
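To give a flavour of component-level evaluation, here's a minimal sketch of recall@k for the retrieval step. The document IDs and relevance labels are made-up placeholders; a real evaluation would use a properly labelled set and more than one metric.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that show up in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

# Hypothetical labelled examples: retrieved doc IDs per query, plus the IDs
# a human marked as relevant for that query.
examples = [
    (["d1", "d4", "d7", "d2", "d9"], ["d1", "d2"]),
    (["d3", "d8", "d5", "d6", "d0"], ["d5", "d9"]),
]

scores = [recall_at_k(retrieved, relevant) for retrieved, relevant in examples]
print(f"mean recall@5: {sum(scores) / len(scores):.2f}")
```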

· 3 min read
Isaac Chung

How do you measure the performance of LLM serving systems? Production services in engineering are often evaluated using metrics like Requests Per Second (RPS), uptime, and latency. In computer vision systems, Frames Per Second (FPS) is often used as the main model throughput metric for use cases that involve near-real-time detection and tracking. Does serving LLMs have something similar? Certainly. A recent conversation with my team (after they read my Ollama blog) got me thinking about additional metrics that we could be tracking.

Key questions I'll address are:
  • What metrics does Ollama provide?
  • Why can't I just use tokens per second?
  • What other LLM serving metrics should I consider?
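As a starting point, here's a minimal sketch of turning the timing fields Ollama returns into a tokens-per-second figure. It assumes a local Ollama server on the default port and a model that has already been pulled; the eval_count and eval_duration fields come from Ollama's API documentation.

```python
import requests

# Ask a local Ollama server for a completion; the model name is a placeholder
# and must already be pulled (e.g. `ollama pull llama2`).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
).json()

# eval_count is the number of generated tokens and eval_duration is reported
# in nanoseconds, so decode throughput in tokens per second is:
tokens_per_second = resp["eval_count"] / resp["eval_duration"] * 1e9
print(f"{tokens_per_second:.1f} tokens/s")
```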

· 3 min read
Isaac Chung

A key ingredient to running LLMs locally (read: without multiple high-end GPUs) is quantization. What do you do when the 4-bit quantized model is still too big for your machine? That's what happened to me when I was trying to run Mixtral-8x7B with Ollama (check out this previous blog post on what Ollama is). The model requires 26GB of RAM while my laptop only has 16GB. I'll try to walk through the workaround a bit at a time (pun intended).

Key questions I'll address are:
  • What is quantization?
  • What is offloading?
  • How to run Mixtral-8x7B for free?
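To put the 26GB figure in context, here's a back-of-the-envelope sketch of how parameter count and bits per weight translate into memory. The ~47B total parameter count for Mixtral-8x7B and the 10% runtime overhead are rough assumptions, not numbers from the post.

```python
def approx_ram_gb(params_billion: float, bits_per_weight: float, overhead: float = 0.10) -> float:
    """Rough RAM estimate: weights plus a fudge factor for KV cache and buffers."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9

# Mixtral-8x7B has roughly 47B parameters in total (the experts share some layers).
for bits in (16, 8, 4, 2):
    print(f"{bits}-bit: ~{approx_ram_gb(47, bits):.0f} GB")
```

The 4-bit row lands right around the 26GB mentioned above, which is why even the quantized model doesn't fit on a 16GB laptop without further tricks like offloading.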