Being able to serve concurrent LLM generation requests is crucial for production LLM applications with multiple users. I recently gave a talk at PyCon Lithuania on serving quantized LLMs with llama-cpp-python, an open-source Python library that helps serve quantized models in the GGUF format. At the end, a question came from the audience about supporting multiple users and concurrent requests. I decided to take a deeper look into why the library wasn't able to support that at the time.
Key questions I'll address are:
- What are the challenges of serving concurrent requests with LLMs?
- How can we serve concurrent requests with quantized LLMs?