Isaac Chung · 3 min read

Being able to serve concurrent LLM generation requests is crucial for production LLM applications with multiple users. I recently gave a talk at PyCon Lithuania on serving quantized LLMs with llama-cpp-python, an open source Python library for serving quantized models in the GGUF format. At the end, an audience member asked about supporting multiple users and concurrent requests. I decided to take a deeper look into why the library couldn't support that at the time.
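For context, here is a minimal sketch of the single-request usage pattern the talk covered: loading a quantized GGUF model with llama-cpp-python and generating one completion. The model path is a placeholder; any GGUF file works.

```python
from llama_cpp import Llama

# Load a quantized GGUF model from disk (path is a placeholder).
llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=2048,  # context window size
)

# A single, blocking completion call: while it runs, this process
# can't start another generation -- the crux of the concurrency question.
output = llm(
    "Q: What is the capital of Lithuania? A:",
    max_tokens=32,
    stop=["Q:", "\n"],
)
print(output["choices"][0]["text"])
```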

The key questions I'll address are:
  • What are the challenges of serving concurrent requests with LLMs?
  • How can we serve concurrent requests with quantized LLMs?