Embeddings power many AI applications we interact with — search engines, RAG systems — but how do we know if they’re actually any good? Existing benchmarks tend to focus on a narrow set of tasks, often evaluating models in isolation without considering real-world, multilingual challenges. This can make it tough to figure out which models are truly effective, and where they might fall short. That's why we need a more comprehensive way to evaluate embeddings: one that takes into account the messy, multilingual nature of real-world language use. MMTEB is designed to fill this gap, providing a broad and diverse set of evaluation tasks that can help us better understand what works, and what doesn't, in the world of embeddings.
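To make that concrete, here's a minimal sketch of what running an embedding benchmark looks like in practice, using the open-source mteb package with a sentence-transformers model. The model name and task selection are illustrative placeholders, and the exact API surface can differ between mteb versions, so treat this as a sketch rather than a recipe.

```python
# Minimal sketch: evaluating an embedding model on a couple of MTEB tasks.
# Assumes `mteb` and `sentence-transformers` are installed; the model and the
# task list below are placeholders, not recommendations.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any model exposing .encode()

evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```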
AI is a fast-moving field. This applies to both products and academia. New papers on NLP/LLMs/CV/ML come out almost every day, and there is no shortage of new products and companies popping up on our social feeds. How do we know if our time is well spent focusing on one topic? Would it be better to "keep up with the Joneses" and explore every new development as it comes?
This blog isn't technical. Instead, I aim to start a conversation around maintaining focus.
Key questions I'll address are:
Should we chase every new AI innovation?
Without focus, what are the chances of our research succeeding in the long term?
I can imagine that many of you are also struggling with this question. Keeping up with every new development demands a lot of time and resources, while ignoring them might leave us behind and make our knowledge obsolete. A dedicated focus, on the other hand, makes it easier to 1) dive deeper into a topic and 2) limit how much we need to stay current on everything. That enables deeper progress in select areas, but it also means we won't catch every new trend. Chasing the latest trends, by contrast, risks spreading ourselves too thin and losing focus altogether.
Without focus, what are the chances of our research succeeding in the long term?
ML research teams usually have fairly long-term areas of focus that are already aligned with clearly defined commercial needs. But that's not always the case: teams are often required to juggle demands like supporting product releases and handling multiple lines of work.
But what's the impact of a lack of focus in research?
Loss of continuity: Regularly switching focus means you might not spend enough time on one topic to make substantial progress. You might not complete your work, or have time to develop as deep an understanding as you'd like. Alternatively, you might keep returning to the work and have to suffer through repeated starting phases, which are often less productive than the middle and end stages of research.
Resource dilution: Dividing your limited time and cognitive resources across multiple topics prevents deep dives into any one area. Shallow work in multiple areas is less likely to yield significant breakthroughs compared to sustained deep work in one area.
Goal fragmentation: Achieving long-term goals requires sustained effort and clear, consistent objectives. Frequent changes can fragment your goals and dilute the clarity of your research path.
Building on results: Research often builds incrementally, where each phase relies on the outcomes of the previous phase. Without cumulative progress, your research might lack the depth and evolution necessary for significant discoveries.
Team dynamics: Consistent focus areas facilitate better collaboration and synergy within your team, while frequent changes can be confusing and demotivating. This can lead to a loss in productivity and innovation.
There is also the tension between business needs and research needs: business needs could be a product release or a service delivery, whereas research needs could be to investigate emerging technologies and techniques that may not have immediate commercial applications.
To balance these two needs we could:
Set clear, multi-purpose goals: Try to align our research objectives with commercial goals where possible. Identify areas where advancing academic knowledge can directly contribute to product improvements or innovations.
Leverage external resources: Collaborate with academic institutions to stay on top of research while sharing the burden of exploratory work. Use conferences, publications, and peer reviews to gain feedback on our progress.
Secure support and resources: Ensure that stakeholders understand the value of long-term research and support us with appropriate resources and funding. We want to avoid research projects being overshadowed by immediate commercial demands.
Understand core competencies: Play to our strengths, understand when to outsource, and find ways to complement the team's weaknesses.
Balance requires tradeoffs. While some may prefer to work quickly and chase multiple goals simultaneously, it's crucial to balance urgency with the need for quality and thoroughness in both research and product development. Rushed commercialization can lead to:
Incomplete solutions: Products that fail to fully address the intended problem or meet user needs.
Technical debt: Issues that accumulate over time, requiring significant resources to resolve later.
Reputational damage: Potential harm to the company’s reputation if the product performs poorly.
The decision to chase every new innovation or maintain focused efforts is a constant dilemma.
Ultimately, finding this equilibrium is crucial to driving meaningful progress while avoiding the pitfalls of rushed commercialization. Let me know in the comments what you think and how you keep on top of the latest developments.
Being able to serve concurrent LLM generation requests is crucial for production LLM applications that have multiple users. I recently gave a talk at PyCon Lithuania on serving quantized LLMs with llama-cpp-python, an open-source Python library that helps serve quantized models in the GGUF format. At the end, a question came from the audience about supporting multiple users and concurrent requests. I decided to take a deeper look into why the library wasn't able to support that at the time.
Key questions I'll address are:
What are the challenges of serving concurrent requests with LLMs?
How to serve concurrent requests with quantized LLMs?
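As a starting point for that discussion, here is a minimal sketch (not the approach from the talk) of the naive way to expose a single GGUF model over HTTP with llama-cpp-python and FastAPI. A single Llama instance generally isn't safe to share across concurrent generations, so the sketch serializes requests behind a lock, which keeps things correct but means they queue up; real concurrency needs batching or multiple worker processes. The model path is a placeholder.

```python
# Minimal sketch: one GGUF model behind a FastAPI endpoint, with generation
# serialized by an asyncio.Lock because a single llama_cpp.Llama instance
# shouldn't handle overlapping generations. The model path is a placeholder.
import asyncio

from fastapi import FastAPI
from llama_cpp import Llama
from pydantic import BaseModel

app = FastAPI()
llm = Llama(model_path="models/llama-2-7b.Q4_K_M.gguf", n_ctx=2048)
lock = asyncio.Lock()  # only one request touches the model at a time


class Prompt(BaseModel):
    text: str
    max_tokens: int = 128


@app.post("/generate")
async def generate(prompt: Prompt):
    async with lock:
        # Run the blocking llama.cpp call in a thread so the event loop stays responsive.
        output = await asyncio.to_thread(
            llm.create_completion, prompt.text, max_tokens=prompt.max_tokens
        )
    return {"completion": output["choices"][0]["text"]}
```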
We know that building a Retrieval Augmented Generation (RAG) proof of concept is easy, but making it production-ready can be hard. There is no shortage of tips and tricks out there for us to try, but at the end of the day, it all depends on our data and our application. Transitioning RAG into production follows similar principles to other production systems: scaling up to handle more data and users, smooth error/exception handling, and getting it to play nice with other systems are some of the main challenges to tackle. How can we really know whether our RAG system is working well, and how well? To find out, we should take a look at each component under the hood and evaluate the pipeline with clear metrics.
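As one example of such a metric, here is a small, dependency-free sketch that scores just the retrieval step against a hand-labelled set of question/relevant-document pairs. The retrieve callable and the evaluation set are placeholders for your own components.

```python
# Minimal sketch: scoring the retrieval step of a RAG pipeline with hit rate
# and mean reciprocal rank (MRR) over a small hand-labelled evaluation set.
from typing import Callable, List, Tuple


def evaluate_retriever(
    retrieve: Callable[[str, int], List[str]],   # (question, k) -> ranked doc ids
    eval_set: List[Tuple[str, str]],             # (question, id of the relevant document)
    k: int = 5,
) -> dict:
    hits, reciprocal_ranks = 0, []
    for question, relevant_doc_id in eval_set:
        retrieved_ids = retrieve(question, k)
        if relevant_doc_id in retrieved_ids:
            hits += 1
            reciprocal_ranks.append(1.0 / (retrieved_ids.index(relevant_doc_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return {
        "hit_rate@k": hits / len(eval_set),
        "mrr@k": sum(reciprocal_ranks) / len(eval_set),
    }


# Usage (with a hypothetical retriever):
# metrics = evaluate_retriever(my_retriever.search, [("What is RAG?", "doc-42")], k=5)
```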
How do you measure the performance of LLM serving systems? Production services in engineering are often evaluated using metrics like Requests Per Second (RPS), uptime, and latency. In computer vision systems, Frames Per Second (FPS) is often used as the main model throughput metric for use cases that involve near-real-time detection and tracking. Does serving LLMs have something similar? Certainly. A recent conversation with my team (after they read my Ollama blog) got me thinking about additional metrics that we could be tracking.
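Two metrics commonly cited for LLM serving are time-to-first-token (TTFT) and output tokens per second, alongside end-to-end latency. Here is a minimal, framework-agnostic sketch of how one might measure them around any streaming generation; stream_tokens is a placeholder for your serving client's streaming iterator.

```python
# Minimal sketch: measuring time-to-first-token (TTFT), output tokens per
# second, and total latency around any token stream.
import time
from typing import Iterable


def measure_stream(stream_tokens: Iterable[str]) -> dict:
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
        n_tokens += 1
    end = time.perf_counter()
    decode_time = end - (first_token_at or start)
    return {
        "time_to_first_token_s": (first_token_at or end) - start,
        "output_tokens_per_s": n_tokens / decode_time if decode_time > 0 else float("nan"),
        "total_latency_s": end - start,
    }
```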
A key ingredient to running LLMs locally (read: without multiple high-end GPUs) is quantization. What do you do when the 4-bit quantized model is still too big for your machine? That's what happened to me when I was trying to run Mixtral-8x7B with Ollama (check out this previous blog post on what Ollama is). The model requires 26GB of RAM while my laptop only has 16GB. I'll try to walk through the workaround a bit at a time (pun intended).
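To see why even a 4-bit model doesn't fit, a rough back-of-the-envelope calculation helps: Mixtral-8x7B has roughly 46.7B total parameters, and a 4-bit GGUF quantization effectively spends around 4.5–5 bits per weight once block scales are included. The numbers below are approximations, and runtime overhead such as the KV cache comes on top.

```python
# Back-of-the-envelope sketch: why 4-bit Mixtral-8x7B lands around 26 GB.
# Rough numbers only; effective bits-per-weight for a GGUF Q4 quant includes
# block scales, and the KV cache plus runtime buffers add more on top.
total_params = 46.7e9      # approximate total parameter count of Mixtral-8x7B
bits_per_weight = 4.5      # roughly what a Q4_0-style quantization costs per weight

weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.1f} GB for the weights alone")  # ~26 GB before overhead
```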
Being able to run LLMs locally and easily is truly a game changer. I have heard about Ollama before and decided to take a look at it this past weekend.