Embeddings power many AI applications we interact with — search engines, RAG systems — but how do we know if they’re actually any good? Existing benchmarks tend to focus on a narrow set of tasks, often evaluating models in isolation without considering real-world, multilingual challenges. This can make it tough to figure out which models are truly effective, and where they might fall short. That's why we need a more comprehensive way to evaluate embeddings: one that takes into account the messy, multilingual nature of real-world language use. MMTEB is designed to fill this gap, providing a broad and diverse set of evaluation tasks that can help us better understand what works, and what doesn't, in the world of embeddings.
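To make that concrete, here's a minimal sketch of what running an embedding benchmark looks like in practice, using the open-source mteb package with a sentence-transformers model. The model name and task selection are illustrative placeholders, and the exact API surface can differ between mteb versions, so treat this as a sketch rather than a recipe.

```python
# Minimal sketch: evaluating an embedding model on a couple of MTEB tasks.
# Assumes `mteb` and `sentence-transformers` are installed; the model and the
# task list below are placeholders, not recommendations.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any model exposing .encode()

evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```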
AI is a fast-moving field. This applies to both products and academia. New papers on NLP/LLMs/CV/ML come out almost every day, and there is no shortage of new products and companies popping up on our social feeds. How do we know if our time is well spent focusing on one topic? Would it be better to "keep up with the Joneses" and explore every new development as it comes?
This blog isn't technical. Instead, I aim to start a conversation around maintaining focus.
Key questions I'll address are:
Should we chase every new AI innovation?
Without focus, what are the chances of our research succeeding in the long term?
I can imagine that many of you are also struggling with this question. Keeping up with every new development demands a lot of time and resources, while ignoring them might leave us behind and make our knowledge obsolete. A dedicated focus, on the other hand, makes it easier to 1) dive deeper into a topic and 2) limit how much we need to stay current on everything. That enables deeper progress in select areas, but it also means we won't catch every new trend. Chasing the latest trends, by contrast, risks spreading ourselves too thin and losing focus altogether.
Without focus, what are the chances of our research succeeding in the long term?
ML research teams usually have fairly long-term areas of focus that are already aligned with clearly defined commercial needs. But that's not always the case: teams are often required to juggle demands like supporting product releases and handling multiple lines of work.
But what's the impact of a lack of focus in research?
Loss of continuity: Regularly switching focus means you might not spend enough time on one topic to make substantial progress. You might not complete your work, or have time to develop as deep an understanding as you'd like. Alternatively, you might keep returning to the work and have to suffer through repeated starting phases, which are often less productive than the middle and end stages of research.
Resource dilution: Dividing your limited time and cognitive resources across multiple topics prevents deep dives into any one area. Shallow work in multiple areas is less likely to yield significant breakthroughs compared to sustained deep work in one area.
Goal fragmentation: Achieving long-term goals requires sustained effort and clear, consistent objectives. Frequent changes can fragment your goals and dilute the clarity of your research path.
Building on results: Research often builds incrementally, where each phase relies on the outcomes of the previous phase. Without cumulative progress, your research might lack the depth and evolution necessary for significant discoveries.
Team dynamics: Consistent focus areas facilitate better collaboration and synergy within your team, while frequent changes can be confusing and demotivating. This can lead to a loss in productivity and innovation.
There is also the tension between business needs and research needs: business needs could be a product release or a service delivery, whereas research needs could be to investigate emerging technologies and techniques that may not have immediate commercial applications.
To balance these two needs we could:
Set clear, multi-purpose goals: Try to align our research objectives with commercial goals where possible. Identify areas where advancing academic knowledge can directly contribute to product improvements or innovations.
Leverage external resources: Collaborate with academic institutions to stay on top of research while sharing the burden of exploratory work. Use conferences, publications, and peer reviews to gain feedback on our progress.
Secure support and resources: Ensure that stakeholders understand the value of long-term research and support us with appropriate resources and funding. We want to avoid research projects being overshadowed by immediate commercial demands.
Understand core competencies: Play to our strengths, understand when to outsource, and find ways to complement the team's weaknesses.
Balance requires tradeoffs. While some may prefer to work quickly and chase multiple goals simultaneously, it's crucial to balance urgency with the need for quality and thoroughness in both research and product development. Rushed commercialization can lead to:
Incomplete solutions: Products that fail to fully address the intended problem or meet user needs.
Technical debt: Issues that accumulate over time, requiring significant resources to resolve later.
Reputational damage: Potential harm to the company’s reputation if the product performs poorly.
The decision to chase every new innovation or maintain focused efforts is a constant dilemma.
Ultimately, finding this equilibrium is crucial to driving meaningful progress while avoiding the pitfalls of rushed commercialization. Let me know in the comments what you think and how you keep on top of the latest developments.
Being able to serve concurrent LLM generation requests is crucial for production LLM applications that have multiple users. I recently gave a talk at PyCon Lithuania on serving quantized LLMs with llama-cpp-python, an open-source Python library that helps serve quantized models in the GGUF format. At the end, a question came from the audience about supporting multiple users and concurrent requests. I decided to take a deeper look into why the library wasn't able to support that at the time.
Key questions I'll address are:
What are the challenges of serving concurrent requests with LLMs?
How to serve concurrent requests with quantized LLMs?
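As a starting point for that discussion, here is a minimal sketch (not the approach from the talk) of the naive way to expose a single GGUF model over HTTP with llama-cpp-python and FastAPI. A single Llama instance generally isn't safe to share across concurrent generations, so the sketch serializes requests behind a lock, which keeps things correct but means they queue up; real concurrency needs batching or multiple worker processes. The model path is a placeholder.

```python
# Minimal sketch: one GGUF model behind a FastAPI endpoint, with generation
# serialized by an asyncio.Lock because a single llama_cpp.Llama instance
# shouldn't handle overlapping generations. The model path is a placeholder.
import asyncio

from fastapi import FastAPI
from llama_cpp import Llama
from pydantic import BaseModel

app = FastAPI()
llm = Llama(model_path="models/llama-2-7b.Q4_K_M.gguf", n_ctx=2048)
lock = asyncio.Lock()  # only one request touches the model at a time


class Prompt(BaseModel):
    text: str
    max_tokens: int = 128


@app.post("/generate")
async def generate(prompt: Prompt):
    async with lock:
        # Run the blocking llama.cpp call in a thread so the event loop stays responsive.
        output = await asyncio.to_thread(
            llm.create_completion, prompt.text, max_tokens=prompt.max_tokens
        )
    return {"completion": output["choices"][0]["text"]}
```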
We know that building a Retrieval Augmented Generation (RAG) proof of concept is easy, but making it production-ready can be hard. There is no shortage of tips and tricks out there for us to try, but at the end of the day, it all depends on our data and our application. Transitioning RAG into production follows similar principles to other production systems: scaling up to handle more data and users, smooth error/exception handling, and getting it to play nice with other systems are some of the main challenges to tackle. How can we really know whether our RAG system is working well, and how well? To find out, we should take a look at each component under the hood and evaluate the pipeline with clear metrics.
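As one example of such a metric, here is a small, dependency-free sketch that scores just the retrieval step against a hand-labelled set of question/relevant-document pairs. The retrieve callable and the evaluation set are placeholders for your own components.

```python
# Minimal sketch: scoring the retrieval step of a RAG pipeline with hit rate
# and mean reciprocal rank (MRR) over a small hand-labelled evaluation set.
from typing import Callable, List, Tuple


def evaluate_retriever(
    retrieve: Callable[[str, int], List[str]],   # (question, k) -> ranked doc ids
    eval_set: List[Tuple[str, str]],             # (question, id of the relevant document)
    k: int = 5,
) -> dict:
    hits, reciprocal_ranks = 0, []
    for question, relevant_doc_id in eval_set:
        retrieved_ids = retrieve(question, k)
        if relevant_doc_id in retrieved_ids:
            hits += 1
            reciprocal_ranks.append(1.0 / (retrieved_ids.index(relevant_doc_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return {
        "hit_rate@k": hits / len(eval_set),
        "mrr@k": sum(reciprocal_ranks) / len(eval_set),
    }


# Usage (with a hypothetical retriever):
# metrics = evaluate_retriever(my_retriever.search, [("What is RAG?", "doc-42")], k=5)
```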
How do you measure the performance of LLM serving systems? Production services in engineering are often evaluated using metrics like Requests Per Second (RPS), uptime, and latency. In computer vision systems, Frames Per Second (FPS) is often used as the main model throughput metric for use cases that involve near-real-time detection and tracking. Does serving LLMs have something similar? Certainly. A recent conversation with my team (after they read my Ollama blog) got me thinking about additional metrics that we could be tracking.
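Two metrics commonly cited for LLM serving are time-to-first-token (TTFT) and output tokens per second, alongside end-to-end latency. Here is a minimal, framework-agnostic sketch of how one might measure them around any streaming generation; stream_tokens is a placeholder for your serving client's streaming iterator.

```python
# Minimal sketch: measuring time-to-first-token (TTFT), output tokens per
# second, and total latency around any token stream.
import time
from typing import Iterable


def measure_stream(stream_tokens: Iterable[str]) -> dict:
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
        n_tokens += 1
    end = time.perf_counter()
    decode_time = end - (first_token_at or start)
    return {
        "time_to_first_token_s": (first_token_at or end) - start,
        "output_tokens_per_s": n_tokens / decode_time if decode_time > 0 else float("nan"),
        "total_latency_s": end - start,
    }
```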
A key ingredient to running LLMs locally (read: without multiple high-end GPUs) is quantization. What do you do when the 4-bit quantized model is still too big for your machine? That's what happened to me when I was trying to run Mixtral-8x7B with Ollama (check out this previous blog post on what Ollama is). The model requires 26GB of RAM while my laptop only has 16GB. I'll try to walk through the workaround a bit at a time (pun intended).
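To see why even a 4-bit model doesn't fit, a rough back-of-the-envelope calculation helps: Mixtral-8x7B has roughly 46.7B total parameters, and a 4-bit GGUF quantization effectively spends around 4.5–5 bits per weight once block scales are included. The numbers below are approximations, and runtime overhead such as the KV cache comes on top.

```python
# Back-of-the-envelope sketch: why 4-bit Mixtral-8x7B lands around 26 GB.
# Rough numbers only; effective bits-per-weight for a GGUF Q4 quant includes
# block scales, and the KV cache plus runtime buffers add more on top.
total_params = 46.7e9      # approximate total parameter count of Mixtral-8x7B
bits_per_weight = 4.5      # roughly what a Q4_0-style quantization costs per weight

weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.1f} GB for the weights alone")  # ~26 GB before overhead
```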
Being able to run LLMs locally and easily is truly a game changer. I have heard about Ollama before and decided to take a look at it this past weekend.