Comparative Analysis of Information Retrieval using LLMs: SBERT vs. Mistral vs. OpenAI

Kushal Shah
Jul 8, 2024

Implementing information retrieval has become very easy due to the advent of Large Language Models [LLMs], but the accuracy of these algorithms is still not very high. There are many reasons for this low accuracy, one of them being that the user query usually does not contain enough context for the model to find a suitable answer in the available database. Also, there are so many LLMs available in the market that it can be quite difficult to decide which one to use. There are smaller open-source models available through HuggingFace, and there are larger closed models available as APIs. It is also not clear whether the larger paid models are always much better than the smaller open-source models. We did an analysis using verses from the Bhagavad Gita to compare SBERT-like models with Mistral and OpenAI embeddings on an information retrieval task.

Data Preparation

The Bhagavad Gita has 700 verses, most of which pertain to a conversation between Krishna and Arjuna about life in general and decision making in particular. The original text is in Sanskrit, but we are using an English translation. We have manually written the relevant questions corresponding to each verse and used these for our analysis. Each verse can map to more than one question, but each question always maps to exactly one verse. This dataset is part of our larger initiative to make Indian scriptures available in a format that can be used for Data Science and Machine Learning projects.
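
To make the verse-question mapping concrete, here is a minimal sketch of how such a dataset could be organised in Python. The column names and example question strings are illustrative assumptions, not the actual dataset.

```python
# Hypothetical layout of the verse-question dataset (column names and the
# example questions are assumptions for illustration only).
import pandas as pd

data = pd.DataFrame(
    {
        "verse_id": ["BG_2_47", "BG_2_47", "BG_18_66"],
        "verse_text": [
            "<English translation of verse 2.47>",
            "<English translation of verse 2.47>",
            "<English translation of verse 18.66>",
        ],
        "question": [
            "Should I worry about the results of my work?",
            "What should I focus on while doing my duty?",
            "What should I do when I feel confused about my duty?",
        ],
    }
)

# One verse may appear against several questions,
# but each question points to exactly one verse.
assert data.groupby("question")["verse_id"].nunique().max() == 1
```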

Basic Information Retrieval

We have used a very simple information retrieval algorithm for our analysis. Given a user query, we convert it into a vector embedding using one of the LLMs, compute its similarity with the embeddings of all 700 verses, and rank the verses by their similarity scores for the given query. We compare the accuracy of several different LLMs using the manually labeled answers and also see how the results change when we introduce some errors into the user query.
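
A minimal sketch of this retrieval loop is shown below, assuming the 700 English verse translations are available as a list of strings. It uses the sentence-transformers library with all-MiniLM-L6-v2 here, but the same loop applies to any of the embedding models compared in this article.

```python
# Sketch of the basic retrieval step: embed the query, score it against the
# pre-computed verse embeddings, and rank the verses by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

verses = ["<verse 1>", "<verse 2>", "<verse 3>"]  # 700 verse translations in practice
verse_embeddings = model.encode(verses, convert_to_tensor=True)

def rank_verses(query: str, top_k: int = 10):
    """Return the top_k (verse_index, similarity) pairs for a user query."""
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, verse_embeddings)[0]  # one score per verse
    top = scores.topk(k=min(top_k, len(verses)))
    return [(int(i), float(s)) for s, i in zip(top.values, top.indices)]

print(rank_verses("Should I worry about the results of my actions?"))
```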

Selected LLMs

We have chosen the following models for our analysis; a sketch of how embeddings are obtained from each model family follows the list:

  1. all-MiniLM-L6-v2 [SBERT category]
  2. all-mpnet-base-v2 [SBERT category]
  3. Mistral
  4. OpenAI Small
  5. OpenAI Large
  6. Dense Passage Retrieval [DPR]
  7. Spider
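
The SBERT-family models run locally through sentence-transformers, while the Mistral and OpenAI embeddings come from their hosted APIs. The rough sketch below shows how the embeddings are obtained; exact client interfaces depend on library versions (this assumes sentence-transformers and openai>=1.0, and Mistral's mistral-embed endpoint is called analogously through the mistralai client).

```python
# Sketch of obtaining embeddings for the different model families.
from sentence_transformers import SentenceTransformer
from openai import OpenAI

text = "Why should one perform one's duty without attachment?"

# SBERT-family models run locally
sbert_small_vec = SentenceTransformer("all-MiniLM-L6-v2").encode(text)
sbert_base_vec = SentenceTransformer("all-mpnet-base-v2").encode(text)

# OpenAI embeddings come from the hosted API (expects OPENAI_API_KEY in the environment)
client = OpenAI()
openai_small_vec = client.embeddings.create(
    model="text-embedding-3-small", input=text
).data[0].embedding
openai_large_vec = client.embeddings.create(
    model="text-embedding-3-large", input=text
).data[0].embedding
```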

Ranks based on Similarity Scores

The table below shows, for each model, the number of user questions for which the correct verse received Rank 1 from our embedding-similarity retrieval algorithm. We can see that OpenAI Large gives the best performance among the chosen LLMs, but even it retrieves the correct answer for only 214 out of 885 user queries in our dataset. Another interesting observation is that the much smaller SBERT-like models are not far behind. In fact, OpenAI Small is only marginally better than all-MiniLM-L6-v2, and Mistral performs very similarly to OpenAI Large. A big surprise was the poor performance of the DPR and Spider models, which we had expected to do much better. For further analysis, we dropped the DPR and Spider models.

Since very few correct verses were at Rank 1, we also checked how many correct verses had a rank of 10 or better and 20 or better, respectively. While these numbers are certainly larger than those for Rank 1, on manual analysis of the retrieved answers we found that the larger numbers in the Rank <= 10 and Rank <= 20 columns were not meaningful. This is because when the correct verse is not at Rank 1, the other retrieved verses in the top 10 or top 20 are usually not relevant to the user query. In other words, either the correct answer has the highest similarity score, or it is not possible to get a good answer for the given user query with this simple algorithm.
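
A sketch of how these counts can be computed, reusing the rank_verses function sketched earlier; labelled_pairs is an assumed list of (question, correct_verse_index) tuples built from our manual annotations.

```python
# Count how often the correct verse lands at Rank 1, within the top 10,
# and within the top 20, across all labelled question-verse pairs.
def count_hits(labelled_pairs, top_k: int = 20):
    hits = {"rank_1": 0, "rank_le_10": 0, "rank_le_20": 0}
    for question, correct_idx in labelled_pairs:
        ranked = [idx for idx, _ in rank_verses(question, top_k=top_k)]
        if ranked and ranked[0] == correct_idx:
            hits["rank_1"] += 1
        if correct_idx in ranked[:10]:
            hits["rank_le_10"] += 1
        if correct_idx in ranked[:20]:
            hits["rank_le_20"] += 1
    return hits
```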

Query Modification Analysis

To understand how well different embedding models handle variations in questions, we introduced three types of manipulations (a code sketch follows the list):

  • Spelling Mistakes: A random letter in a random word of the question was swapped with an adjacent letter. This simulates typical typing errors.
  • Swapping Words: Two random words in the question were chosen and swapped.
  • Deletion of Words: A random word in the question was selected and deleted.
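
The sketch below shows one way these manipulations can be implemented; the whitespace tokenisation and randomness are illustrative and may differ in detail from the original experiments.

```python
# Illustrative implementations of the three query manipulations.
import random

def add_spelling_mistake(question: str) -> str:
    """Swap a random letter in a random word with its neighbouring letter."""
    words = question.split()
    i = random.randrange(len(words))
    word = words[i]
    if len(word) > 1:
        j = random.randrange(len(word) - 1)
        words[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return " ".join(words)

def swap_words(question: str) -> str:
    """Swap two randomly chosen words."""
    words = question.split()
    if len(words) > 1:
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

def delete_word(question: str) -> str:
    """Delete one randomly chosen word."""
    words = question.split()
    if len(words) > 1:
        del words[random.randrange(len(words))]
    return " ".join(words)
```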

This helped us evaluate the robustness of the models in real-world scenarios where such errors are common. To evaluate the performance of the LLMs, we used the following two metrics:

  1. Rank = 1: Whether the correct verse receives the highest similarity score for the modified query, just as described previously. This remains the primary criterion for evaluating the models.
  2. Original Match: We check whether the passage with the highest similarity score before the question modification remains the same after the modification (a short sketch of this check follows the list). This metric, while not crucial, provides insights into the stability of the models’ similarity scores under manipulation.
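
A minimal sketch of the Original Match check, reusing the rank_verses function sketched earlier:

```python
# Does the top-ranked verse stay the same after the query is perturbed?
def original_match(original_question: str, modified_question: str) -> bool:
    top_before = rank_verses(original_question, top_k=1)[0][0]
    top_after = rank_verses(modified_question, top_k=1)[0][0]
    return top_before == top_after
```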

On the whole, all-MiniLM-L6-v2 seems to be the most stable under these query modifications, while the OpenAI models seem very sensitive to word order and spelling mistakes. This is a very interesting result, and it shows that in some situations a much smaller model like all-MiniLM-L6-v2 gives much better performance than the OpenAI embedding models.

Across all models, spelling mistakes had the most significant impact on the similarity scores and the ability of the models to correctly rank passages. This indicates that spelling errors greatly affect the representation of questions as vectors.

Swapping words caused moderate changes in performance. Some models handled word swaps better, while others like OpenAI Large were more affected.

Deletion of words generally caused the least change. Some models, such as OpenAI Small and OpenAI Large, showed results similar to those on the original data, indicating robustness to missing words. Interestingly, all-mpnet-base-v2 performed even better with deletions.

Question Word Deletion Analysis

Finally, we wanted to see how much impact question words like “Why”, “What”, etc. have on the query embeddings. To do this, we removed all such words from the queries, computed the similarity scores between each original query and its updated version, and counted the number of original-new query pairs for which the similarity score was above 0.85. Surprisingly, for all the LLMs we worked with, we found that the number of such query pairs is around 800 or more, which means that removing the question words has minimal impact on the query embeddings. This is perhaps because, although removing wh-question words (like “how”, “what”, “why”) changes the question structure and type, the core content of the question remains identifiable. In contrast, when we removed words of some other parts of speech, such as nouns or verbs, there was a significant change in the embedding vectors, due to which very few original-new query pairs had a similarity score above 0.85.
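
A sketch of this experiment with all-MiniLM-L6-v2; the wh-word list and the string cleaning are illustrative assumptions.

```python
# Drop common question words from a query, re-embed it, and check whether the
# cosine similarity with the original query embedding stays above 0.85.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
WH_WORDS = {"who", "what", "when", "where", "why", "how", "which", "whom", "whose"}

def drop_question_words(question: str) -> str:
    return " ".join(w for w in question.split() if w.lower().strip("?,") not in WH_WORDS)

def is_stable(question: str, threshold: float = 0.85) -> bool:
    original = model.encode(question, convert_to_tensor=True)
    modified = model.encode(drop_question_words(question), convert_to_tensor=True)
    return float(util.cos_sim(original, modified)) >= threshold

stable_count = sum(is_stable(q) for q in questions)  # `questions` = list of all user queries
```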

Conclusion

Simple LLM-based information retrieval is certainly very easy to implement, but its accuracy is not very good even if we use large models from OpenAI and Mistral. One possible solution is re-ranking of the top 10 or top 20 retrieved passages using classical algorithms, but for that to work well, the user query also needs to contain more context. This is also why, although Retrieval Augmented Generation [RAG] has become a popular buzzword, actually making it work with high enough accuracy is a huge challenge. RAG implementations face another challenge: the final answer can easily get contaminated by the knowledge inherent in the generative LLM used for output generation.

It is unlikely that this additional context will be provided by the user while typing the query, since the whole point of such applications is for the system to understand what the user wants and quickly find a reasonably good answer. It is also unlikely that a single sentence embedding model will be able to understand the context on its own for a wide variety of domains. So the solution seems to be for domain experts to work with LLM engineers to design information retrieval systems for specific domains using relevant domain knowledge. This can perhaps be done using a modified version of RLHF [Reinforcement Learning from Human Feedback], which is commonly used to improve the performance of ChatGPT-like apps.

This work was done by my intern, Gayatri K.
