Limitations of Information Retrieval using Large Language Models [LLMs]

Kushal Shah
Nov 12, 2023


Information Retrieval [IR] is what defines the internet age! Google became the poster boy of the internet age in the early 2000s because of its amazing IR capabilities, and OpenAI has become the poster boy of the AI age in 2023, again because of its amazing IR capabilities. What has changed due to AI is that, in response to our queries, we no longer get a list of websites with long pages of text that we have to go through manually to find the actual answer. Now, thanks to Large Language Models [LLMs], we can directly get an answer using online tools like OpenAI’s ChatGPT or Google’s Bard, which saves a lot of our precious time!

LLM-based Information Retrieval has become indispensable because classical IR approaches largely depend on keyword matches, which is not very effective since the same question can be asked in many different ways and keywords do not capture the semantics (meaning) of the language. The demonstration that carefully trained vector embeddings can represent meaning has been a moment of enlightenment for the AI community! Even before the advent of ChatGPT, Google had been extensively using BERT for processing its search queries.

While LLMs have significantly improved search technology, they still suffer from two major problems. Firstly, LLMs require a huge amount of data for pre-training, and it is very difficult to integrate new information into them on a regular basis. For example, OpenAI’s ChatGPT has a certain cut-off date for its pre-training data, and the model cannot answer questions that depend on later information. Secondly, LLMs are prone to hallucinations. While LLMs give us amazing answers to even the most difficult questions, there is a high chance that these answers are plain wrong. LLMs are like highly paid consultants on steroids. Extremely confident and often wrong!

Retrieval Augmented Generation [RAG]

A popular solution that addresses both these problems is Retrieval Augmented Generation [RAG]. Instead of directly feeding the query to the LLM to generate an answer, RAG systems first fetch relevant passages from a database and then feed both the query and these passages into the LLM to generate a coherent response. This solves the problem of the pre-training cut-off, since RAG systems can fetch arbitrary information from a given database, and the problem of hallucinations is also mitigated, since the database is expected to contain correct information.
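To make this concrete, here is a minimal sketch of the generation step of a RAG pipeline, assuming a hypothetical retrieve() function that returns relevant passages from our database and using the OpenAI chat API (the model name and prompt format are placeholders, not a prescription):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_with_rag(query, retrieve):
    # retrieve() is a hypothetical function that returns the most relevant
    # passages for the query from our own database
    passages = retrieve(query)
    context = "\n".join(passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content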

Now, LLMs are extremely good at generating a coherent response from a given set of retrieved passages, but retrieving relevant passages from a database is still a huge challenge. A simple way to retrieve relevant passages is to again use LLM-based embeddings (a minimal sketch of this pipeline is given after the steps below):

  1. Divide all the relevant documents into smaller chunks, whose length depends on the context length of the chosen LLM.
  2. Convert these chunks of text into embedding vectors using an LLM like BERT or SBERT, and store the embeddings in a vector database (e.g. using the pgvector extension of PostgreSQL).
  3. For a given query, find relevant passages by taking the dot product of the query embedding with the embedding vectors of the passages. The passage with the highest similarity to the query vector (a cosine similarity close to 1) is expected to be the most relevant. Refer to this blog on finding sentence similarity for more details of this process.
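Here is the minimal sketch promised above, with an in-memory list of pre-chunked passages standing in for a real vector database such as pgvector (the passages and the chunking are simplified for brevity):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')

# Step 1: in a real system these chunks would come from splitting documents
chunks = [
    "New Delhi is the capital of India.",
    "New Delhi has a population of 32 million.",
    "Tokyo is the capital of Japan.",
]

# Step 2: embed the chunks; a vector database like pgvector would store these
chunk_embeddings = model.encode(chunks, normalize_embeddings=True)

# Step 3: embed the query and rank chunks by dot product (equal to cosine
# similarity here, since the embeddings are normalised)
query = "What is the capital of India?"
query_embedding = model.encode(query, normalize_embeddings=True)

scores = np.dot(chunk_embeddings, query_embedding)
best = int(np.argmax(scores))
print(chunks[best], scores[best])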

Now, while this process is quite straightforward, it does not work very well in practice, since the query vector and the embedding vector of the relevant passage may not have a high degree of similarity. As an example, let's consider one of the sentence-transformer models available on Hugging Face and use its embeddings to compare a query with some relevant passages.

from sentence_transformers import SentenceTransformer, util

# Load a sentence-transformer model trained for question-answering retrieval
model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')

q = "What is the capital of India?"
q_embedding = model.encode(q, convert_to_tensor=True)

text = "New Delhi is the capital of India."
text_embedding = model.encode(text, convert_to_tensor=True)

# Cosine similarity between the query and the passage
similarity_score = float(util.cos_sim(q_embedding, text_embedding)[0][0])
print(similarity_score)

The similarity score given by this code is 0.752, which is not very high. If we substitute the text in the above code with “India’s capital is located in the northern part of the country.”, we get a similarity score of 0.807, which is clearly higher than 0.752 despite this text not really being the correct answer. So what do we do? Let's look at some of the proposed approaches and their limitations.

Manual Query Construction

If the database is small, a practical and highly accurate approach can be to manually construct the relevant queries for each text passage in the database. Taking the above example of finding the capital of India, let's say the database has the answer “New Delhi is the capital of India”. We can now manually add the query “What is the capital of India?” to the database entry corresponding to this passage. When a user asks a question in this or a related form, we can extract the answer quite accurately. For example, suppose the user asks, “What is India’s capital?”. The similarity score between these two query formats using the above code is 0.995, which is certainly much higher than the similarity between the question and the relevant passage! If the user instead asks, “What is Japan’s capital?”, the similarity score with our manually constructed query drops to 0.790.

Now, this is of course not a scalable solution, since manually entering queries for a large database is practically impossible. Leaving that obvious limitation aside, it is important to note that sentence embeddings are still not very robust. For example, just removing the question mark at the end of the query can significantly change the similarity score with our manually constructed query. So if the user asks, “What is India’s capital”, the similarity score drops to 0.793! And if the user asks, “What is the capital of India”, the similarity score is 0.786. Just removing a question mark dropped the similarity score from almost 1 to around 0.79! We can address this problem to some extent by pre-processing the query, but the underlying fragility remains.
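The numbers quoted above can be checked with the same model; here is a small sketch that compares a few user query variants against the manually stored query (the exact scores may vary slightly across library and model versions):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')

# Query stored manually in the database alongside the passage
# "New Delhi is the capital of India."
stored_query = "What is the capital of India?"
stored_embedding = model.encode(stored_query, convert_to_tensor=True)

user_queries = [
    "What is India's capital?",      # paraphrase of the stored query
    "What is India's capital",       # same paraphrase, question mark removed
    "What is the capital of India",  # original phrasing, question mark removed
    "What is Japan's capital?",      # different question altogether
]

for user_query in user_queries:
    user_embedding = model.encode(user_query, convert_to_tensor=True)
    score = float(util.cos_sim(stored_embedding, user_embedding)[0][0])
    print(f"{user_query!r}: {score:.3f}")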

Contrastive Fine-Tuning

BERT was one of the first models to demonstrate the efficacy of the transformer architecture for language understanding. However, BERT embeddings were not very effective at capturing sentence similarity, and so we saw the advent of a sentence-transformer called SBERT, which is essentially the BERT model fine-tuned on a suitable large dataset to predict sentence similarity. While SBERT is significantly better at predicting sentence similarity, as we saw above, its embedding vectors can change a lot due to even minor modifications in the sentence.

In order to address this problem, several researchers have proposed further contrastive fine-tuning of various LLMs. Contrastive fine-tuning is basically an unsupervised or self-supervised learning approach in which, using data augmentation, similar sentences are brought closer together in the embedding space and dissimilar sentences are pushed further apart. This works much better for image data, since it is much easier to generate similar images using data augmentation techniques like cropping, flipping, rotation and distortion. In the case of Natural Language Processing [NLP], data augmentation is harder, but can be done to some extent using word deletion, reordering or substitution.
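As an illustration of the idea (a minimal sketch on a toy dataset, not a recipe that will actually improve a model), here is how such self-supervised contrastive fine-tuning can be set up with the sentence-transformers library, using random word deletion as a crude augmentation to create positive pairs while the other sentences in a batch act as negatives:

import random
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

def word_dropout(sentence, p=0.1):
    # Crude augmentation: randomly delete words to create a "similar" sentence
    words = sentence.split()
    kept = [w for w in words if random.random() > p]
    return " ".join(kept) if kept else sentence

sentences = [
    "New Delhi is the capital of India.",
    "India's capital is located in the northern part of the country.",
]

# Each sentence and its augmented copy form a positive pair; other sentences
# in the same batch act as in-batch negatives for the contrastive loss
train_examples = [InputExample(texts=[s, word_dropout(s)]) for s in sentences]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)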

Contrastive fine-tuning has been shown to significantly improve sentence embeddings, and thus improve outcomes for information retrieval. However, this approach of using sentence similarity to find relevant passages is useful only for simple queries. For example, if the user just asks, “What is the capital of India”, this method works quite well. But if the user asks, “What is the capital of India and what is its population?”, we run into trouble, since the answer is likely to be spread across two different passages, both of which need to be extracted before being fed into the LLM to generate the final cohesive response. This is called multi-hop information retrieval, and this is where we need to go back to an older, more structured technique: Knowledge Graphs!

Knowledge Graphs + LLMs

Knowledge Graphs were invented in the 1970s to represent information using entities and the relationships between them in the form of a graph structure. They have been a crucial ingredient of many search engine algorithms and recommendation systems. An example of a Knowledge Graph relevant to our query above is shown below: the nodes of the graph represent entities or pieces of information, and the edges between them represent the relationships between these entities.

So now if the user asks, “What is the capital of India and what is its population?”, we can traverse this Knowledge Graph to first find that New Delhi is the capital of India, and then that its population is 32 million. But wait, how did we figure out that the user was asking for the population of New Delhi, and not the population of India? It's not obvious what the word “its” refers to in the above query! That's an inherent ambiguity in all human language, and this is where LLMs come to our rescue. LLMs help in properly processing such queries and converting them into a graph query language such as Cypher, which can then be used to extract the answers from Knowledge Graphs.
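The exact representation and query language depend on the graph database used, so here is only a minimal in-memory sketch in plain Python: the graph is stored as (subject, relation, object) triples, and the two-part question is answered by a two-hop lookup (the entity and relation names are illustrative):

# Knowledge graph stored as (subject, relation, object) triples
triples = [
    ("India", "has_capital", "New Delhi"),
    ("New Delhi", "has_population", "32 million"),
    ("India", "has_population", "about 1.4 billion"),
]

def lookup(subject, relation):
    # Return the object connected to `subject` via `relation`, if any
    for s, r, o in triples:
        if s == subject and r == relation:
            return o
    return None

# "What is the capital of India and what is its population?"
capital = lookup("India", "has_capital")        # hop 1: New Delhi
population = lookup(capital, "has_population")  # hop 2: 32 million
print(capital, population)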

So have we finally found a solution to all our IR problems? Not quite, since constructing these Knowledge Graphs from given text data is actually quite hard. As you might have already guessed, given some text data, we have to extract the entities in this data and the relationships between them, both of which are non-trivial tasks. LLMs do help in this process as well, but the results are again far from perfect.
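As a very rough sketch of how an LLM can assist here, one can prompt it to emit candidate triples from raw text; the prompt, model name and output parsing below are purely illustrative, and the extracted triples would still need validation before going into a Knowledge Graph:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

text = "New Delhi is the capital of India. Its population is 32 million."

prompt = (
    "Extract (subject, relation, object) triples from the text below. "
    "Return one triple per line with the three parts separated by '|'.\n\n" + text
)

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

# Parse the LLM output into triples; real systems need more robust parsing
lines = response.choices[0].message.content.splitlines()
triples = [tuple(part.strip() for part in line.split("|")) for line in lines if line.count("|") == 2]
print(triples)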

Conclusion

The future of Information Retrieval will most likely be a combination of LLMs, manual query construction for a subset of the database, and classical models like Knowledge Graphs. This is likely to spawn a whole IR industry, since a single approach is not going to work for everyone, and each organisation will have to develop its own tailor-made approach. A lucrative option can be to use custom GPTs built on top of GPT-4, but data privacy, security and even cost will always be major concerns with this approach. A cheaper and safer option may be to develop an internal IR tool using open-source LLMs. With the fast improvement in open-source LLM capabilities, this will likely become the default choice for most organisations.


Written by Kushal Shah

Studied at IIT Madras, and was earlier faculty at IIT Delhi. Learn coding with my Python Pal: https://www.pythonpal.org
