LLMs for Question Answering and Dense Passage Retrieval [DPR]

Kushal Shah
9 min read · Feb 23, 2024


The demand for Question Answering (QA) systems arises from the vast amount of information accessible on the internet and across multiple databases. Conventional search engines operate by utilizing keywords and presenting an extensive array of URLs linking to documents that potentially contain the answer to a user’s query, necessitating manual sifting by the user. QA systems, on the other hand, seek to simplify this procedure by furnishing pertinent answers directly in response to user questions, thereby conserving time and minimizing effort.

Question Answering (QA) systems boast a multitude of practical applications spanning diverse industries. Within the healthcare domain, they can aid medical professionals in swiftly accessing pertinent information from extensive repositories of medical literature. Within education, QA systems can function as intelligent tutors, furnishing students with immediate feedback on their inquiries. Moreover, in customer service and support roles, QA systems enable businesses to automate responses to commonly asked questions, thereby enhancing operational efficiency and customer satisfaction. Additionally, in the sphere of business and industry, QA systems have the potential to transform decision-making processes by enabling organizations to glean actionable insights from vast volumes of data, streamline operations, and foster innovation. Whether it entails forecasting market trends, recognizing patterns in consumer behavior, or optimizing supply chain logistics, QA systems serve as a potent tool for extracting value from the abundance of digital information.

Information Retrieval [IR] vs Question Answering [QA]

QA may appear analogous to Information Retrieval (IR), which is indeed a closely aligned field. IR lies between traditional search, which presents a list of documents, and QA, which furnishes precise answers to user queries. IR techniques are tailored to retrieve the particular document from the database that is most probable to contain the answer, or failing that, a small subset of documents at most.

In contemporary scenarios, Information Retrieval (IR) systems are typically integrated within Retrieval Augmented Generation (RAG) frameworks, although they can also be coupled with Question Answering (QA) systems. When presented with a user query, any Large Language Model (LLM)-based search engine must initially employ IR to identify one or a few documents that are highly relevant to the query. Subsequently, the search engine may proceed to utilize QA to extract a precise answer from the most pertinent document, or it may opt for Augmented Generation to amalgamate information from these retrieved documents and generate a suitable response. The choice between these approaches depends on the nature of the question posed and the specific objectives for which the search engine is tailored.

In terms of algorithms, IR is primarily based on similarity search in vector databases, which is a relatively simpler technique than fine-tuning LLMs to be able to retrieve the specific answer from documents. We have already covered some IR techniques in another blog, and here we are going to focus on QA systems.

Evolution of Question Answering (QA) Techniques

Throughout the years, various methodologies have emerged to address the complexities of Question Answering (QA). In its nascent stages, QA predominantly revolved around rule-based systems or algorithms centered on keyword matching. While these methodologies proved proficient within specific delimited contexts, they grappled with the intricacies inherent in natural language. Rule-based systems functioned on predefined guidelines and structures to discern pertinent data and formulate responses. Although adept in structured settings characterized by consistent question patterns, they frequently encountered challenges when confronted with linguistic ambiguity or variability. Keyword matching algorithms relied on locating exact or partial matches between user queries and pre-established keywords or phrases within a knowledge base. Despite their simplicity and computational efficiency, these algorithms were constrained by their incapacity to grasp the semantics and context of language beyond the confines of predetermined keywords.

The emergence of Machine Learning (ML) and Deep Learning marked a significant change in QA systems. ML algorithms empowered these systems to discern patterns and correlations within data, facilitating more versatile and adaptable QA methodologies. Deep learning, in particular, transformed the landscape by presenting neural network structures capable of autonomously extracting hierarchical attributes from unprocessed input data.

An essential advancement in utilizing deep learning for Question Answering (QA) emerged with the introduction of transformer architectures. The ascent of transformers marked a substantial shift in QA systems. Examples such as BERT (Bidirectional Encoder Representations from Transformers) showcased notable improvements in performance across various Natural Language Processing (NLP) tasks, including QA. Pre-trained on extensive text corpora, BERT acquired intricate contextual word representations, allowing for fine-tuning on QA-centric datasets to achieve cutting-edge performance levels.

An additional significant advancement in the progression of QA methodologies occurred with the fusion of pre-trained language models alongside tailored architectures designed for QA assignments. Innovations such as ALBERT (A Lite BERT) and RoBERTa (Robustly optimized BERT approach) brought forth refinements and enhancements to the BERT architecture, thereby contributing to notable improvements in QA effectiveness.

Apart from improvements in model architectures, the presence of extensive Question Answering (QA) datasets has been pivotal in propelling advancements in the field. Datasets such as the Stanford Question Answering Dataset (SQuAD) have offered researchers standardized benchmarks, enabling the evaluation and comparison of various QA models. This has promoted healthy competition and hastened the pace of innovation within the field.

Question-Answering Models on HuggingFace

Utilizing the BertForQuestionAnswering class provided by HuggingFace is the conventional approach to running Question Answering (QA) inference with certain fine-tuned BERT-like models. This adapted variant of BERT predicts both the start and end token positions of the answer, given a question and its corresponding context or document.

After tokenizing a given context text into a sequence of tokens, the BertForQuestionAnswering model processes a user question alongside these context tokens. It then predicts, for each context token, the probability of that token being the start position and the end position of the answer. The token with the highest probability of being the start position is selected as the beginning of the answer, and likewise, the token with the highest probability of being the end position is chosen as its end.

This is significantly more involved than BertForSequenceClassification, which simply passes the input sentence embedding to a linear classifier to predict the output.

Let's see an example using the SpanBERT model:

from transformers import BertForQuestionAnswering, AutoTokenizer
from transformers import pipeline

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("SpanBERT/spanbert-large-cased")
model = BertForQuestionAnswering.from_pretrained("mrm8488/spanbert-large-finetuned-squadv2")

# Create the pipeline for question answering
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

question = "What is love?"
context = "'What Is Love' is a song recorded by the artist Haddaway"

predictions = qa_pipeline(question=question, context=context)
print(predictions)
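
Under the hood, the pipeline does roughly the following. Here is a minimal sketch using the same model and tokenizer as above; the real pipeline adds extra handling for long contexts and invalid spans that is omitted here.

import torch

# Tokenize the question and context together as a single input pair
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely start and end token positions from the logits
start_index = torch.argmax(outputs.start_logits)
end_index = torch.argmax(outputs.end_logits)

# Decode the tokens between start and end into the answer string
answer_ids = inputs["input_ids"][0][start_index : end_index + 1]
print(tokenizer.decode(answer_ids, skip_special_tokens=True))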

Dense Passage Retrieval (DPR) and Spider Model

While models like SpanBERT can extract the specific answer to a given question from the given context, in practical settings one usually has to first extract the relevant context from a database of documents. This is the process of Information Retrieval [IR] explained above. Typically, IR is done by computing the sentence embedding of a given query using an LLM, computing its cosine similarities with the sentence embeddings of all the passages in the database, and then extracting the passages with the highest similarity scores. Usually, the same LLM is used to compute the sentence embeddings of both the user query and the documents in the database.
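
As a rough illustration of this retrieval step, here is a minimal sketch that uses mean-pooled BERT embeddings as the sentence representation; the choice of embedding model and pooling strategy is mine, purely for illustration.

import torch
from transformers import AutoModel, AutoTokenizer

emb_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
emb_model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    # Mean-pool the last hidden states to get one vector per text
    batch = emb_tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = emb_model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

passages = ["'What Is Love' is a song recorded by the artist Haddaway",
            "Paris is the capital of France"]
query = "Who recorded the song What Is Love?"

# Rank passages by cosine similarity to the query embedding
scores = torch.nn.functional.cosine_similarity(embed([query]), embed(passages))
print(passages[int(torch.argmax(scores))])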

DPR is a state-of-the-art IR model and, in contrast to most other models, uses different models for encoding the user query and the database passages. It combines the power of dense vector representations with passage retrieval techniques. At its core, DPR uses a Dual Encoder architecture, where one encoder encodes questions and the other encodes passages from a large corpus of documents. By efficiently encoding passages into dense vectors, DPR can quickly retrieve relevant passages for a given question, significantly improving QA performance. The Dual Encoder architecture used by DPR is very similar to the Siamese Networks widely used in face recognition and several other computer vision applications.

Now while the original DPR model gives significant improvements over earlier techniques, one of its limitations is that it still requires a relatively large amount of labeled data for training or fine-tuning. To address this issue, another DPR-like model called Spider was proposed, which uses a form of self-supervised learning that the authors call recurring span retrieval. Instead of creating a labeled dataset for fine-tuning the pre-trained BERT model, Spider takes passages from the same document that share a recurring span and treats them as pseudo question-passage pairs, which are then used to fine-tune the BERT-base model with a contrastive loss function. The contrastive loss, introduced by Yann LeCun, is similar to the binary cross entropy loss commonly used in binary classification. The primary difference is that cross entropy trains the model to predict the correct class for a given input, whereas the contrastive loss trains it to predict whether two given inputs are similar to each other or not.
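
For intuition, here is a minimal sketch of a pairwise contrastive loss of the kind LeCun introduced, written in PyTorch; the margin value and distance function are illustrative choices, and Spider's actual training objective differs in its details.

import torch

def contrastive_loss(emb1, emb2, label, margin=1.0):
    # label = 1 for similar pairs (e.g. two passages sharing a recurring span), 0 for dissimilar pairs
    distance = torch.nn.functional.pairwise_distance(emb1, emb2)
    similar_term = label * distance.pow(2)
    dissimilar_term = (1 - label) * torch.clamp(margin - distance, min=0).pow(2)
    return (similar_term + dissimilar_term).mean()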

Both DPR and Spider models are available on HuggingFace. As mentioned earlier, one should use the DPRQuestionEncoder to encode the user query, and DPRContextEncoder to encode the passages, and then compute cosine similarities as explained in my blog on computing sentence similarity using BERT. Similarly, there is a QuestionEncoder and ContextEncoder for the Spider model as well.
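
Here is a minimal sketch of that workflow using the standard facebook/dpr-* checkpoints on HuggingFace; note that the original DPR paper scores passages with a dot product, but cosine similarity works in the same spirit.

import torch
from transformers import (DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
                          DPRContextEncoder, DPRContextEncoderTokenizer)

# Separate encoders (and tokenizers) for questions and passages
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

passages = ["'What Is Love' is a song recorded by the artist Haddaway",
            "Paris is the capital of France"]
question = "Who recorded the song What Is Love?"

with torch.no_grad():
    q_emb = q_encoder(**q_tokenizer(question, return_tensors="pt")).pooler_output
    ctx_emb = ctx_encoder(**ctx_tokenizer(passages, padding=True, truncation=True, return_tensors="pt")).pooler_output

# Rank passages by similarity to the question embedding
scores = torch.nn.functional.cosine_similarity(q_emb, ctx_emb)
print(passages[int(torch.argmax(scores))])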

QA using the DPR Model

Interestingly, while the DPR model is meant for IR, its HuggingFace repository also provides a companion model called DPRReader that can be used to do QA once the right context has been found. Here is how it works:

import torch
from transformers import DPRReader, DPRReaderTokenizer

model_name = "facebook/dpr-reader-single-nq-base"

tokenizer = DPRReaderTokenizer.from_pretrained(model_name)
model = DPRReader.from_pretrained(model_name)

encoded_inputs = tokenizer(
    questions=["What is love ?"],
    titles=["Haddaway"],
    texts=["'What Is Love' is a song recorded by the artist Haddaway"],
    return_tensors="pt",
)
outputs = model(**encoded_inputs)

# start_logits is a sequence of logits for each token indicating
# the likelihood that the token represents the start of the answer

# end_logits is a sequence of logits for each token indicating
# the likelihood that the token represents the end of the answer

# Normalise the start and end logits into probabilities using softmax
softmax = torch.nn.Softmax(dim=-1)
start_probs = softmax(outputs.start_logits[0])
end_probs = softmax(outputs.end_logits[0])

# Get the most likely start and end positions
start_index = torch.argmax(start_probs)
end_index = torch.argmax(end_probs)

# Extract the answer span from the input token IDs
span = encoded_inputs["input_ids"][0][start_index:end_index + 1]

# Convert the token IDs back to text
answer = tokenizer.decode(span, skip_special_tokens=True)
print(answer)

Now that we have a broad understanding of the DPR models, let's take a detailed look at the SQuAD dataset created by Stanford University for training and evaluating QA models.

The SQuAD Dataset

The Stanford Question Answering Dataset (SQuAD) stands as a cornerstone in the realm of Question Answering research, serving as a benchmark for evaluating the performance of various QA models. Its significance lies not only in its size and diversity but also in its unique characteristics that mimic real-world scenarios. Let’s delve deeper into what makes SQuAD so indispensable for QA research.

SQuAD comprises 130,319 question-answer-context triplets in the training set and 11,873 question-answer-context triplets in the validation set, meticulously curated from a vast array of Wikipedia articles covering a wide range of topics and domains. What sets SQuAD apart is its focus on real questions posed by human annotators, ensuring that the dataset reflects the complexity and diversity of natural language queries encountered in everyday situations. The context field in the dataset provides the passage from which the model is expected to fetch the answer, helping it better understand the question's perspective.

Interestingly, the SQuAD 2.0 dataset contains a lot of unanswerable questions! Out of the 130,319 QA-context triplets in the training set, 43,498 have questions that cannot be answered from the given context, and for these data points the answer column is blank. Similarly, 5,945 of the 11,873 validation triplets are unanswerable. These unanswerable questions were introduced to provide a bigger challenge for QA models, which are now required not only to answer questions that can be answered but also to recognize when a question cannot be answered.

SQuAD Data Preprocessing and Preparation

from datasets import load_dataset
dataset = load_dataset("squad_v2")
dataset

Output (the printed DatasetDict summary should look roughly like this):
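
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})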

Now among all the features (or more simply, columns) in the training set, we are only interested in three of them for the Question Answering system: context, question and answers.

So we pull the values of these three columns into three separate variables for direct use later.

contexts = dataset["train"]["context"]
questions = dataset["train"]["question"]
answers = dataset["train"]["answers"]
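
To verify the unanswerable questions mentioned earlier, we can count how many entries have an empty answers["text"] list; in SQuAD 2.0 an empty list marks an unanswerable question. A quick sketch:

# An empty "text" list in the answers field marks an unanswerable question
unanswerable = sum(1 for ans in answers if len(ans["text"]) == 0)
print(unanswerable)  # 43498 for the training split, as quoted above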

It’s now your task to run the various models on the SQuAD dataset and see how each of them works! All the best!

Co-written with my intern: Sakalya Mitra
