How to Find Sentence Similarity Using Transformer Embeddings: BERT vs SBERT
A computer program may appear to take text and images as input and give text and images as output, but on the inside it is just a number-crunching machine. All that a computer understands is numbers, and all that it can do is arithmetic: addition, subtraction, multiplication and division. Yes, just by combining these four simple operations, it is able to do all the magic that we see with software and AI!
So when you type something into ChatGPT, it actually converts that text to a vector of numbers (input), on which a lot of complicated operations are performed to generate another vector of numbers (output), which is then converted back to text that you can read and comprehend. These vectors of numbers that the algorithms use to do all the processing are called EMBEDDINGS! Embeddings are used by modern-day AI algorithms to represent text, images, videos and all other kinds of inputs.
Embeddings by themselves are not a new concept, but the reason Transformers (introduced in 2017 by a Google team) became so popular is that they compute these embeddings in a way that is both accurate and much faster, by allowing for massive parallelisation. Since then, we have seen the development of many different Transformer-based models with varying capacities for various tasks. One of the most popular Transformer-based models is BERT, which was introduced in 2018, again by a Google team. These Deep Learning models are now popularly known as Large Language Models [LLMs].
Note that although both BERT and GPT are based on Transformers, they have significant differences in their architecture and capabilities. Without getting into the murky details, remember that GPT is largely designed for text generation, whereas BERT is designed for classification, Named Entity Recognition, and other such supervised learning tasks. While you can use BERT for text generation and GPT for classification, it is generally advisable to use the appropriate model for each given task, just like you wouldn't hire Albert Einstein to coach your football team, or Lionel Messi to discover the laws of physics.
So let's roll up our sleeves and get to work! In this blog, we will learn how to use BERT and its close cousin, SBERT, to find the similarity between two sentences. The process is actually quite simple:
Step 1: Find the embedding corresponding to each given sentence. As we learnt, an embedding is nothing but a vector of numbers that represents the entire sentence.
Step 2: Find the similarity between these embeddings. In terms of vector algebra, it's just a matter of taking a dot product between the (normalised) vectors, which is exactly what cosine similarity computes (see the small sketch below).
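To make Step 2 concrete, here is a tiny sketch using plain NumPy and two made-up 4-dimensional vectors (real sentence embeddings have hundreds of dimensions), showing that cosine similarity is just a normalised dot product:
import numpy as np
# Two made-up embedding vectors, purely for illustration
emb1 = np.array([0.2, 0.7, 0.1, 0.5])
emb2 = np.array([0.3, 0.6, 0.0, 0.4])
# Cosine similarity = dot product divided by the product of the vector lengths
cos_sim = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
print(cos_sim)   # a value between -1 (opposite directions) and 1 (same direction)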
Sentence Similarity using BERT
First we will see how to use BERT to compute the embeddings for each sentence. There are a lot of complicated processes that take place in the background when we run this model, but we can ignore all of that for the time being and just focus on how to get the desired output. So here is the code with comments. Hope it makes sense to you.
# Install the Transformer model from Hugging Face
!pip install transformers
# Import the required libraries from Hugging Face Transformers
from transformers import AutoTokenizer, AutoModel
# There are a lot of BERT based models available on HuggingFace,
# and you have to pick one that is suitable for you.
BERT_Model = "bert-base-uncased"
# Initialise the BERT Transformer model
tokenizer = AutoTokenizer.from_pretrained(BERT_Model)
model = AutoModel.from_pretrained(BERT_Model)
# Function to compute the sentence embedding using BERT
def sent_embedding(sent):
    # Tokenize the sentence
    # This converts the sentence into a sequence of tokens
    # Each token is either a complete word or a sub-word
    tokens = tokenizer.encode_plus(sent, max_length=128, truncation=True,
                                   padding='max_length', return_tensors='pt')
    # Now feed the tokens into the model and get the embeddings as the output
    outputs = model(**tokens)
    # Create an empty list to store two different kinds of embeddings
    embedding_list = []
    # last_hidden_state contains the output of the last hidden layer for all the sentence tokens.
    # pooler_output contains the embedding corresponding to only the [CLS] token, which in a way represents the whole sentence.
    # This pooler_output is, however, different from the embedding corresponding to the 1st token of last_hidden_state:
    # although both represent the [CLS] token, the pooler_output has passed through an additional dense layer and tanh activation,
    # which makes it more suitable for use in sentence classification tasks.
    # This stores the embedding corresponding to the [CLS] token
    embedding_list.append(outputs.last_hidden_state[0][0].detach().numpy().reshape(1,-1))
    # This stores the embedding corresponding to the pooler_output
    embedding_list.append(outputs.pooler_output.detach().numpy())
    return embedding_list
sent1 = "I am a boy."
sent2 = "What are you doing?"
from sklearn.metrics.pairwise import cosine_similarity
# Sentence similarity using CLS token embedding
print(cosine_similarity(sent_embedding(sent1)[0],sent_embedding(sent2)[0]))
# Sentence similarity using pooler_output
print(cosine_similarity(sent_embedding(sent1)[1],sent_embedding(sent2)[1]))
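As an aside, another common way to turn raw BERT outputs into a single sentence vector is mean pooling: averaging the token embeddings in last_hidden_state while ignoring the padding tokens. This is not part of the code above, so treat the following as a rough sketch (the helper name mean_pooled_embedding is made up), reusing the tokenizer and model objects we already initialised:
import torch
# Hypothetical helper: mean pooling over the last hidden layer, ignoring padding tokens
def mean_pooled_embedding(sent):
    tokens = tokenizer.encode_plus(sent, max_length=128, truncation=True,
                                   padding='max_length', return_tensors='pt')
    with torch.no_grad():
        outputs = model(**tokens)
    # attention_mask is 1 for real tokens and 0 for padding, so use it to average only the real tokens
    mask = tokens['attention_mask'].unsqueeze(-1)            # shape: (1, 128, 1)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)   # sum of the real token embeddings
    counts = mask.sum(dim=1)                                 # number of real tokens
    return (summed / counts).numpy()                         # shape: (1, 768)
# Sentence similarity using the mean pooled embedding
print(cosine_similarity(mean_pooled_embedding(sent1), mean_pooled_embedding(sent2)))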
The problem with this raw BERT model is that while it can give you an output, the similarity scores obtained are not very accurate. This is because it was pre-trained for a different purpose (predicting masked words). To get better accuracy for sentence similarity computation, some other folks from Germany fine-tuned this BERT model specifically for sentence similarity tasks. If you are confused about terms like pre-training and fine-tuning, check out my blog on this very important topic.
Sentence Similarity using Sentence BERT [SBERT]
This code is going to be a lot simpler than the BERT code we saw above, and it also gives much better accuracy when computing the similarity between two sentences. Here it is:
# Install the Sentence Transformer library
!pip install --upgrade sentence-transformers
# Import the Sentence Transformer library
from sentence_transformers import SentenceTransformer, util
# There are several different Sentence Transformer models available on Hugging Face
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sent1 = "I am a boy."
sent2 = "What are you doing?"
# Convert the sentences into embeddings using the Sentence Transformer
sent_embedding1 = model.encode(sent1,convert_to_tensor=True)
sent_embedding2 = model.encode(sent2,convert_to_tensor=True)
# Find the similarity between the two embeddings
util.pytorch_cos_sim(sent_embedding1, sent_embedding2)
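If you want to compare more than two sentences, the same library lets you encode a whole list in one call and compute a full pairwise similarity matrix. A small sketch (the extra sentences here are just made up for illustration):
sentences = ["I am a boy.",
             "What are you doing?",
             "I am a young man.",
             "What are you up to?"]
# Encode all the sentences in a single call
embeddings = model.encode(sentences, convert_to_tensor=True)
# Entry [i][j] of this matrix is the cosine similarity between sentence i and sentence j
similarity_matrix = util.pytorch_cos_sim(embeddings, embeddings)
print(similarity_matrix)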
Hope this made sense to you! Please do try it out with whatever sentences you can think of.