What the ____ are Large Language Models?
Natural Language Processing [NLP] is an old field, almost as old as modern computing itself. In fact, Alan Turing, whose theoretical work gave us the Turing Machine, spent World War II deciphering German messages, and that wartime effort helped spark the subsequent computing revolution. Yet even after the invention of modern computers, while many other CS-related fields progressed rapidly, NLP lagged behind. Despite some early successes on hard problems, there were many decades of stagnation, since the conventional statistics-based algorithms were simply not good enough for the harder, more practically relevant problems.
We needed an algorithm that actually “understands” natural human language at a deeper level instead of just relying on frequency counts of letters and words.
Fill In The Blanks!
Now, “understanding” is a very deep concept, and it’s hard to define what it actually means. But from a computational perspective, it basically means the ability of a model to predict missing words in a sentence. Remember the Fill in the Blanks exercises your English teacher made you do in school? S/he was basically training your brain to become a Language Model! So how would you fill in the blank in the sentence below:
What the ____ are Large Language Models?
I don’t know about you, but I would fill it with “heck” or “hell”. No offensive words please!
So how do machines learn to do this fill-in-the-blanks? Earlier, people built statistical models that would learn the probability of a word given the words preceding it in a sentence. Essentially, using a large volume of text data, these algorithms would estimate the probability (normalised frequency) of words like “heck” or “hell” occurring after the bigram “What the”, and then use these estimated probabilities to fill in the blanks in new sentences. While this approach was a good beginning, it did not give very promising results.
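To make this concrete, here is a minimal sketch of the counting idea (my own addition, using a tiny made-up corpus); real systems were estimated from billions of sentences and used far cleverer smoothing.

```python
from collections import Counter, defaultdict

# Toy corpus -- real models are estimated from billions of sentences.
corpus = [
    "what the heck is going on",
    "what the hell happened here",
    "what the heck are these models",
]

# Count how often each word follows a given bigram (pair of words).
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.lower().split()
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        follow_counts[(w1, w2)][w3] += 1

# Estimate P(word | "what the") as a normalised frequency and fill the blank.
candidates = follow_counts[("what", "the")]
total = sum(candidates.values())
for word, count in candidates.most_common():
    print(word, count / total)   # heck 0.67, hell 0.33
```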
A major breakthrough was made with the advent of Word Embeddings computed using various Neural Network architectures.
Words as Vectors of Numbers
A computer operates on numbers and numbers alone; a piece of text means nothing to it. So whatever we wish to do with computers has to be represented in the form of numbers. In the early days, NLP was done using statistical calculations of letter and word frequencies. But since that approach did not lead to great advances, scientists started thinking of other ways to represent words.
Finally, in 2013, a team at Google came up with a breakthrough idea called Word2Vec, where words were represented as “dense” vectors of floating-point (i.e. decimal) numbers.
What’s “dense” here? The idea of representing words as “sparse” vectors had been in use since the 1960s and 1970s, but there the vectors recorded word occurrences and frequencies, and so had lots of zeros in them. The novelty of Word2Vec was to use “dense” vectors, in which every element is non-zero and is computed by a neural network instead of simple statistical calculations.
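As a quick illustration (with made-up numbers, not actual Word2Vec output), here is what the two kinds of representation look like side by side:

```python
import numpy as np

# Sparse representation: one slot per vocabulary word, almost all zeros.
vocab = ["king", "queen", "horse", "castle", "apple"]
sparse_king = np.zeros(len(vocab))
sparse_king[vocab.index("king")] = 1.0      # e.g. a one-hot / count vector

# Dense representation: every element is non-zero and learned by a neural
# network. Real Word2Vec vectors typically have 100-300 dimensions.
dense_king = np.array([0.21, -0.47, 0.88, 0.05])  # made-up numbers

print(sparse_king)   # [1. 0. 0. 0. 0.]
print(dense_king)    # [ 0.21 -0.47  0.88  0.05]
```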
Every word in Word2Vec was represented by a unique vector, and words which occurred in similar contexts had vector representations which were closer to each other. So, for example, “king” and “queen” would be closer than “king” and “horse”. And how do we measure this closeness? By using cosine similarity: since the vectors are normalised (have unit length), you just take the dot product between them.
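If you want to try this yourself, the sketch below (my addition, not part of the original post) uses the gensim library and its pretrained “word2vec-google-news-300” vectors; the exact similarity scores you get may differ slightly.

```python
import numpy as np
import gensim.downloader as api

# Downloads the pretrained Google News Word2Vec vectors (~1.6 GB) on first use.
model = api.load("word2vec-google-news-300")

print(model.similarity("king", "queen"))   # relatively high
print(model.similarity("king", "horse"))   # noticeably lower

# Cosine similarity by hand: normalise each vector to unit length,
# then take the dot product.
def cosine(u, v):
    return float(np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v)))

print(cosine(model["king"], model["queen"]))
```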
Contextual Word Embeddings
While Word2Vec was a revolutionary idea, a major problem with it was that each word had a single, fixed vector representation.
Now that may be fine for words which have the same meaning in all contexts, but it poses problems for words like “bat” or “watch”, which can mean different things depending on the whole sentence. This problem got solved a few years later with the advent of the ELMo model, released in 2018 by the Allen Institute for Artificial Intelligence. This model was based on LSTMs [Long Short-Term Memory networks], a modified version of Recurrent Neural Networks [RNNs].
In the ELMo model, words don’t have fixed vector embeddings; instead, the embeddings are computed on demand for a given sentence.
So the text has to be passed through the ELMo model for the word embeddings (i.e. vectors) to be generated. For example, in the Word2Vec model, the word “bat” has the same fixed embedding no matter what it means in a given sentence. But in ELMo, the embedding for “bat” depends on the sentence in which it occurs and is not pre-determined.
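The snippet below is a rough sketch of how this looked with the older allennlp library (around version 0.9); the ElmoEmbedder interface was removed in later releases, so treat it as illustrative rather than copy-paste ready.

```python
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()   # downloads the pretrained ELMo weights on first use

sent1 = "the bat flew out of the cave".split()
sent2 = "he swung the bat at the ball".split()

# Each sentence is passed through the model, and every token gets its own
# embedding, so "bat" ends up with a different vector in each sentence.
emb1 = elmo.embed_sentence(sent1)   # shape: (3 layers, num_tokens, 1024)
emb2 = elmo.embed_sentence(sent2)

bat1 = emb1[2][sent1.index("bat")]  # top-layer vector for "bat" in sentence 1
bat2 = emb2[2][sent2.index("bat")]  # ...and in sentence 2 -- the two differ
```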
Large Language Models
While ELMo was a significant advance over Word2Vec, a major problem with LSTM-based models is that they are hard to scale. Because of their sequential nature, LSTM computations are expensive and time-consuming, which means we can’t really make the models as big as we want. They also struggle to remember information across very long sequences, which further limits how far they can be pushed.
In 2017, Google invented a new architecture called the Transformer, which finally provided a solution that was both scalable and accurate. Instead of processing words one after another, Transformers use self-attention to look at all the words in a sentence in parallel.
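To give a flavour of what replaced recurrence, here is a stripped-down NumPy sketch (my own) of scaled dot-product self-attention, the core Transformer operation: a single head, no masking or positional encodings, and random toy weights.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention on a toy scale.

    X: (num_tokens, d_model) token embeddings; Wq/Wk/Wv: learned projections.
    Every token attends to every other token in one matrix multiplication --
    no sequential loop, which is what makes Transformers easier to scale
    than LSTMs.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                            # token-to-token affinities
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)   # softmax over each row
    return weights @ V                                                 # weighted mix of value vectors

# Toy example with random numbers, just to show the shapes.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))          # 5 tokens, 16-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 16)
```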
And models based on the Transformer are called Large Language Models, since they usually have anywhere from a few hundred million parameters (BERT and GPT 1) to a reported trillion or more (GPT 4). Some people like to call models like BERT Small Language Models [SLMs], but I think they are just being miserly. Let’s give the LLM tag to anything as big as or bigger than BERT.
So these LLMs also produce word embeddings, and like ELMo’s, these embeddings are contextual, i.e. they depend on the sentence in which the word occurs. But the LLM embeddings are far more accurate than ELMo’s, and practically the whole field of NLP has now become LLM-centric.
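As a concrete example of contextual LLM embeddings (my own sketch, using the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint), the code below shows that “bat” gets a different vector in two different sentences:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(word, sentence):
    """Return the contextual embedding of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

bat_animal = embedding_of("bat", "the bat flew out of the cave")
bat_cricket = embedding_of("bat", "he hit the ball with his bat")

# The two "bat" vectors differ because the embeddings depend on context.
print(torch.cosine_similarity(bat_animal, bat_cricket, dim=0))
```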
Just Making Sure You Get It Right!
So before you leave: What the ____ are Large Language Models?
These are just algorithms that can accurately predict missing words in a sentence. In the case of models like BERT, we are talking about missing words in the middle, like you did in school. And with generative models like GPT or LLaMA, we are talking about predicting the next words for a given user input (the prompt), just like writing an essay on a given topic. The basic idea remains the same.
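If you want to see both behaviours in action, here is a small sketch (again my own, using the Hugging Face transformers pipelines with the publicly available bert-base-uncased and gpt2 checkpoints):

```python
from transformers import pipeline

# BERT-style: fill in a missing word in the middle of a sentence.
fill_blank = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_blank("What the [MASK] are Large Language Models?", top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))

# GPT-style: predict the next words for a given prompt.
generate = pipeline("text-generation", model="gpt2")
print(generate("Large Language Models are", max_new_tokens=20)[0]["generated_text"])
```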
How do I compute these word embeddings?
If you wish to use these LLMs to compute word embeddings for some sentences and measure sentence similarity, check out my blog below, which only requires basic knowledge of Python programming:
How to find Sentence Similarity using Transformer Embeddings : BERT vs SBERT