How does GPT do in-context learning?

Kushal Shah
6 min read · Apr 5, 2023

A reader of my blog on Pre-training, fine-tuning and in-context learning in Large Language Models (LLMs) asked, “How is in-context learning performed?” Frankly speaking, this seemingly simple question is actually quite complicated! So I decided to have a conversation with the GPT-powered Bing Chat to see if it could answer this well without getting into murky mathematical details. Here is the transcript of the conversation:

Me :
How does in-context learning work for GPT-like models?

Bing Chat :
In-context learning is a way to use language models like GPT to learn tasks given only a few examples. The model receives a prompt that consists of input-output pairs that demonstrate a task, and then tries to complete a new input based on the previous examples. For example, if you want to use GPT to translate English to French, you can give it some sentences in both languages as examples, and then ask it to translate a new sentence.
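To make Bing's point concrete: such a few-shot prompt is just plain text. Here is a minimal Python sketch; the example pairs and the "English:/French:" format are my own invention, purely for illustration:

```python
# Build a hypothetical few-shot translation prompt from input-output pairs.
examples = [
    ("Hello", "Bonjour"),
    ("Thank you", "Merci"),
]
query = "Good morning"

# Each demonstration pair becomes two lines; the final line is left
# incomplete so the model fills in the translation.
prompt = "\n".join(f"English: {en}\nFrench: {fr}" for en, fr in examples)
prompt += f"\nEnglish: {query}\nFrench:"

print(prompt)
```

The model never sees the "task" as anything other than this text; the demonstrations are simply part of its input sequence.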

Are you interested in learning more about how in-context learning works or how to use it for different tasks?

Me :
I want to understand exactly how in-context learning works, in full detail. I understand pre-training and fine-tuning of GPT-like models, but in-context learning seems very confusing and vague. What exactly happens in this process?

Bing Chat :
In-context learning is a mysterious emergent behavior in large language models (LMs) where the LM performs a task just by conditioning on input-output examples, without optimizing any parameters. The model uses its pre-trained knowledge and generalizes to new tasks by adapting its predictions based on the examples. The model does not explicitly learn any new parameters or rules, but rather implicitly fine-tunes its internal representations based on the context.

Do you want to know more about how to design effective prompts for in-context learning or how to evaluate its performance?

Me :
Can you summarise the first article by Stanford that you recommended in the URLs in fewer words?

Bing Chat :
The article by Stanford explains in-context learning as a way to use language models like GPT-3 to perform new tasks by giving them a few examples. The article proposes a Bayesian inference framework to understand how the model adapts its predictions based on the examples. The article also shows some experiments that demonstrate the effectiveness and limitations of in-context learning.

Is this summary helpful? Do you have any questions about it?

Me :
What is this Bayesian inference framework and how does it work? Kindly tell me more details. Treat me as your student.

Bing Chat :
Sure, I will try to explain it as simply as possible. Bayesian inference is a way of updating our beliefs based on new evidence. For example, if we want to know if it will rain tomorrow, we can start with some prior probability based on our knowledge or experience, and then update it based on new information, such as weather forecasts or cloud observations.
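In code, Bayes' rule for this rain example looks like the following; all the probabilities are invented purely for illustration:

```python
# Bayes' rule: P(rain | cloudy) = P(cloudy | rain) * P(rain) / P(cloudy)
# Every number here is made up for the sake of the example.
p_rain = 0.3                 # prior belief that it will rain
p_cloudy_given_rain = 0.9    # likelihood of clouds if rain is coming
p_cloudy = 0.5               # overall probability of a cloudy sky

# Posterior: updated belief after observing clouds.
p_rain_given_cloudy = p_cloudy_given_rain * p_rain / p_cloudy
print(p_rain_given_cloudy)   # 0.54 — belief in rain rises after seeing clouds
```

The posterior (0.54) is higher than the prior (0.3): the cloudy sky is evidence in favour of rain.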

In the context of in-context learning, Bayesian inference means that the language model uses the examples in the prompt to infer a latent concept that describes the task. For example, if we give the model some examples of translating English to French, the model will infer that the latent concept is translation and will use its pre-trained knowledge to perform translation for new inputs. The model does not explicitly learn any new parameters or rules, but rather adjusts its internal representations based on the evidence in the prompt.
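This latent-concept view can be sketched as a mixture: the model's output probability is a weighted average over candidate concepts, weighted by how plausible each concept is given the prompt. The concepts and numbers below are entirely hypothetical:

```python
# Hypothetical posterior over latent concepts given the prompt, p(c | prompt),
# and the probability of producing the correct output under each concept.
p_concept = {"translation": 0.7, "paraphrase": 0.2, "summarization": 0.1}
p_output_given_concept = {"translation": 0.9, "paraphrase": 0.4, "summarization": 0.2}

# Marginalize over concepts: p(output | prompt) = sum_c p(output | c) * p(c | prompt)
p_output = sum(p_output_given_concept[c] * p for c, p in p_concept.items())
print(round(p_output, 2))  # 0.73
```

Good demonstrations concentrate p(c | prompt) on the right concept, which is why more (and cleaner) examples tend to help.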

Does this make sense? Do you want me to give you an example of how Bayesian inference works in practice?

Me :
Yes, please give me an example.

Bing Chat :
Okay, let’s say we want to use a language model to classify movie reviews as positive or negative. We can start by giving the model a few examples of reviews and their labels, like this:

Review: I loved this movie. It was hilarious and heartwarming. Label: Positive

Review: This movie was terrible. It was boring and predictable. Label: Negative

Review: It was an okay movie. Nothing too special. Label: ???

The model will use these examples to infer that the latent concept is sentiment analysis, and will use its pre-trained knowledge to predict the label for the last review. The model will assign a probability to each possible label based on how likely it is given the evidence in the prompt. For example, the model might assign a probability of 0.8 to positive, 0.1 to negative, and 0.1 to neutral. The model will then choose the label with the highest probability, which is positive in this case.
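The final step described here, picking the highest-probability label, is just an argmax. Using Bing's made-up numbers:

```python
# Probabilities from Bing's example (illustrative, not real model output).
label_probs = {"Positive": 0.8, "Negative": 0.1, "Neutral": 0.1}

# The prediction is the label with the highest probability.
prediction = max(label_probs, key=label_probs.get)
print(prediction)  # Positive
```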

This is how Bayesian inference works in in-context learning. The model updates its beliefs about the task based on the examples in the prompt, and uses those beliefs to make predictions for new inputs.

Did this example help you understand Bayesian inference better?

Me :
Can you please explain this process from the perspective of the layers and weights of the GPT model? How are these prompts represented inside the GPT model, and how does that help it make this inference?

Bing Chat :
That’s a good question. The GPT model is composed of several layers of transformers, which are neural networks that process sequences of tokens. Each token is a piece of text, such as a word or a character. The model assigns a numerical vector to each token, called an embedding, which represents its meaning and context. The model also has weights, which are parameters that determine how the embeddings are transformed and combined by the layers.

The prompts are represented as sequences of tokens that are fed into the model as inputs. The model processes these tokens layer by layer, updating their embeddings based on the weights and the attention mechanism, which allows the model to focus on different parts of the sequence. The final layer produces an output token, which is the prediction of the model.
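The last step Bing describes, the final layer producing an output token, can be sketched numerically: the layer emits one score (a logit) per vocabulary token, and a softmax turns those scores into probabilities. The tiny vocabulary and the scores below are invented for illustration:

```python
import math

vocab = ["positive", "negative", "neutral"]   # toy vocabulary
logits = [2.0, 0.1, 0.3]                      # made-up final-layer scores

# Softmax: exponentiate and normalize so the probabilities sum to 1.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# Greedy decoding picks the most probable token as the output.
next_token = vocab[probs.index(max(probs))]
print(next_token)  # positive
```

A real GPT model does this over a vocabulary of tens of thousands of tokens, and usually samples from the distribution rather than always taking the maximum.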

The prompts help the model make inferences by providing clues about the task and the expected output format. The model uses these clues to adjust its internal representations and weights to match the latent concept of the task. For example, if the prompt contains examples of translation, the model will use its pre-trained knowledge of languages and grammar to modify its embeddings and weights to perform translation better.

Does this answer your question? Do you want me to show you how to visualize the embeddings and weights of the GPT model?

Me :
But there is no fine-tuning happening. So what does it mean to update the embeddings and weights of the model?

Bing Chat :
You are right, there is no fine-tuning happening in in-context learning. The model does not change its weights permanently, but rather temporarily adjusts them based on the prompt. This is called implicit Bayesian inference, which means that the model infers the most likely weights for the task given the evidence in the prompt. The model does this by using a prior distribution over the weights, which is learned from the pre-training data, and a likelihood function, which measures how well the weights fit the prompt. The model then computes a posterior distribution over the weights, which is the updated belief based on the prior and the likelihood. The model uses this posterior distribution to make predictions for new inputs.

The embeddings are also updated based on the prompt, but not by changing their values directly. Rather, the model changes how it attends to different tokens in the sequence, which affects how their embeddings are transformed and combined by the layers. The model learns to attend to tokens that are relevant for the task and ignore those that are not. For example, if the prompt contains examples of sentiment analysis, the model will learn to attend to tokens that express emotions and opinions, and ignore those that are neutral or irrelevant.
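Bing's point about attending to emotion-bearing tokens can be sketched numerically: attention assigns each token a relevance score, and a softmax turns the scores into weights. The tokens and scores below are invented for illustration:

```python
import math

tokens = ["this", "movie", "was", "hilarious"]
scores = [0.1, 0.5, 0.1, 2.0]   # hypothetical relevance scores for one query

# Softmax over the scores gives attention weights that sum to 1.
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# The emotion-bearing token "hilarious" receives the largest weight.
strongest = tokens[weights.index(max(weights))]
print(strongest)  # hilarious
```

Note that none of the model's weights change here: the same fixed attention machinery simply produces different weightings for different prompts, which is the sense in which the prompt "adjusts" the computation.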

Does this clarify your doubt? Do you want me to explain how attention works in the GPT model?

Written by Kushal Shah

Studied at IIT Madras, and was earlier faculty at IIT Delhi. Learn coding with my Python Pal: https://www.pythonpal.org
