Transformers
- This chapter introduces transformers, initially targeted at natural language processing (NLP) problems.
- The network input is a series of high-dimensional embeddings representing words or word fragments.
- Language datasets share some characteristics of image data:
- The number of input variables can be very large.
- Statistics are similar at every position; it’s not sensible to re-learn the meaning of the word dog at every possible position.
- However, text sequences vary in length, and unlike images, there is no easy way to resize them.
Processing text data
To motivate the transformer, consider the following passage:
The restaurant refused to serve me a ham sandwich because it only cooks vegetarian food. In the end, they just gave me two slices of bread. Their ambiance was just as good as the food and service.
The goal is to design a network to process this text into a representation suitable for downstream tasks, such as:
- Sentiment analysis: Classifying the review as positive or negative.
- Question answering: e.g., “Does the restaurant serve steak?”.
We can make three immediate observations:
- Large Encoded Input:
- Even a small 37-word passage with embedding vectors of length 1024 results in \(37 \times 1024 = 37,888\) units.
- For realistic datasets with thousands of words, fully connected layers become impractical.
- Variable Input Lengths:
- Sentences vary in length, making it difficult to apply a standard fully connected network.
- Solution: The network should share parameters across different input positions, analogous to how CNNs share parameters across image regions.
- Ambiguity and Attention:
- Language is inherently ambiguous. For instance, the pronoun it in the first sentence refers to the restaurant, not the ham sandwich.
- To resolve this, the word it must be connected to restaurant.
- Attention: In transformer parlance, the word pays “attention” to related words. These connections must extend across large text spans (e.g., their also refers back to the restaurant).
Image credits: Understanding Deep Learning by Simon J. D. Prince, [CC BY 4.0]
Transformers use dot-product self-attention to achieve two critical properties:
- Parameter sharing: Handling long inputs of varying lengths.
- Context-dependent connections: Linking word representations based on the words themselves.
Mechanics of Self-Attention
A self-attention block \(\text{sa}[\bullet]\) processes \(N\) inputs \(\mathbf{x}_1, \dots, \mathbf{x}_N\) (each \(D \times 1\)) and returns \(N\) outputs of the same dimension.
Step 1: Compute Values For each input \(\mathbf{x}_m\), compute a value vector \(\mathbf{v}_m\): \[\mathbf{v}_m = \mathbf{\beta}_v + \mathbf{\Omega}_v \mathbf{x}_m\] where \(\mathbf{\beta}_v\) and \(\mathbf{\Omega}_v\) are learnable biases and weights.
Step 2: Weighted Summation (Routing) The \(n^{th}\) output is a weighted sum of all values: \[\text{sa}_n[\mathbf{x}_1, \dots, \mathbf{x}_N] = \sum_{m=1}^N a[\mathbf{x}_m, \mathbf{x}_n] \mathbf{v}_m\]
- The scalar weight \(a[\mathbf{x}_m, \mathbf{x}_n]\) represents the attention the \(n^{th}\) output pays to input \(\mathbf{x}_m\).
- These weights are non-negative and sum to one (\(\sum_m a[\mathbf{x}_m, \mathbf{x}_n] = 1\)).
- This effectively routes the values \(\mathbf{v}_m\) in different proportions to create each output (as seen in Figure 1).
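The two steps above can be sketched in a few lines of numpy. This is a minimal illustration with random parameters and hand-picked attention weights; in the real mechanism the weights are computed from the inputs themselves, as described in the following sections:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 3, 4                       # three inputs, each of dimension D

X = rng.standard_normal((D, N))   # inputs x_1 ... x_N stored as columns
beta_v = rng.standard_normal((D, 1))
Omega_v = rng.standard_normal((D, D))

# Step 1: the same linear transformation produces a value for every input
V = beta_v + Omega_v @ X          # column m is the value v_m

# Step 2: route the values with non-negative weights that sum to one.
# These weights are hand-picked for illustration; in self-attention they
# are computed from the inputs via queries and keys.
A = np.array([[0.7, 0.1, 0.2],
              [0.2, 0.6, 0.3],
              [0.1, 0.3, 0.5]])   # A[m, n] = a[x_m, x_n]

Sa = V @ A                        # column n is the n-th output sa_n
```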
Computing and weighting values
The self-attention mechanism is efficient and scalable:
- Computing Values (Linear Scaling):
- The same weights \(\mathbf{\Omega}_v \in \mathbb{R}^{D \times D}\) and biases \(\mathbf{\beta}_v \in \mathbb{R}^D\) are applied to each input \(\mathbf{x}_m\).
- Scalability: This computation scales linearly with the sequence length \(N\).
- Compared to a fully connected network (which would relate all \(DN\) inputs to all \(DN\) values), this uses far fewer parameters.
- This can be viewed as a sparse matrix operation with shared parameters.
- Attention Weights (Quadratic Scaling):
- Attention weights \(a[\mathbf{x}_m, \mathbf{x}_n]\) combine values from across the sequence.
- Sparsity: There is only one weight for each ordered pair \((\mathbf{x}_m, \mathbf{x}_n)\), regardless of the vector dimension \(D\).
- Scalability: The number of attention weights depends quadratically on sequence length \(N\), but is independent of input dimension \(D\).
Computing attention weights
Self-attention is an example of a hypernetwork, where one part of the network computes the weights used by another.
- Nonlinearity: Although values are combined linearly, the overall mechanism is nonlinear because the attention weights are nonlinear functions of the input.
- Queries and Keys: We apply two additional linear transformations to the inputs:
- Queries: \(\mathbf{q}_n = \mathbf{\beta}_q + \mathbf{\Omega}_q \mathbf{x}_n\)
- Keys: \(\mathbf{k}_m = \mathbf{\beta}_k + \mathbf{\Omega}_k \mathbf{x}_m\)
- Dot-Product Attention:
- We compute the similarity between queries and keys using dot products.
- The weights are normalized via the softmax function so they are positive and sum to one: \[a[\mathbf{x}_m, \mathbf{x}_n] = \frac{\exp[\mathbf{k}_m^T \mathbf{q}_n]}{\sum_{m'=1}^N \exp[\mathbf{k}_{m'}^T \mathbf{q}_n]}\]
- Interpretation:
- The weights depend on the relative entry-wise similarity between the \(n^{th}\) query and all keys.
- The softmax causes the keys to compete to contribute to the final result.
- Dimensions:
- Queries and keys must share the same dimension.
- This dimension can differ from the value vectors (which usually match the input dimension \(D\)).
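The weight computation can be sketched directly from these definitions. Here, randomly initialized matrices stand in for the learned query/key parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 3, 4
X = rng.standard_normal((D, N))   # inputs as columns

# Randomly initialized stand-ins for the learned query/key parameters
beta_q, Omega_q = rng.standard_normal((D, 1)), rng.standard_normal((D, D))
beta_k, Omega_k = rng.standard_normal((D, 1)), rng.standard_normal((D, D))

Q = beta_q + Omega_q @ X          # column n is the query q_n
K = beta_k + Omega_k @ X          # column m is the key k_m

# Dot products k_m^T q_n, then softmax over m so that each column is a
# positive distribution that sums to one
logits = K.T @ Q                  # logits[m, n] = k_m^T q_n
A = np.exp(logits - logits.max(axis=0, keepdims=True))
A /= A.sum(axis=0, keepdims=True) # A[m, n] = a[x_m, x_n]
```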
Self-attention summary
The self-attention mechanism provides a flexible way to process sequences:
- Output Computation:
- Each output is a weighted sum of linear transformations of all inputs: \(\mathbf{v}_m = \mathbf{\beta}_v + \mathbf{\Omega}_v \mathbf{x}_m\).
- The weights are positive, sum to one, and represent the similarity between the \(n^{th}\) input and all others.
- Nonlinearity:
- There is no explicit activation function.
- Nonlinearity arises from the dot-product and softmax operations used to determine the weights.
- Key Advantages:
- Shared Parameters: The mechanism uses a single set of learnable parameters \(\phi = \{\mathbf{\beta}_v, \mathbf{\Omega}_v, \mathbf{\beta}_q, \mathbf{\Omega}_q, \mathbf{\beta}_k, \mathbf{\Omega}_k\}\), making it independent of sequence length \(N\).
- Dynamic Connections: Connections between inputs (e.g., words) are determined dynamically based on the inputs themselves via the attention weights.
Matrix form
The self-attention computation can be expressed compactly using matrix operations:
- Matrix Definition:
- Store \(N\) input vectors \(\mathbf{x}_n\) as columns of a \(D \times N\) matrix \(\mathbf{X}\).
- Computing Values, Queries, and Keys:
- The matrices \(\mathbf{V}\), \(\mathbf{Q}\), and \(\mathbf{K}\) are computed as: \[\mathbf{V}[\mathbf{X}] = \mathbf{\beta}_v \mathbf{1}^T + \mathbf{\Omega}_v \mathbf{X}\] \[\mathbf{Q}[\mathbf{X}] = \mathbf{\beta}_q \mathbf{1}^T + \mathbf{\Omega}_q \mathbf{X}\] \[\mathbf{K}[\mathbf{X}] = \mathbf{\beta}_k \mathbf{1}^T + \mathbf{\Omega}_k \mathbf{X}\] where \(\mathbf{1}\) is an \(N \times 1\) vector of ones.
- Self-Attention Computation:
- The final output matrix \(\mathbf{Sa}[\mathbf{X}]\) is: \[\mathbf{Sa}[\mathbf{X}] = \mathbf{V}[\mathbf{X}] \cdot \text{Softmax}\left[\mathbf{K}[\mathbf{X}]^T \mathbf{Q}[\mathbf{X}]\right]\]
- The Softmax function is applied independently to each column of the product matrix.
- Simplified Notation:
- To emphasize that the output is a “triple product” of the inputs, we often drop the explicit dependence on \(\mathbf{X}\) and write: \[\mathbf{Sa}[\mathbf{X}] = \mathbf{V} \cdot \text{Softmax}\left[\mathbf{K}^T \mathbf{Q}\right]\]
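In code, the whole block reduces to a few matrix products. A minimal numpy sketch with random parameters, following the matrix-form equations:

```python
import numpy as np

def softmax_cols(Z):
    """Softmax applied independently to each column of Z."""
    E = np.exp(Z - Z.max(axis=0, keepdims=True))   # stabilized
    return E / E.sum(axis=0, keepdims=True)

def self_attention(X, beta_v, Omega_v, beta_q, Omega_q, beta_k, Omega_k):
    """Sa[X] = V[X] . Softmax[K[X]^T Q[X]], with X of shape (D, N)."""
    ones = np.ones((1, X.shape[1]))                # the vector 1^T
    V = beta_v @ ones + Omega_v @ X
    Q = beta_q @ ones + Omega_q @ X
    K = beta_k @ ones + Omega_k @ X
    return V @ softmax_cols(K.T @ Q)

rng = np.random.default_rng(2)
D, N = 4, 5
params = [rng.standard_normal(s) for s in [(D, 1), (D, D)] * 3]
X = rng.standard_normal((D, N))
Sa = self_attention(X, *params)
```

Permuting the input columns simply permutes the output columns in the same way, a property revisited under positional encoding below.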
Positional encoding
The self-attention mechanism is equivariant to input permutations, meaning it does not naturally account for the order of inputs \(\mathbf{x}_n\). However, in language, sequence order is essential (e.g., “The woman ate the raccoon” vs “The raccoon ate the woman”). To address this, we use positional encodings:
- Absolute Positional Encodings:
- A unique positional matrix \(\mathbf{\Pi}\) is added to the input matrix \(\mathbf{X}\) (as illustrated in Figure 5).
- Each column of \(\mathbf{\Pi}\) is unique and encodes the absolute position within the sequence.
- These can be predefined (hand-crafted) or learned parameters.
- They are typically added at the input or integrated into the computation of queries and keys.
- Relative Positional Encodings:
- For many tasks, the relative position between words is more important than their absolute location.
- This information is encoded directly into the attention matrix based on the offset between key position \(a\) and query position \(b\).
- The model learns a parameter \(\pi_{a,b}\) for each offset to adjust the attention weights accordingly.
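As a concrete example of a predefined absolute encoding, here is a sketch of the widely used sinusoidal scheme; the frequency choices are one common convention, not the only option:

```python
import numpy as np

def sinusoidal_positions(D, N):
    """A D x N matrix Pi with one unique column per absolute position,
    built from sines and cosines at geometrically spaced frequencies."""
    Pi = np.zeros((D, N))
    pos = np.arange(N)
    for i in range(0, D, 2):
        freq = 1.0 / (10000 ** (i / D))
        Pi[i] = np.sin(pos * freq)
        if i + 1 < D:
            Pi[i + 1] = np.cos(pos * freq)
    return Pi

D, N = 8, 16
Pi = sinusoidal_positions(D, N)
X = np.zeros((D, N))              # placeholder input embeddings
X_in = X + Pi                     # the encoding is simply added to X
```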
Scaled dot-product self-attention
- The Problem: Dot products in the attention computation can have large magnitudes, pushing the softmax function into regions with near-zero gradients.
- The Solution: Scale the dot products by the square root of the query/key dimension \(D_q\) to maintain stable gradients during training.
- Formula: \[\mathbf{Sa}[\mathbf{X}] = \mathbf{V} \cdot \text{Softmax}\left[\frac{\mathbf{K}^T \mathbf{Q}}{\sqrt{D_q}}\right]\]
- This is known as scaled dot-product self-attention.
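A quick numerical illustration of why the scaling helps: for vectors with unit-variance entries, dot products grow like \(\sqrt{D_q}\), and unscaled logits of that magnitude tend to saturate the softmax:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(3)
Dq = 512
K = rng.standard_normal((Dq, 8))        # eight random keys
q = rng.standard_normal(Dq)             # one random query

raw = K.T @ q                           # magnitudes grow like sqrt(Dq)
scaled = raw / np.sqrt(Dq)              # back to roughly unit scale

# Unscaled logits typically put nearly all weight on a single key, where
# the softmax gradient is close to zero; the scaled version stays softer.
w_raw, w_scaled = softmax(raw), softmax(scaled)
```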
Multiple heads
- Parallel Mechanisms: Multiple self-attention units (heads) are applied in parallel to capture different types of sequence relationships.
- Unique Parameters: Each head \(h\) has its own learnable weights and biases:
- \(\mathbf{V}_h = \mathbf{\beta}_{vh} \mathbf{1}^T + \mathbf{\Omega}_{vh} \mathbf{X}\)
- \(\mathbf{Q}_h = \mathbf{\beta}_{qh} \mathbf{1}^T + \mathbf{\Omega}_{qh} \mathbf{X}\)
- \(\mathbf{K}_h = \mathbf{\beta}_{kh} \mathbf{1}^T + \mathbf{\Omega}_{kh} \mathbf{X}\)
- Efficiency: For \(H\) heads, each typically operates on a subspace of dimension \(D/H\).
- Aggregation: The resulting head outputs \(\mathbf{Sa}_h[\mathbf{X}]\) are vertically concatenated and combined via a final linear transform \(\mathbf{\Omega}_c\) (see Figure 6): \[\mathbf{MhSa}[\mathbf{X}] = \mathbf{\Omega}_c \left[ \mathbf{Sa}_1[\mathbf{X}]^T, \mathbf{Sa}_2[\mathbf{X}]^T, \dots, \mathbf{Sa}_H[\mathbf{X}]^T \right]^T\]
- Benefits: Multi-head attention is crucial for performance and increases robustness to initialization.
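A sketch of \(H\) heads operating on \(D/H\)-dimensional subspaces (random parameters; biases omitted for brevity):

```python
import numpy as np

def softmax_cols(Z):
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(4)
D, N, H = 8, 6, 2
Dh = D // H                             # each head works in a D/H subspace
X = rng.standard_normal((D, N))

heads = []
for h in range(H):
    # per-head value/query/key parameters (biases omitted in this sketch)
    Ov, Oq, Ok = (rng.standard_normal((Dh, D)) for _ in range(3))
    V, Q, K = Ov @ X, Oq @ X, Ok @ X
    heads.append(V @ softmax_cols(K.T @ Q / np.sqrt(Dh)))

# Vertically concatenate the head outputs, then mix with Omega_c
Omega_c = rng.standard_normal((D, D))
MhSa = Omega_c @ np.vstack(heads)
```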
Self-attention is just one part of a larger transformer layer. This block combines sequence-wide context with point-wise processing:
- Multi-Head Self-Attention: Allows word representations to interact and exchange information across the entire sequence.
- MLP Network: A fully connected network \(\text{mlp}[\mathbf{x}_n]\) that operates independently on each position.
- Residual Connections: Both units use skip connections, adding the output back to the original input to improve gradient flow.
- Layer Normalization:
- A LayerNorm operation is applied after both the self-attention and MLP units.
- Unlike BatchNorm, it normalizes each embedding using statistics calculated across its own \(D\) dimensions.
The complete layer follows this sequence of operations (see Figure 7):
\[\mathbf{X} \leftarrow \mathbf{X} + \mathbf{MhSa}[\mathbf{X}]\] \[\mathbf{X} \leftarrow \text{LayerNorm}[\mathbf{X}]\] \[\mathbf{x}_n \leftarrow \mathbf{x}_n + \text{mlp}[\mathbf{x}_n], \quad \forall n \in \{1,\dots,N\}\] \[\mathbf{X} \leftarrow \text{LayerNorm}[\mathbf{X}],\]
where \(\mathbf{x}_n\) are the column vectors of the data matrix \(\mathbf{X}\).
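The four operations can be sketched end to end. The attention and MLP here are random stand-ins; a ReLU MLP with hidden width \(4D\) is one common choice, not something the equations above prescribe:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalize each column (embedding) across its own D dimensions."""
    return (X - X.mean(axis=0, keepdims=True)) / (X.std(axis=0, keepdims=True) + eps)

def softmax_cols(Z):
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(5)
D, N = 8, 5
X = rng.standard_normal((D, N))

# Random stand-ins for self-attention and the position-wise MLP
Oq, Ok, Ov = (rng.standard_normal((D, D)) for _ in range(3))
W1, W2 = rng.standard_normal((4 * D, D)), rng.standard_normal((D, 4 * D))

attn = (Ov @ X) @ softmax_cols((Ok @ X).T @ (Oq @ X) / np.sqrt(D))
X = layer_norm(X + attn)                          # residual + LayerNorm
X = layer_norm(X + W2 @ np.maximum(W1 @ X, 0))    # residual + LayerNorm
```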
The previous section described the transformer layer. Now we consider its application to Natural Language Processing (NLP) tasks:
- Pipeline: A typical NLP pipeline starts with a tokenizer to split text into manageable units.
- Embeddings: Each token is mapped to a learned embedding vector.
- Processing: These embeddings pass through a series of transformer layers to capture context.
Tokenization
A text processing pipeline begins with a tokenizer. This unit splits text into constituent units called tokens from a predefined vocabulary.
- Challenges with Word-Level Tokens:
- Out-of-Vocabulary (OOV): Names or rare words may not be in the vocabulary.
- Punctuation: Handling punctuation is critical for meaning but difficult with simple word splits.
- Morphological Variations: Suffixes (e.g., walk, walks, walking) are treated as unrelated tokens, losing morphological links.
- The Compromise: Sub-word Tokenization:
- Modern models use sub-word tokenizers (e.g., byte pair encoding).
- This approach includes both common words and frequent word fragments (see Figure 8).
- It merges commonly occurring sub-strings based on frequency, allowing the model to handle rare words as combinations of known sub-words.
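A toy sketch of the merging idea behind byte pair encoding; real tokenizers add end-of-word markers, pretokenization, and many other details omitted here:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy byte-pair encoding: repeatedly merge the most frequent
    adjacent symbol pair. A sketch, not a production tokenizer."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        for w in corpus:                 # apply the merge everywhere
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, corpus

merges, corpus = bpe_merges(["walk", "walks", "walking", "talking"], 3)
```

After three merges, the frequent sub-strings of “walk” have fused into a single token, while rarer words remain combinations of fragments.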
Learned embeddings
Once the sequence of \(N\) tokens is identified, each token is mapped to a unique learned embedding:
- Vocabulary Matrix: The embeddings for the whole vocabulary \(\mathcal{V}\) are stored in a matrix \(\mathbf{\Omega}_e \in \mathbb{R}^{D \times |\mathcal{V}|}\).
- One-Hot Encoding: The \(N\) input tokens are first encoded as a matrix \(\mathbf{T} \in \mathbb{R}^{|\mathcal{V}| \times N}\) of one-hot vectors.
- Matrix Multiplication: The input embeddings are computed as the product \(\mathbf{X} = \mathbf{\Omega}_e \mathbf{T}\) (see Figure 9).
- Network Parameters: A typical configuration (e.g., \(D=1024\) and \(|\mathcal{V}|=30,000\)) contains millions of parameters in \(\mathbf{\Omega}_e\) to learn.
- Note that the same token (e.g., “an”) always starts with the same initial embedding, regardless of position; context is only added by the transformer layers.
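A sketch of the one-hot lookup with a toy vocabulary and a random embedding matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
vocab = ["an", "apple", "a", "day", "keeps"]      # toy vocabulary V
D = 4
Omega_e = rng.standard_normal((D, len(vocab)))    # D x |V| embedding matrix

tokens = ["an", "apple", "an"]                    # N = 3 input tokens
T = np.zeros((len(vocab), len(tokens)))           # |V| x N one-hot matrix
for n, tok in enumerate(tokens):
    T[vocab.index(tok), n] = 1.0

X = Omega_e @ T     # each column of T selects one embedding from Omega_e
```

The first and third columns of \(\mathbf{X}\) are identical, since both positions hold the token “an”.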
Transformer models
Finally, the embedding matrix \(\mathbf{X}\) is passed through a sequence of \(K\) transformer layers, comprising a transformer model. There are three primary types of architectures:
- Encoder: Transforms the input embeddings into a rich representation for diverse tasks.
- Decoder: Used to predict the subsequent tokens to extend the input text sequence.
- Encoder-Decoder: Specialized for sequence-to-sequence tasks (e.g., machine translation), where one string is converted into another.
Image credits: Understanding Deep Learning by Simon J. D. Prince, [CC BY 4.0]
BERT is a prototypical encoder model designed to learn rich, bi-directional representations of language.
- Architecture Details:
- Vocabulary: 30,000 tokens.
- Embeddings: 1024-dimensional word embeddings.
- Layers: 24 transformer layers.
- Attention: 16 heads per layer (query, key, and value dimension = 64).
- Parameters: \(\sim 340\) million.
- Learning Strategy: Transfer Learning:
- Pre-training: Captures general language statistics from a large, unlabeled corpus (3.3 billion words).
- Fine-tuning: Adapts the pre-trained model to specific tasks using smaller labeled datasets.
Pre-training
In the pre-training stage, the model is trained using self-supervision, allowing it to learn from massive amounts of data without manual labels.
- Task: Masked Language Modeling (MLM):
- A small fraction of input tokens are randomly replaced with a special <mask> token.
- The goal is to predict these missing words based on their surrounding context (Figure 10).
- Learning Outcomes:
- Syntax: Learning that adjectives (e.g., red) typically precede nouns (house).
- Superficial Common Sense: Assigning a higher probability to “train” than “peanut” in the context of “The <mask> pulled into the station”.
- Bi-directional Context: Unlike autoregressive models, BERT uses both left and right context to predict missing words.
- Secondary Task: Next-sentence prediction (NSP) aims to determine if two sentence segments were originally adjacent.
Figure 10: Pre-training for BERT-like encoders. The input tokens (together with a <cls> token denoting the start of the sequence) are converted to word embeddings. Here, these are represented as rows rather than columns, so the box labeled “word embeddings” is \(\mathbf{X}^T\). These embeddings are passed through a series of transformer layers (orange connections indicate that every token attends to every other token in these layers) to create a set of output embeddings. A small fraction of the input tokens are randomly replaced with a generic <mask> token. In pre-training, the goal is to predict the missing word from the associated output embedding. To this end, the outputs corresponding to the masked tokens are passed through softmax functions, and a multiclass classification loss (see section 5.24) is applied to each. This task has the advantage that it uses both the left and right context to predict the missing word but has the disadvantage that it does not make efficient use of data; here, seven tokens need to be processed to add two terms to the loss function.
Fine-tuning
During fine-tuning, BERT’s parameters are specialized for a particular task by appending a task-specific output layer.
- Text Classification:
- The embedding of the special <cls> token (placed at the start of every sequence) is mapped to a single value.
- Example: Sentiment Analysis (Figure 11 a), where the output is passed through a sigmoid to predict if a review is positive or negative.
- Word Classification:
- Each input embedding \(\mathbf{x}_n\) is mapped to a vector representing \(E\) entity types.
- Example: Named Entity Recognition (NER) (Figure 11 b), classifying words as persons, places, or organizations.
- Text Span Prediction:
- Used for Question Answering (e.g., SQuAD).
- The model predicts two numbers per token, indicating the probability that the token is the start or end of the answer span.
Figure 11: a) Example text classification task. In this sentiment analysis problem, the <cls> token embedding is used to predict the probability that the review is positive. b) Example word classification task. In this named entity recognition problem, the embedding for each word is used to predict whether the word corresponds to a person, place, or organization, or is not an entity.
- Architecture: GPT-3 is a prototypical decoder model. It consists of a series of transformer layers which, like their encoder counterparts, operate on learned word embeddings.
- Objective: While an encoder builds a rich representation to be fine-tuned for diverse tasks, the decoder has one primary goal: next-token generation.
- Inference: A decoder generates coherent text passages by iteratively feeding each predicted token back into the model as input for the subsequent step.
Language modeling
GPT-3 is an autoregressive language model. Its behavior is best understood by factoring the joint probability \(Pr(t_1, t_2, \dots, t_N)\) of a sequence of tokens.
- Example: Consider the sentence: “It takes great courage to let yourself appear weak”. The joint probability can be factored into a chain of conditional probabilities: \[ \begin{aligned} Pr(\text{sentence}) &= Pr(\text{It}) \times Pr(\text{takes} \,|\, \text{It}) \\ &\quad \times Pr(\text{great} \,|\, \text{It takes}) \\ &\quad \times \cdots \\ &\quad \times Pr(\text{weak} \,|\, \text{It takes great courage to let yourself appear}). \end{aligned} \]
- General Formulation: An autoregressive model directly computes the conditional distribution of each token given all its predecessors, effectively finding: \[Pr(t_1, t_2, \dots, t_N) = Pr(t_1) \prod_{n=2}^{N} Pr(t_n \,|\, t_1, \dots, t_{n-1}) .\]
- Connection: Maximizing the joint probability of a text corpus is equivalent to optimizing the model for the next-token prediction task.
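A tiny numeric check of the factorization, using made-up conditional probabilities:

```python
import math

# Hypothetical conditionals Pr(t_n | t_1, ..., t_{n-1}) for three tokens;
# in a real model each would come from a softmax over the vocabulary.
conditionals = [0.1, 0.4, 0.25]

joint = math.prod(conditionals)                 # Pr(t_1, t_2, t_3)
log_joint = sum(math.log(p) for p in conditionals)

# Maximizing the joint probability is the same as maximizing the sum of
# log conditionals, i.e., the next-token prediction objective.
assert math.isclose(math.log(joint), log_joint)
```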
Masked self-attention
To train a decoder efficiently, we want to maximize the sum of log conditional probabilities for a full sentence in a single forward pass.
- The Problem: Future Information Leakage:
- In a standard self-attention layer, every token can interact with every other token.
- When calculating the probability of a token (e.g., \(\P(\text{great} | \text{It takes})\)), the model could “cheat” by looking at the upcoming tokens in the ground-truth sentence (like “courage,” “to,” etc.) instead of learning to predict the next word.
- The Solution: Causal Masking:
- We ensure tokens only attend to positions at or before their own index.
- Implementation: Set the dot products for future positions to \(-\infty\) before the softmax[·] operation. This results in zero attention weight for all “future” tokens.
- The Decoder Pipeline:
- Tokens are converted to embeddings and passed through a series of transformer layers which use masked self-attention.
- Each output embedding represents a partial sentence.
- A final linear layer maps each embedding to the vocabulary size, followed by a softmax[·] to get probabilities.
- Training: Maximize the sum of log probabilities of the ground-truth next tokens at every position (Figure 12).
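The masking step itself is a one-liner on the matrix of dot products. Here random logits stand in for \(\mathbf{K}^T\mathbf{Q}\), with entry \([m, n]\) holding \(\mathbf{k}_m^T \mathbf{q}_n\):

```python
import numpy as np

def softmax_cols(Z):
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(7)
N = 5
logits = rng.standard_normal((N, N))    # stand-in for K^T Q

# Causal mask: output n may only attend to positions m <= n, so set the
# entries with m > n (strictly below the diagonal) to -inf before softmax
mask = np.tril(np.ones((N, N), dtype=bool), k=-1)
logits[mask] = -np.inf
A = softmax_cols(logits)
```

Each column of A is still a distribution, but the weight on every “future” token is exactly zero; the first position can only attend to itself.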
Generating text from a decoder
The autoregressive nature of the decoder makes it a generative model that can produce plausible text by sampling from its own output distributions.
- Mechanism:
- Start with an input sequence (usually beginning with a special <start> token).
- The model outputs a probability distribution over the vocabulary.
- Sample a token from this distribution or pick the most likely candidate (greedy approach).
- Append the new token and feed the extended sequence back into the model to predict the next word.
- Efficiency: Because of masking, prior embeddings are independent of future tokens and do not change, allowing for efficient recycling of earlier computation.
- Common Search Strategies:
- Beam Search: Maintains multiple candidate completions and tracks their combined probabilities to find the most likely overall sequence.
- Top-k Sampling: Randomly draws the next word only from the \(K\) most likely candidates to ensure coherence and avoid low-probability “linguistic dead ends.”
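A toy generation loop with top-k sampling; the “model” here is a stand-in that returns an arbitrary fixed distribution, just to show the control flow:

```python
import numpy as np

rng = np.random.default_rng(8)
vocab = ["<start>", "the", "cat", "sat", "down", "."]

def toy_model(seq):
    """Stand-in for the decoder: returns a distribution over the
    vocabulary given the sequence so far (here a fixed table)."""
    table = np.abs(np.sin(np.arange(len(vocab)) + len(seq)))
    return table / table.sum()

def top_k_sample(probs, k):
    """Sample only from the k most likely candidates."""
    top = np.argsort(probs)[-k:]
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

seq = ["<start>"]
for _ in range(4):                  # generate four tokens autoregressively
    probs = toy_model(seq)
    seq.append(vocab[top_k_sample(probs, k=3)])
```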
Figure 12: Training a decoder. The tokens are converted to embeddings, with a special <start> token at the beginning of the sequence. The embeddings are passed through a series of transformer layers that use masked self-attention. Here, each position in the sentence can only attend to its own embedding and those of tokens earlier in the sequence (orange connections). The goal at each position is to maximize the probability of the following ground truth token in the sequence. In other words, at position one, we want to maximize the probability of the token It; at position two, we want to maximize the probability of the token takes; and so on. Masked self-attention ensures the system cannot cheat by looking at subsequent inputs. The autoregressive task has the advantage of making efficient use of the data since every word contributes a term to the loss function. However, it only exploits the left context of each word.
GPT-3 and few-shot learning
Large language models (LLMs) like GPT-3 apply these concepts on an unprecedented scale to achieve remarkable text generation and task-solving capabilities.
Large-scale transformers
- GPT-3 Scale:
- Layers: 96 transformer layers.
- Embeddings: Vocabulary entries mapped to vectors of size 12,288.
- Heads: 96 self-attention heads (query, key, value dimension = 128).
- Parameters: 175 billion.
- Training Data: 300 billion tokens.
- Context Window: 2,048 tokens processed simultaneously.
- Example: Coherent Generation:
- Context: “Understanding Deep Learning is a new textbook from MIT Press by Simon Prince that”
- Generation: “’s designed to offer an accessible, broad introduction to the field. …”
- The model can generate large, plausible (though sometimes factually inaccurate) bodies of text.
Zero-shot and few-shot learning
One surprising emergent property of models at this scale is their ability to perform diverse tasks without fine-tuning.
- Few-Shot Learners:
- By providing a few examples of correct question/answer pairs in the prompt, the model “learns” the pattern and completes subsequent tasks correctly.
- Mechanism: Pattern Completion:
- Poor English input: I eated the purple berries.
- Good English output: I ate the purple berries.
- Poor English input: Thank you for picking me as your designer. I’d appreciate it.
- Good English output: Thank you for choosing me as your designer. I appreciate it.
- Versatility: This phenomenon extends to code generation, arithmetic, translation, and general question answering.
- Current Limitations:
- Performance is often erratic in practice.
- It remains unclear to what extent these models and their successors are truly extrapolating original solutions rather than interpolating or copying verbatim from their massive training corpora.
Translation between languages is an example of a sequence-to-sequence task. One common approach uses both an encoder (to compute a good representation of the source sentence) and a decoder (to generate the sentence in the target language). This is aptly called an encoder-decoder model.
- Workflow Example: English to French Translation:
- Encoder Path: Receives the English sentence and processes it through a series of transformer layers to create an output representation for each token.
- Decoder Path: During training, the decoder receives the ground truth translation in French. It passes this through masked self-attention to predict the following word at each position.
- Conditioning: Crucially, the decoder layers also attend to the output of the encoder. Consequently, each French output word is conditioned on both the previous output words and the source English sentence (Figure 12).
- Cross-Attention Mechanism:
- This is achieved by modifying the transformer layers in the decoder.
- Originally, these consisted of a masked self-attention layer followed by a neural network applied individually to each embedding (Figure 12).
- A new attention layer is added between these two components, in which the decoder embeddings attend to the encoder embeddings. This uses a version of attention known as encoder-decoder attention or cross-attention.
- Cross-Attention Logic: The queries are computed from the decoder embeddings, and the keys and values from the encoder embeddings (Figure 13).
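The asymmetry is easy to see in a sketch (random parameters; biases omitted):

```python
import numpy as np

def softmax_cols(Z):
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(9)
D, N_src, N_tgt = 4, 6, 3
X_enc = rng.standard_normal((D, N_src))   # encoder output embeddings (English)
X_dec = rng.standard_normal((D, N_tgt))   # decoder embeddings (French)

Oq, Ok, Ov = (rng.standard_normal((D, D)) for _ in range(3))

Q = Oq @ X_dec          # queries come from the decoder...
K = Ok @ X_enc          # ...keys and values from the encoder
V = Ov @ X_enc

out = V @ softmax_cols(K.T @ Q / np.sqrt(D))   # one output per decoder position
```

The output has one column per decoder position, each a mixture of encoder values.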
Transformers can be adapted to computer vision by treating an image as a sequence of patches.
- Mechanism:
- Patching: Divide the image into \(16 \times 16\) patches (Figure 14).
- Embedding: Each patch is mapped to an input embedding via a learned linear transformation.
- Positional Encoding: Standard 1D positional encodings are added to retain spatial relationships.
- Architecture:
- An encoder-only model that utilizes a special <cls> token (similar to BERT).
- The <cls> token embedding is used for the final classification task.
- Training Strategy:
- Uses supervised pre-training on massive datasets (e.g., 303 million labeled images from 18,000 classes).
- After pre-training, the final classification layer is replaced and fine-tuned for specific tasks.
- Performance & Inductive Bias:
- Achieved competitive results (11.45% top-1 error on ImageNet).
- Conclusion: The strong inductive bias of CNNs (locality, translation invariance) can only be superseded by transformers when they are trained on extremely large-scale data.
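The patching and embedding steps can be sketched for a toy 64×64 image (patch size 16 as in the text; the embedding dimension here is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(10)
H = W = 64
P = 16                                    # 16 x 16 patches, as in the text
image = rng.standard_normal((H, W, 3))

# Cut the image into non-overlapping patches and flatten each one
patches = [image[i:i + P, j:j + P].reshape(-1)
           for i in range(0, H, P) for j in range(0, W, P)]
patches = np.stack(patches, axis=1)       # (P*P*3) x N_patches

D = 32
Omega_p = rng.standard_normal((D, P * P * 3))  # learned linear patch embedding
X = Omega_p @ patches                     # one D-dim embedding per patch
```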
Figure 14: The <cls> token embedding is used to predict the class probabilities.