
The Transformer architecture was first introduced in the seminal paper “Attention is All You Need”.

Text-generative Transformers operate on the principle of next-word prediction: given a text prompt, the model predicts the token most likely to come next.

The core innovation and power of Transformers lie in their use of the self-attention mechanism, which allows them to process entire sequences and capture long-range dependencies more effectively than earlier architectures.

Transformer Architecture

Every text-generative Transformer consists of these three key components:

  • Embedding: Text input is divided into smaller units called tokens, which can be words or subwords. These tokens are converted into numerical vectors called embeddings, which capture the semantic meaning of the words.

  • Transformer Block is the fundamental building block of the model that processes and transforms the input data. Each block includes:
    • Attention Mechanism, the core component of the Transformer block. It allows tokens to communicate with other tokens, capturing contextual information and relationships between words.
    • MLP (Multilayer Perceptron) Layer, a feed-forward network that operates on each token independently. While the goal of the attention layer is to route information between tokens, the goal of the MLP is to refine each token’s representation.

  • Output Probabilities: The final linear and softmax layers transform the processed embeddings into probabilities, enabling the model to make predictions about the next token in a sequence.

Embeddings

The embedding layer transforms the input text into a numerical representation that the model can work with.

To convert a prompt into an embedding, we need to 1) tokenize the input, 2) obtain token embeddings, 3) add positional information, and finally 4) sum the token and positional encodings to get the final embedding.

Figure 1. Expanding the embedding layer view, showing how the input prompt is converted to a vector representation. The process involves (1) tokenization, (2) token embedding, (3) positional encoding, and (4) final embedding.

Step 1: Tokenization

Tokenization is the process of breaking down the input text into smaller, more manageable pieces called tokens.

A token can be a word or a subword.

The full vocabulary of tokens is decided before training the model; GPT-2’s vocabulary contains 50,257 tokens.

The vector representations of these tokens are then obtained from the embedding layer.
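A minimal sketch of this step, assuming the tiktoken package (which ships the GPT-2 byte-pair encoding) is available:

```python
import tiktoken

# Load the byte-pair encoding used by GPT-2 (50,257-token vocabulary).
enc = tiktoken.get_encoding("gpt2")

prompt = "Transformers process text as tokens"
token_ids = enc.encode(prompt)                   # list of integer IDs into the vocabulary
tokens = [enc.decode([i]) for i in token_ids]    # the word/subword string behind each ID

print(token_ids)
print(tokens)
print(enc.n_vocab)                               # 50257
```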

Step 2: Token Embedding

GPT-2 (small) represents each token in the vocabulary as a 768-dimensional vector; the dimension of the vector depends on the model. These embedding vectors are stored in a matrix of shape (50,257, 768), containing approximately 39 million parameters! This extensive matrix allows the model to assign semantic meaning to each token.
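A sketch of the lookup with GPT-2 (small)’s dimensions; the randomly initialized matrix here stands in for the learned one:

```python
import torch

vocab_size, d_model = 50_257, 768

# Embedding matrix of shape (50,257, 768): one 768-dimensional vector per vocabulary token.
tok_emb = torch.nn.Embedding(vocab_size, d_model)
print(sum(p.numel() for p in tok_emb.parameters()))   # 38,597,376 -- roughly 39 million parameters

token_ids = torch.tensor([[464, 2746, 318, 257]])     # arbitrary token IDs, shape (batch=1, seq_len=4)
x = tok_emb(token_ids)                                 # shape (1, 4, 768): one vector per token
print(x.shape)
```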

Step 3: Positional Encoding

The embedding layer also encodes information about each token’s position in the input prompt. Different models use various methods for positional encoding: some, like GPT-2, learn positional embeddings during training, while others use fixed sinusoidal functions.

Step 4: Final Embedding

We sum the token and positional encodings to get the final embedding representation.

This representation captures both the semantic meaning of the tokens and their position in the input sequence.
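A sketch of steps 3 and 4, assuming learned positional embeddings (the scheme GPT-2 uses) and random weights as stand-ins for the trained ones:

```python
import torch

vocab_size, d_model, max_positions = 50_257, 768, 1024

tok_emb = torch.nn.Embedding(vocab_size, d_model)      # token embeddings (step 2)
pos_emb = torch.nn.Embedding(max_positions, d_model)   # learned positional embeddings (step 3)

token_ids = torch.tensor([[464, 2746, 318, 257]])      # arbitrary token IDs, shape (1, seq_len=4)
positions = torch.arange(token_ids.shape[1])           # tensor([0, 1, 2, 3])

# Step 4: final embedding = token embedding + positional embedding
x = tok_emb(token_ids) + pos_emb(positions)            # shape (1, 4, 768)
print(x.shape)
```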

Transformer Block

The Transformer block is the core of the Transformer’s processing.

Each block comprises multi-head self-attention and a Multilayer Perceptron (MLP) layer.

Most models consist of multiple such blocks stacked sequentially. The token representations evolve through the layers, from the first block to the last, allowing the model to build up an intricate understanding of each token. This layered approach leads to higher-order representations of the input. The GPT-2 (small) model we are examining consists of 12 such blocks.


Multi-Head Self-Attention

Multi-head self-attention enables the model to focus on relevant parts of the input sequence, allowing it to capture complex relationships and dependencies within the data.

Step 1: Query, Key, and Value Matrices

Figure 2. Computing Query, Key, and Value matrices from the original embedding.

Each token’s embedding vector is transformed into three vectors: Query (Q), Key (K), and Value (V). These vectors are derived by multiplying the input embedding matrix with learned weight matrices for Q, K, and V.

A web search analogy helps to build intuition:

Query (Q) is the search text you type in the search engine bar. This is the token you want to “find more information about”.

Key (K) is the title of each web page in the search result window. It represents the possible tokens the query can attend to.

Value (V) is the actual content of the web pages shown. Once we have matched the appropriate search term (Query) with the relevant results (Key), we want to get the content (Value) of the most relevant pages.

By using these QKV values, the model can calculate attention scores, which determine how much focus each token should receive when generating predictions.
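A minimal sketch of this projection, with GPT-2 (small)’s 768-dimensional embeddings and randomly initialized weight matrices standing in for the learned ones:

```python
import torch

d_model, seq_len = 768, 4
x = torch.randn(1, seq_len, d_model)          # final embeddings from the embedding layer

# Learned projection matrices for Query, Key, and Value.
W_q = torch.nn.Linear(d_model, d_model)
W_k = torch.nn.Linear(d_model, d_model)
W_v = torch.nn.Linear(d_model, d_model)

Q, K, V = W_q(x), W_k(x), W_v(x)              # each of shape (1, seq_len, 768)
print(Q.shape, K.shape, V.shape)
```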

Step 2: Multi-Head Splitting

The Query, Key, and Value vectors are split into multiple heads; in GPT-2 (small)’s case, there are 12 heads. Each head processes a segment of the embeddings independently, capturing different syntactic and semantic relationships. This design facilitates parallel learning of diverse linguistic features, enhancing the model’s representational power.
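A sketch of the split, reshaping the 768-dimensional Q, K, and V vectors into 12 heads of 64 dimensions each (random tensors stand in for the projected vectors):

```python
import torch

d_model, n_heads, seq_len = 768, 12, 4
head_dim = d_model // n_heads                      # 768 / 12 = 64 dimensions per head

Q = torch.randn(1, seq_len, d_model)               # Query, Key, Value from the previous step
K = torch.randn(1, seq_len, d_model)
V = torch.randn(1, seq_len, d_model)

# Reshape (1, seq_len, 768) -> (1, 12, seq_len, 64): each head works on its own 64-dim slice.
def split_heads(t):
    return t.view(1, seq_len, n_heads, head_dim).transpose(1, 2)

Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)
print(Qh.shape)                                    # torch.Size([1, 12, 4, 64])
```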

Step 3: Masked Self-Attention

In each head, we perform masked self-attention calculations. This mechanism allows the model to generate sequences by focusing on relevant parts of the input while preventing access to future tokens.

  • Attention Score: The dot product of Query and Key matrices determines the alignment of each query with each key, producing a square matrix that reflects the relationship between all input tokens.

  • Masking: A mask is applied to the upper triangle of the attention matrix to prevent the model from accessing future tokens, setting these values to negative infinity. The model needs to learn how to predict the next token without “peeking” into the future.

Why should it be masked?

  • Softmax: After masking, the attention scores are converted into probabilities by the softmax operation, which exponentiates each score and normalizes each row so that it sums to one. Each row then indicates the relevance of the current token and every token to its left.
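A sketch of these three sub-steps for a single head (the scores are scaled by the square root of the head dimension, as in the original “Attention is All You Need” paper):

```python
import math
import torch

seq_len, head_dim = 4, 64
Qh = torch.randn(1, seq_len, head_dim)             # Query vectors for one head
Kh = torch.randn(1, seq_len, head_dim)             # Key vectors for one head

# Attention score: dot product of every query with every key -> a (seq_len, seq_len) matrix,
# scaled by sqrt(head_dim).
scores = Qh @ Kh.transpose(-2, -1) / math.sqrt(head_dim)

# Masking: set the upper triangle (future tokens) to -inf so softmax assigns them zero weight.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

# Softmax: each row becomes a probability distribution over the current token and the tokens to its left.
attn = torch.softmax(scores, dim=-1)
print(attn[0])                                     # lower-triangular; each row sums to 1
```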

Step 4: Output and Concatenation

The model uses the masked self-attention scores and multiplies them with the Value matrix to get the final output of the self-attention mechanism. GPT-2 has 12 self-attention heads, each capturing different relationships between tokens. The outputs of these heads are concatenated and passed through a linear projection.
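A sketch of this final step: weighting the Values by the attention probabilities, concatenating the 12 heads back into 768 dimensions, and applying the linear projection (random tensors stand in for the real intermediate values):

```python
import torch

n_heads, seq_len, head_dim = 12, 4, 64
d_model = n_heads * head_dim                           # 768

attn = torch.softmax(torch.randn(1, n_heads, seq_len, seq_len), dim=-1)  # stand-in attention weights
Vh = torch.randn(1, n_heads, seq_len, head_dim)                          # per-head Value vectors

head_out = attn @ Vh                                   # (1, 12, seq_len, 64): attention-weighted Values
# Concatenate heads: (1, 12, seq_len, 64) -> (1, seq_len, 768)
concat = head_out.transpose(1, 2).reshape(1, seq_len, d_model)

W_o = torch.nn.Linear(d_model, d_model)                # learned linear projection
out = W_o(concat)                                      # (1, seq_len, 768), passed on to the MLP layer
print(out.shape)
```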

Why a linear projection?


Multi Layer Perceptron

MLP: Multi-Layer Perceptron, or FFN: Feed-Forward Network

Figure 4. Using the MLP layer to project the self-attention representations into higher dimensions to enhance the model’s representational capacity.

After the multiple heads of self-attention capture the diverse relationships between the input tokens, the concatenated outputs are passed through the Multilayer Perceptron (MLP) layer to enhance the model’s representational capacity. The MLP block consists of two linear transformations with a GELU activation function in between. The first linear transformation increases the dimensionality of the input four-fold from 768 to 3072. The second linear transformation reduces the dimensionality back to the original size of 768, ensuring that the subsequent layers receive inputs of consistent dimensions. Unlike the self-attention mechanism, the MLP processes tokens independently and simply maps them from one representation to another.
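A sketch of the MLP block with GPT-2 (small)’s dimensions, applied to each token position independently:

```python
import torch

d_model = 768

# Two linear transformations with a GELU activation in between: 768 -> 3072 -> 768.
mlp = torch.nn.Sequential(
    torch.nn.Linear(d_model, 4 * d_model),   # expand to 3072 dimensions
    torch.nn.GELU(),
    torch.nn.Linear(4 * d_model, d_model),   # project back down to 768 dimensions
)

x = torch.randn(1, 4, d_model)               # attention output for a 4-token sequence
y = mlp(x)                                   # same shape; each token is processed independently
print(y.shape)                               # torch.Size([1, 4, 768])
```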

Why project to a higher dimension and back?


Output Probabilities

After the input has been processed through all the Transformer blocks, the output is passed through the final linear layer to prepare it for token prediction. This layer projects the final representations into a 50,257-dimensional space, where every token in the vocabulary has a corresponding value called a logit. Since any token can be the next word, this process lets us simply rank the tokens by their likelihood of being that next word. We then apply the softmax function to convert the logits into a probability distribution that sums to one, which allows us to sample the next token based on its likelihood.

Figure 5. Each token in the vocabulary is assigned a probability based on the model’s output logits. These probabilities determine the likelihood of each token being the next word in the sequence.

The final step is to generate the next token by sampling from this distribution. The temperature hyperparameter plays a critical role in this process. Mathematically speaking, it is a very simple operation: the model’s output logits are simply divided by the temperature:

  • temperature = 1: Dividing logits by one has no effect on the softmax outputs.

  • temperature < 1: Lower temperature makes the model more confident and deterministic by sharpening the probability distribution, leading to more predictable outputs.

  • temperature > 1: Higher temperature creates a softer probability distribution, allowing for more randomness in the generated text – what some refer to as model “creativity”.
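A sketch of temperature scaling, with random values standing in for the model’s 50,257 logits:

```python
import torch

logits = torch.randn(50_257)                 # stand-in for the output logits of the final linear layer

def probs_with_temperature(logits, temperature):
    # Divide the logits by the temperature, then apply softmax.
    return torch.softmax(logits / temperature, dim=-1)

for t in (0.5, 1.0, 2.0):
    p = probs_with_temperature(logits, t)
    # Lower temperature concentrates probability mass on the top token; higher temperature spreads it out.
    print(t, p.max().item())
```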

In addition, the sampling process can be further refined using top-k and top-p parameters:

  • top-k sampling: Limits the candidate tokens to the top k tokens with the highest probabilities, filtering out less likely options.

  • top-p sampling: Considers the smallest set of tokens whose cumulative probability exceeds a threshold p, ensuring that only the most likely tokens contribute while still allowing for diversity.

By tuning temperature, top-k, and top-p, you can balance between deterministic and diverse outputs, tailoring the model’s behavior to your specific needs.
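A minimal sketch of temperature, top-k, and top-p combined into one sampling step; this is a simplified illustration, not tied to any particular library’s implementation:

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9):
    probs = torch.softmax(logits / temperature, dim=-1)

    # top-k: keep only the k most probable tokens (returned in descending order).
    topk_probs, topk_ids = torch.topk(probs, top_k)

    # top-p: keep the smallest prefix whose cumulative probability exceeds p.
    cumulative = torch.cumsum(topk_probs, dim=-1)
    keep = (cumulative - topk_probs) < top_p          # keep tokens until the threshold is crossed
    kept_probs, kept_ids = topk_probs[keep], topk_ids[keep]

    # Renormalize the filtered distribution and sample from it.
    kept_probs = kept_probs / kept_probs.sum()
    return kept_ids[torch.multinomial(kept_probs, num_samples=1)].item()

next_token_id = sample_next_token(torch.randn(50_257))
print(next_token_id)
```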


Questions:


Related: