Multi-Head Self-Attention
Multi-head self-attention enables the model to focus on relevant parts of the input sequence, allowing it to capture complex relationships and dependencies within the data.
Step 1: Query, Key, and Value Matrices
Figure 2. Computing Query, Key, and Value matrices from the original embedding.
Each token's embedding vector is transformed into three vectors: Query (Q), Key (K), and Value (V). These vectors are derived by multiplying the input embedding matrix with learned weight matrices for Q, K, and V.
A web search analogy helps illustrate these roles:
Query (Q) is the search text you type in the search engine bar. This is the token you want to "find more information about".
Key (K) is the title of each web page in the search result window. It represents the possible tokens the query can attend to.
Value (V) is the actual content of the web pages shown. Once we have matched the search term (Query) with the relevant results (Key), we want to retrieve the content (Value) of the most relevant pages.
By using these QKV values, the model can calculate attention scores, which determine how much focus each token should receive when generating predictions.
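As a rough illustration, here is a minimal NumPy sketch of this step for a 6-token sequence; the weight matrices are random placeholders standing in for GPT-2's learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 6, 768                    # 6 input tokens, GPT-2 (small) embedding size
X = rng.standard_normal((seq_len, d_model))  # token embeddings (illustrative values)

# Learned projection weights; in GPT-2 these come from the trained model,
# here they are random stand-ins.
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

Q = X @ W_q   # queries: "what am I looking for?"
K = X @ W_k   # keys:    "what do I contain?"
V = X @ W_v   # values:  "what do I pass along if attended to?"

print(Q.shape, K.shape, V.shape)             # (6, 768) each
```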
Step 2: Multi-Head Splitting
Query, Key, and Value vectors are split into multiple heads (12 in the case of GPT-2 (small)). Each head processes a segment of the embeddings independently, capturing different syntactic and semantic relationships. This design facilitates parallel learning of diverse linguistic features, enhancing the model's representational power.
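A small sketch of the split, again with a random stand-in matrix: each 768-dimensional vector is reshaped so that every one of the 12 heads receives its own 64-dimensional slice of each token.

```python
import numpy as np

n_heads = 12
seq_len, d_model = 6, 768
d_head = d_model // n_heads        # 64 dimensions per head in GPT-2 (small)

Q = np.random.default_rng(1).standard_normal((seq_len, d_model))  # stand-in query matrix

# Reshape (seq_len, d_model) -> (n_heads, seq_len, d_head):
# each head now sees its own 64-dimensional slice of every token.
Q_heads = Q.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
print(Q_heads.shape)               # (12, 6, 64)
```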
Step 3: Masked Self-Attention
In each head, we perform masked self-attention calculations. This mechanism allows the model to generate sequences by focusing on relevant parts of the input while preventing access to future tokens.
- Attention Score: The dot product of the Query and Key matrices determines the alignment of each query with each key, producing a square matrix that reflects the relationship between all input tokens.
- Masking: A mask is applied to the upper triangle of the attention matrix to prevent the model from accessing future tokens, setting these values to negative infinity. Why should it be masked? The model needs to learn how to predict the next token without "peeking" into the future.
- Softmax: After masking, the attention scores are converted into probabilities by the softmax operation, which exponentiates each score and normalizes each row. Each row of the matrix sums to one and indicates the relevance of every token to its left.
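Putting the three sub-steps together for a single head, assuming random stand-in Q and K matrices (GPT-2 also scales the dot products by the square root of the head dimension):

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, d_head = 6, 64
Q = rng.standard_normal((seq_len, d_head))   # one head's queries
K = rng.standard_normal((seq_len, d_head))   # one head's keys

# Attention score: query-key dot products, scaled by sqrt(d_head).
scores = Q @ K.T / np.sqrt(d_head)           # (seq_len, seq_len)

# Masking: set the upper triangle (future positions) to negative infinity.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax: each row becomes a probability distribution over earlier tokens.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.round(2))                      # rows sum to 1; zeros above the diagonal
```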
Step 4: Output and Concatenation
The model multiplies the masked self-attention scores by the Value matrix to get the final output of the self-attention mechanism. GPT-2 (small) has 12 self-attention heads, each capturing different relationships between tokens. The outputs of these heads are concatenated and passed through a linear projection.
Why a linear projection? Concatenation alone just places the heads side by side; the linear projection lets the model learn how to mix information across all 12 heads and maps the result back to the embedding dimension expected by the rest of the network.
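A sketch of this output step, using random stand-ins for the per-head attention weights, Value matrices, and the learned output projection:

```python
import numpy as np

rng = np.random.default_rng(3)
n_heads, seq_len, d_head = 12, 6, 64
d_model = n_heads * d_head                              # 768

attn = rng.random((n_heads, seq_len, seq_len))          # per-head attention weights (stand-ins)
attn /= attn.sum(axis=-1, keepdims=True)                # rows sum to 1, like softmax output
V = rng.standard_normal((n_heads, seq_len, d_head))     # per-head value matrices

# Each head: weighted sum of its value vectors.
head_out = attn @ V                                     # (12, 6, 64)

# Concatenate the heads back into one (seq_len, d_model) matrix...
concat = head_out.transpose(1, 0, 2).reshape(seq_len, d_model)

# ...and mix them with the output projection (random placeholder for the learned weights).
W_o = rng.standard_normal((d_model, d_model))
out = concat @ W_o
print(out.shape)                                        # (6, 768)
```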