Embedding

  • The input is tokenized into small chunks (tokens), and each token is converted to a vector
  • The model has a vocabulary of about 50K tokens, and each token in the vocabulary has an associated vector
  • This is called embedding; think of each vector as coordinates in a high-dimensional space. GPT-3 embeds tokens in 12,288 dimensions (a toy sketch follows this list)
  • The vector values are not defined by humans; they are learned by the model during training
  • The vector encodes both the meaning of the token and its position in the input
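
A minimal numpy sketch of the lookup described above, assuming a made-up five-word vocabulary and a tiny embedding dimension; a real model learns these vectors during training, here they are just random:

```python
import numpy as np

# Toy sketch of the embedding step (made-up sizes; GPT-3 uses a ~50K-token
# vocabulary and 12,288-dimensional vectors, learned during training).
vocab = {"the": 0, "eiffel": 1, "tower": 2, "is": 3, "tall": 4}
d_model = 8                                      # toy embedding dimension
rng = np.random.default_rng(0)
W_embed = rng.standard_normal((len(vocab), d_model))  # one vector per token

tokens = ["the", "eiffel", "tower"]              # tokenized input
token_ids = [vocab[t] for t in tokens]
embeddings = W_embed[token_ids]                  # look up one vector per token
print(embeddings.shape)                          # (3, 8)
```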

Insights into Embedding

  • The Python library gensim can be used to explore word embeddings (a sketch follows this list)
  • Words with similar meanings tend to be close to one another
  • The difference (distance) between gender opposites is fairly similar across word pairs, e.g. king - man ≈ queen - woman
  • It appears that individual directions encode specific traits, like gender, World War 2 leaders, and plurals
  • This can be confirmed with dot products: plural nouns consistently have a higher dot product with a "plurality" direction than their singular forms
  • The embeddings contain not just meaning, but also context
  • Personal observation: this may be how the final output probability is chosen. After the final layer, the last vector points to a location in the space, and the probability distribution could be calculated from all the words near that location, with the distance corresponding to the probability
  • Directions correspond to semantic meaning, like gender, size, plural forms
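
A small sketch using gensim's downloader with a pretrained GloVe model (the specific model name here is just one convenient choice, and it requires a download) to check the closeness, direction, and dot-product claims above:

```python
import numpy as np
import gensim.downloader as api

# Downloads a small pretrained word-vector model (one possible choice).
kv = api.load("glove-wiki-gigaword-50")

# Words with similar meaning are close together.
print(kv.most_similar("tower", topn=3))

# Gender opposites differ by a roughly consistent direction:
# king - man + woman lands near "queen".
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# A rough "plurality" direction: plural nouns should have a larger dot
# product with it than their singular forms.
plural_direction = kv["cats"] - kv["cat"]
print(np.dot(kv["dogs"], plural_direction), ">", np.dot(kv["dog"], plural_direction))
```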

Attention

Self Attention

  • Consider the word "tower": depending on the preceding words (e.g. "Eiffel" or "miniature"), its vector value should change
  • The last vector in the input is updated with the contextual meaning of the entire input, so that it contains the full context
  • An attention block consists of many attention heads running in parallel
  1. Identifying which words are relevant for updating which other words
    • A query vector (a "question") is computed for each token; it is a much smaller vector than the embedding
    • Computing this query amounts to multiplying a weight matrix Wq with the embedding vector
    • Wq is multiplied with all the embeddings to get a query vector for each embedding
    • How might this work? Every direction represents a semantic meaning, so there could be one direction for nouns and another for adjectives
    • There is a key matrix Wk, applied the same way as the query matrix Wq; this produces the keys, a second set of vectors
    • Keys are like answers to the queries
      • The dot product of each query with each key is calculated; the dot product is large when the token in that position is relevant to the query
      • The scores can range from -infinity to +infinity
      • This score is a measure of how relevant each word is in updating the meaning of every other word
    • The scores are then used as weights in a weighted sum, so each word contributes according to its relevance
    • A softmax function is applied to the resulting score vectors
    • This score matrix represents how relevant the word on the left (the key vector) is to updating the meaning of the word on the top (the query vector); it is called the Attention Pattern
    • Attention is always determined by the preceding tokens, never by the tokens that follow the current token
    • So in the attention pattern, every entry where a later token would influence an earlier one (i > j) has to be zero
    • We can't simply set those entries to 0, because then the columns would no longer be normalized
    • Instead, the values are set to -infinity before performing softmax, so they become 0 after normalization
    • The size of the attention pattern grows with the square of the context length, so context length is a bottleneck for LLMs
    • Various attention mechanisms exist to scale the context length (a toy sketch of the basic pattern follows this step)
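
A toy numpy sketch of step 1 under the assumptions above: random embeddings, made-up small dimensions, learned-style Wq/Wk matrices, future positions masked with -infinity, then softmax. Here each row corresponds to one query; the notes above describe the same matrix laid out with queries along the top and columns normalized.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n_tokens, d_model, d_head = 4, 8, 3              # toy sizes
rng = np.random.default_rng(0)
E  = rng.standard_normal((n_tokens, d_model))    # token embeddings
Wq = rng.standard_normal((d_model, d_head))      # query weights (learned)
Wk = rng.standard_normal((d_model, d_head))      # key weights (learned)

Q = E @ Wq                                       # one query per token
K = E @ Wk                                       # one key per token

scores = Q @ K.T / np.sqrt(d_head)               # query-key dot products (standard scaling)
# Causal mask: a token never attends to tokens that come after it, so those
# entries are set to -inf before softmax (they become 0 after normalization).
future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[future] = -np.inf

attention_pattern = softmax(scores)              # each row now sums to 1
print(attention_pattern.round(2))
```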
  2. Updating the embeddings, allowing words to pass information to other words they are relevant to
    • simple approach
      • Use a third matrix, the value matrix (consisting of tunable weights)
      • For example, suppose the 1st word should update the meaning of the 2nd word
      • We multiply the value matrix with the 1st word's embedding, then add the result to the 2nd word's embedding
      • The value matrix encodes how the influenced embedding should move when it is being influenced by this vector
      • Multiply the value matrix with every embedding to produce a value vector for each token
      • Weight each value vector by the attention score of each embedding
      • Add all these weighted value vectors to get the change, delta e
      • Add the change to the original embedding; this step ensures the context is properly transferred across tokens
      • Ref image: Single-headed-atttention-transformer.png
      • The entire process is called 'one head of attention'.
      • The value matrix would need to be a square matrix (embedding dimension by embedding dimension), but this causes performance issues when running multiple attention heads in parallel, so the square matrix is factored into two: one ("value down") scales the value into a smaller dimension, and another ("value up") scales it back up to the original dimension
      • GPT-3 uses 96 attention heads per attention block; each head has its own attention mechanism, i.e. its own query, key, and value matrices
      • The sum of all the changes from all the heads is then added to the original embedding
    • multi-headed attention approach: many such heads run in parallel, as described above (a toy sketch follows this list)
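
A toy continuation of the earlier sketch for step 2, assuming an attention pattern is already available; it shows the value-down/value-up factorisation, the weighted sum delta e, and adding the change back to the embeddings:

```python
import numpy as np

n_tokens, d_model, d_head = 4, 8, 3              # toy sizes
rng = np.random.default_rng(1)
E = rng.standard_normal((n_tokens, d_model))     # token embeddings

# Stand-in attention pattern: causal and row-normalized (a real one would
# come from the query/key step sketched earlier).
attn = np.tril(np.ones((n_tokens, n_tokens)))
attn = attn / attn.sum(axis=-1, keepdims=True)

# Instead of one square d_model x d_model value matrix, use two smaller ones.
Wv_down = rng.standard_normal((d_model, d_head)) # project into a smaller space
Wv_up   = rng.standard_normal((d_head, d_model)) # project back up

V = E @ Wv_down @ Wv_up                          # one value vector per token
delta_E = attn @ V                               # weighted sum of value vectors
E_updated = E + delta_E                          # add the change to the embeddings
print(E_updated.shape)                           # (4, 8)
```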

Cross Attention

The original embedding is processed through the Attention and Multilayer Perceptron layers many times, repeatedly, because the updated embedding from the first layer can add more detail to other words in later layers.
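A structural sketch of that repetition, with hypothetical attention() and mlp() placeholders standing in for the real blocks; each returns a change that is added back to the embeddings:

```python
import numpy as np

def attention(E):
    # Placeholder: a real attention block would compute a change from context.
    return np.zeros_like(E)

def mlp(E):
    # Placeholder: the real feed-forward block processes each token vector.
    return np.zeros_like(E)

def transformer_body(E, n_layers=96):            # GPT-3 stacks 96 such layers
    for _ in range(n_layers):
        E = E + attention(E)                     # tokens pass information around
        E = E + mlp(E)                           # per-token processing
    return E
```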

Feed Forward layer

  • Another matrix maps the last vector to a list of ~50K values, one for each token in the vocabulary
  • A normalization function then assigns the probability distribution; this normalization function is called softmax (a toy sketch follows this list)
  • Why don't we use the other vectors, why just the last vector? (unanswered:: true)
  • Because the model predicts every next word simultaneously (one prediction per position during training), this way the model is more efficient
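
A toy sketch of that final mapping, assuming made-up sizes and a random output matrix; the softmax described in the next section turns the resulting logits into probabilities:

```python
import numpy as np

vocab_size, d_model = 10, 8                      # toy sizes (GPT-3: ~50K, 12,288)
rng = np.random.default_rng(2)
W_out = rng.standard_normal((d_model, vocab_size))  # learned output matrix

last_vector = rng.standard_normal(d_model)       # final vector of the sequence
logits = last_vector @ W_out                     # one value per vocabulary token
print(logits.shape)                              # (10,); softmax gives probabilities
```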

Softmax:

  • Each value must be between 0 and 1, and they need to add up to 1; this is done using the softmax function (a minimal example follows)
  • Higher values end up close to 1, and lower values end up close to 0
  • The inputs are called logits
  • The outputs are called probabilities
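
A minimal softmax example showing the properties above, with made-up logit values:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))          # subtract max for stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0])         # inputs: any real numbers
probs = softmax(logits)
print(probs)                                     # larger logits -> values nearer 1
print(probs.sum())                               # 1.0
```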