How do Transformers work?
A bit of Transformer history
September 2023: Mistral, a 7-billion-parameter language model that outperforms Llama 2 13B across all evaluated benchmarks, leveraging grouped-query attention for faster inference and sliding window attention to handle sequences of arbitrary length.
Broadly, Transformer models fall into three categories:
- GPT-like (also called auto-regressive Transformer models)
- BERT-like (also called auto-encoding Transformer models)
- T5-like (also called sequence-to-sequence Transformer models)
Transformers are language models
All the Transformer models
have been trained on large amounts of raw text in a self-supervised fashion.
Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!
This develops a statistical understanding of the language, but it's less useful for specific practical tasks. Because of this, the pretrained model then goes through a process called transfer learning or fine-tuning: the model is fine-tuned in a supervised way — that is, using human-annotated labels — on a given task.
An example is predicting the next word in a sentence having read the n previous words. This is called causal language modeling because the output depends on the past and present inputs, but not the future ones.
Another example is masked language modeling, in which the model predicts a masked word in the sentence.
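As a rough sketch of these two pretraining objectives, the snippet below uses Hugging Face pipelines; gpt2 and distilbert-base-uncased are just familiar example checkpoints pretrained with causal and masked language modeling, respectively.

```python
from transformers import pipeline

# Causal language modeling: predict the next tokens from the previous ones.
generator = pipeline("text-generation", model="gpt2")
print(generator("This course will teach you", max_new_tokens=10))

# Masked language modeling: predict a masked word from its surrounding context.
unmasker = pipeline("fill-mask", model="distilbert-base-uncased")
print(unmasker("This course will teach you all about [MASK] models."))
```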
Transformers are big models
Apart from a few outliers (like DistilBERT), the general strategy for achieving better performance is to increase the models’ sizes as well as the amount of data they are pretrained on.
Pretraining is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.
Fine-tuning, on the other hand, is the training done after a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task.
For example, one could leverage a pretrained model trained on the English language and then fine-tune it on an arXiv corpus, resulting in a science/research-based model. The fine-tuning will only require a limited amount of data: the knowledge the pretrained model has acquired is “transferred,” hence the term transfer learning.
Always try to leverage a pretrained model — one as close as possible to the task you have at hand — and fine-tune it.
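A minimal sketch of this workflow with the Trainer API; the checkpoint (bert-base-uncased), dataset (IMDb sentiment), and hyperparameters are illustrative assumptions, not a prescribed recipe.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

checkpoint = "bert-base-uncased"  # pretrained model: already "knows" English
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A small labeled dataset for the downstream task (binary sentiment classification).
dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(2000))
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-imdb", num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()  # supervised training that starts from the pretrained weights
```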
General Transformer architecture
The model is primarily composed of two blocks:
- Encoder (left): The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
- Decoder (right): The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.
Each of these parts can be used independently, depending on the task:
- Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
- Decoder-only models: Good for generative tasks such as text generation.
- Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.
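For a concrete feel of the three families, the sketch below loads one commonly used checkpoint of each kind through the pipeline API; the specific checkpoints are just examples among many.

```python
from transformers import pipeline

# Encoder-only (BERT-like): understanding tasks such as named entity recognition.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
print(ner("Hugging Face is based in New York City."))

# Decoder-only (GPT-like): open-ended text generation.
generator = pipeline("text-generation", model="distilgpt2")
print(generator("Transformers are", max_new_tokens=15))

# Encoder-decoder (T5-like): sequence-to-sequence tasks such as translation.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("Transformers are amazing models."))
```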
Attention layers
This layer will tell the model to pay specific attention to certain words in the sentence you passed it. A word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.
The original architecture
The Transformer architecture was originally designed for translation.
In the encoder, the attention layers can use all the words in a sentence. The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated.
In the original Transformer architecture diagram, the encoder is on the left and the decoder is on the right.
Note that the first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder. It can thus access the whole input sentence to best predict the current word.
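One way to picture the decoder's restriction to past positions is the causal (look-ahead) mask applied in its first attention layer. The PyTorch sketch below is not the course's code, just an illustration of the idea.

```python
import torch

seq_len = 5
# Lower-triangular matrix: position i may only attend to positions 0..i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Masked positions get a score of -inf, so softmax gives them zero weight.
scores = torch.randn(seq_len, seq_len)                   # dummy attention scores
scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)                  # each row sums to 1 over allowed positions
print(weights)
```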
The attention mask can also be used in the encoder/decoder to prevent the model from paying attention to some special words — for instance, the ==special padding word== used to make all the inputs the same length when batching together sentences.
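For example, when a tokenizer pads a batch of sentences of different lengths, it also returns an attention mask marking which tokens are real and which are padding (bert-base-uncased here is just an example checkpoint).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["I love Transformers!", "So do I."],
    padding=True,            # pad the shorter sentence up to the longest one
    return_tensors="pt",
)
print(batch["input_ids"])       # padded token ids
print(batch["attention_mask"])  # 1 for real tokens, 0 for padding tokens
```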
Architectures vs. checkpoints
- Architecture: This is the skeleton of the model — the definition of each layer and each operation that happens within the model.
- Checkpoints: These are the weights that will be loaded in a given architecture.
- Model: This is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean both. This course will specify architecture or checkpoint when it matters to reduce ambiguity.
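A small sketch of the distinction, using BERT as the example architecture and bert-base-cased as the example checkpoint:

```python
from transformers import BertConfig, BertModel

# Architecture only: the skeleton, with randomly initialized weights.
config = BertConfig()
untrained_model = BertModel(config)

# Architecture + checkpoint: the same skeleton, loaded with pretrained weights.
pretrained_model = BertModel.from_pretrained("bert-base-cased")
```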
Footnotes
- Why do they need to be formatted to have the same length? ↩