
How do Transformers work?

A bit of Transformer history

September 2023: Mistral, a 7-billion-parameter language model that outperforms Llama 2 13B across all evaluated benchmarks, leveraging grouped-query attention for faster inference and sliding window attention to handle sequences of arbitrary length.

Broadly, Transformer models can be grouped into three categories:

  • GPT-like (also called auto-regressive Transformer models)
  • BERT-like (also called auto-encoding Transformer models)
  • T5-like (also called sequence-to-sequence Transformer models)

Transformers are language models

All the Transformer models have been trained on large amounts of raw text in a self-supervised fashion.

Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!
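
As a minimal illustration (not from the course), here is how causal-language-modeling targets can be derived from raw text alone, with no human annotation:

```python
# Illustrative sketch only: the training targets come from the text itself,
# so no human annotation is required.
text = "Transformers are language models"
tokens = text.split()          # stand-in for a real tokenizer

inputs = tokens[:-1]           # ["Transformers", "are", "language"]
targets = tokens[1:]           # ["are", "language", "models"]

for x, y in zip(inputs, targets):
    print(f"input: {x!r} -> target: {y!r}")
```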

This develops a statistical understanding of the language, but it’s less useful for specific practical tasks.

Because of this, the model then goes through a process called transfer learning or fine-tuning: it is fine-tuned in a supervised way — that is, using human-annotated labels — on a given task.

An example is predicting the next word in a sentence having read the n previous words. This is called causal language modeling because the output depends on the past and present inputs, but not the future ones.

Another example is masked language modeling, in which the model predicts a masked word in the sentence.
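
Both objectives can be tried out directly with Hugging Face Transformers pipelines; the gpt2 and bert-base-uncased checkpoints below are just illustrative choices:

```python
from transformers import pipeline

# Causal language modeling: continue the text using only past tokens.
generator = pipeline("text-generation", model="gpt2")
print(generator("This course will teach you how to", max_new_tokens=10))

# Masked language modeling: predict the hidden word from context on both sides.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("This course will teach you all about [MASK] models."))
```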

Transformers are big models

Apart from a few outliers (like DistilBERT), the general strategy to achieve better performance is to increase the models’ sizes as well as the amount of data they are pretrained on.

Pretraining is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.

Fine-tuning, on the other hand, is the training done after a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task.

For example, one could leverage a pretrained model trained on the English language and then fine-tune it on an arXiv corpus, resulting in a science/research-based model. The fine-tuning will only require a limited amount of data: the knowledge the pretrained model has acquired is “transferred,” hence the term transfer learning.

Always try to leverage a pretrained model — one as close as possible to the task you have at hand — and fine-tune it.
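
A minimal fine-tuning sketch with the Trainer API, assuming a binary text-classification task; the bert-base-uncased checkpoint and the GLUE SST-2 dataset are placeholder choices, not prescribed by the text:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"   # any pretrained checkpoint close to your task
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A small labeled dataset; GLUE SST-2 is used here purely as an example.
raw = load_dataset("glue", "sst2")
tokenized = raw.map(lambda ex: tokenizer(ex["sentence"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments("my-finetuned-model", num_train_epochs=1),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,           # enables dynamic padding when batching
)
trainer.train()                    # supervised training starting from pretrained weights
```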

General Transformer architecture

The model is primarily composed of two blocks:

  • Encoder (left): The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
  • Decoder (right): The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

Each of these parts can be used independently, depending on the task:

  • Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
  • Decoder-only models: Good for generative tasks such as text generation.
  • Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.
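
In Hugging Face Transformers, these three families map onto different auto classes; the checkpoints named below are only examples:

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Encoder-only: outputs contextual features, suited to understanding tasks.
encoder_only = AutoModel.from_pretrained("bert-base-uncased")

# Decoder-only: generates text token by token, left to right.
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")

# Encoder-decoder (sequence-to-sequence): maps an input sequence to an output sequence.
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```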

Attention layers

Transformer models are built with special layers called attention layers; hence the title of the original paper, “Attention Is All You Need”!

This layer will tell the model to pay specific attention to certain words in the sentence you passed it.

A word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.
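
Concretely, the scaled dot-product attention from “Attention Is All You Need” computes softmax(QKᵀ/√d_k)·V, so each word’s new representation is a context-weighted mix of the other words’ values. A rough NumPy sketch (not the library’s implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional mask to block positions."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # similarity of each query with each key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # masked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                          # context-weighted mix of the value vectors

# Toy self-attention over 3 "words" with 4-dimensional representations.
x = np.random.default_rng(0).normal(size=(3, 4))
print(scaled_dot_product_attention(x, x, x))    # Q = K = V = x
```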

The original architecture

The Transformer architecture was originally designed for translation.

In the encoder, the attention layers can use all the words in a sentence, since the translation of a given word can depend on what comes after it as well as before it.

The decoder, however, works sequentially and can only pay attention to the words it has already translated (that is, only the words before the word currently being generated).

The original Transformer architecture pairs the encoder (drawn on the left in the paper’s diagram) with the decoder (on the right).

Note that the first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder. It can thus access the whole input sentence to best predict the current word.

The attention mask can also be used in the encoder/decoder to prevent the model from paying attention to some special words — for instance, the ==special padding word==¹ used to make all the inputs the same length when batching together sentences.
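
For instance, when a tokenizer pads a batch to a common length, it also returns an attention_mask whose 0s mark the padding tokens the model should ignore (the checkpoint name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["I love Transformers.", "Attention is, apparently, all you need."],
    padding=True,               # pad the shorter sentence to the longer one's length
    return_tensors="pt",
)
print(batch["input_ids"])       # the shorter sentence ends in padding token ids
print(batch["attention_mask"])  # 1 = real token, 0 = padding to ignore
```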

Architectures vs. checkpoints

  • Architecture: This is the skeleton of the model — the definition of each layer and each operation that happens within the model.
  • Checkpoints: These are the weights that will be loaded in a given architecture.
  • Model: This is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean both. This course will specify architecture or checkpoint when it matters to reduce ambiguity.
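
For example (using BERT purely as an illustration), BertModel is an architecture, while bert-base-cased is one checkpoint trained with that architecture:

```python
from transformers import BertConfig, BertModel

# Architecture alone: the layers are defined, but the weights are randomly initialized.
config = BertConfig()
untrained_model = BertModel(config)

# Architecture + checkpoint: the same layers, now loaded with pretrained weights.
pretrained_model = BertModel.from_pretrained("bert-base-cased")
```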

Footnotes

  1. Why do they need to be formatted to have the same length?