What is a Transformer?
A Transformer is a neural network architecture that has fundamentally changed the approach to artificial intelligence. It was first introduced in the seminal 2017 paper “Attention Is All You Need” and has since become the go-to architecture for deep learning models, powering text-generative models such as OpenAI’s GPT, Meta’s Llama, and Google’s Gemini. Beyond text, Transformers are also applied to audio generation, image recognition, protein structure prediction, and even game playing, demonstrating their versatility across numerous domains.
Core Principles and Architecture
The fundamental concept of a Transformer model revolves around its encoder-decoder structure and self-attention mechanisms.
The Transformer, proposed in the paper “Attention Is All You Need”, is a neural network architecture based solely on self-attention mechanisms and is highly parallelizable. A Transformer model handles variable-sized input using stacks of self-attention layers instead of RNNs or CNNs. This general architecture has a number of advantages:
1. It makes no assumptions about the temporal/spatial relationships across the data, which is ideal for processing a set of objects.
2. Layer outputs can be computed in parallel, instead of sequentially as in an RNN.
3. Distant items can affect each other’s output without passing through many recurrent steps or convolutional layers.
4. It can learn long-range dependencies.
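Points 2 and 3 can be illustrated with a toy self-attention pass in plain Python. This is a simplified sketch: it uses the input vectors directly as queries, keys, and values, whereas a real Transformer applies learned projection matrices. Each position’s output is computed independently from dot products with every other position, so all positions can be processed in parallel and distant items interact in a single step.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of floats
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Simplified self-attention: no learned projections, so queries,
    keys, and values are the input vectors themselves. Every position
    attends to every other position in one step -- no recurrence."""
    d = len(X[0])
    out = []
    for q in X:  # each position is computed independently of the others
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]            # dot product with every position
        weights = softmax(scores)        # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])  # weighted average of values
    return out

# three 2-dimensional "token vectors"
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = self_attention(X)
```

Because each output is a convex combination of all input vectors, every token’s representation is updated by every other token in one layer.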
A detailed, interactive walkthrough can be found in Transformer Explainer: LLM Transformer Model Visually Explained.
Layers of a Transformer
- Embedding
    - Breaks the given input into smaller chunks called tokens.
    - Each token is then associated with a vector (a list of numbers) that encodes the meaning of the chunk.
    - Treating each vector as a coordinate, words with similar meanings tend to land on vectors close to each other in that space.
- Attention
    - The vectors pass through an attention block, where they “talk” to one another to update the meaning of each token based on its context.
- Multilayer Perceptron / Feed-forward layer
    - All vectors undergo the same operation in parallel.
    - They don’t talk to one another.
    - The layer asks a long list of questions of each vector and updates it accordingly.
    - Example questions: Is it part of a bigger word? Is it a noun? A quantity? Assertive?
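The feed-forward step above can be sketched as a tiny two-layer MLP applied to each token vector separately. The weights below are hand-picked for illustration, not learned; the point is that the same function runs on every token with no cross-token interaction.

```python
def relu(x):
    # standard ReLU non-linearity
    return max(0.0, x)

def mlp(vec, W1, b1, W2, b2):
    # two-layer feed-forward network applied to a single token vector
    hidden = [relu(sum(w * x for w, x in zip(row, vec)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

# toy weights (hand-picked for illustration, not learned)
W1 = [[1.0, -1.0], [0.5, 0.5]]; b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]];  b2 = [0.0, 0.0]

tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# each token is transformed independently -- no token "talks" to another
out = [mlp(t, W1, b1, W2, b2) for t in tokens]
```

Because the tokens are processed independently, the list comprehension could be run in parallel without changing the result, which is exactly why this layer parallelizes so well.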
Working of a Transformer:
- The data first enters through the Embedding layer.
- The vectorised tokens are then iterated through the Attention and Feed-forward layers repeatedly.
- The data passes through these layers until the last vector contains the essential meaning of the entire text.
- Finally, an operation is performed on the last vector that produces a probability distribution over all possible next tokens.
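The final step can be sketched as a softmax over one score (logit) per vocabulary token. The vocabulary and logit values below are made up for illustration; in a real model the logits come from multiplying the last vector by a learned “unembedding” matrix.

```python
import math

def softmax(logits):
    # turn arbitrary scores into a probability distribution
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# hypothetical 4-token vocabulary with made-up logits for illustration
vocab = ["the", "cat", "sat", "mat"]
logits = [1.0, 3.0, 0.5, 2.0]

probs = softmax(logits)                 # probabilities sum to 1
# greedy decoding: pick the highest-probability token
next_token = vocab[max(range(len(probs)), key=lambda i: probs[i])]
```

Instead of always taking the argmax, a model can also sample from `probs`, which is what makes generated text vary from run to run.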
Transformer
- First developed by Google in 2017; the original paper applied it to machine translation.
How Chatbots Work
An initial prompt is provided to help set the stage: it frames the exchange as a dialogue between the user and an AI assistant. This dialogue is fed to the Transformer, which then completes the AI assistant’s side of the conversation, one token at a time.
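This dialogue-completion loop can be sketched as follows. The `next_word` function here is a stand-in canned lookup, not a real model; a real Transformer would produce a probability distribution over tokens at each step.

```python
# hypothetical stand-in for a Transformer's next-token prediction:
# a canned lookup from the last word to the next one
CANNED = {"Hello": "there!", "there!": "How", "How": "can",
          "can": "I", "I": "help?"}

def next_word(prompt):
    # predict the next word from the prompt's last word (toy model)
    last = prompt.split()[-1]
    return CANNED.get(last, "help?")

def chat(user_message, max_words=5):
    # frame the exchange as a dialogue script the model keeps completing
    prompt = f"User: {user_message}\nAssistant: Hello"
    for _ in range(max_words):
        prompt += " " + next_word(prompt)  # append one word at a time
    return prompt.split("Assistant: ")[1]

reply = chat("Hi!")
```

The key idea is that the chatbot never “answers”; it repeatedly predicts the next token of a dialogue transcript, and the assistant’s lines emerge from that completion.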