
Everything About Transformers

by: Krupa Dave

The research paper "Attention is All You Need" is regarded as one of the most important & groundbreaking publications in the realm of ML. The paper introduces the transformer architecture and the attention mechanism, yet many still struggle to wrap their head around it.

When I posted my progress update on my encoder block written in CUDA (Python + Numba), a lot of responses echoed a similar theme: "I want to understand how transformers work from the ground up."

This got me thinking. What really helped ME understand the transformer? It was story-telling & illustrations. Every model in the history of language modeling was built to fix a problem the last one could not solve (a chain of fixes that eventually evolved into the transformer). I've also always learned best by looking at illustrations and dissecting complex ideas into visuals I can follow (Jay Alammar's Illustrated Transformer does a good job of this). So why not put these two ideas together: tell the story of how we got to the transformer, and give an illustrated breakdown of the transformer architecture itself.

So, this article isn't a tutorial. It's a guide I wish I had when I first started out - a visual story of how transformers came to life. You'll find personal illustrations, simplified explanations, and links to the resources that helped me most. Whether you're here to build a transformer model from scratch, or simply curious about how we got from neural nets to today's GPTs, I hope this gives you a place to start.

A History Lesson

A language model is a machine learning model that is trained to understand, predict and generate human language. Early attempts used simple feedforward networks which were good at recognizing fixed patterns, but couldn't handle sequences. They had NO memory of order or context.

This gap led to the creation of recurrent neural networks (RNNs). RNNs were the first real step toward giving models memory for sequential data. From there, researchers kept building new model variations, each one created to fix the limitations of the last.

This progression of ideas eventually gave rise to the transformer in 2017. The timeline and chart below outline why each model was introduced, how it worked, and the drawbacks that led to the next improvement.

Timeline of NLP developments from 1950s to 2018+, showing the evolution from feedforward networks to transformers

Timeline of Key Language Model Architectures

Evolution of language models: what each introduced and WHY the next was needed

TL;DR — Why Did Transformers Matter?

To summarize the above chart, what exactly were the older models missing?

  • They struggled with remembering context over long sentences.
  • When they tried, they either forgot too quickly or became unstable.
  • Models that handled memory better were still slow and hard to scale.
  • And compressing whole sequences into one chunk meant losing a lot of detail.

Then came a bold idea. Some Google Brain researchers asked: "What if we dropped recurrence (from RNNs) and convolutions (from CNNs) entirely… and just used attention instead?" That simple shift led to the birth of the Transformer, captured perfectly in the title of their paper, "Attention Is All You Need."

What the Transformer changed:

  • It let models pay attention to all the words at once (self-attention).
  • It trained much faster by processing sequences in parallel.
  • It kept word order with positional encodings.
  • And it scaled effortlessly — powering GPT, BERT, and today's foundation models.

Transformer Architecture: A Thorough Breakdown

The Transformer is composed of 2 main building blocks:

  1. Encoder
  2. Decoder

Inside these blocks, it relies on 5 core mechanisms that work together:

  1. Attention (which comes in 3 main variants: self-attention, cross-attention, and multi-head attention)
  2. Feed-forward networks
  3. Layer normalization
  4. Positional encoding (plus the initial input embeddings that convert words into numbers)
  5. Residual connections

Below is the full original Transformer architecture from the paper Attention Is All You Need (2017). Think of this as a reference point. Don't worry if it looks overwhelming at first. In the rest of this article, I'll break down each component step by step and make sense of how it all fits together.

Full Transformer Architecture Diagram

Transformer Architecture as illustrated in Attention is All You Need (2017). Encoder on the left & Decoder on the right.

We can think of the Transformer as a black box: it takes an input sequence (e.g. the sentence "I like cats" in English) and produces an output sequence (like "J'aime les chats", the same sentence in French). Its power comes from the encoder–decoder structure: the encoder turns the input into a numerical representation, and the decoder uses that representation to generate the output one token at a time. In the original paper, both the encoder and decoder are shown as stacks of six layers (N=6). Below, I've illustrated this structure, breaking it down into the encoder, the decoder, and their repeating layers.

Progressive breakdown of the transformer architecture (3 illustrations): a high-level view of Input -> Transformer -> Output ("I like cats" -> "J'aime les chats"), the internal encoder and decoder blocks, and the detailed stack of 6 encoder layers and 6 decoder layers.
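
To make that wiring concrete, here's a minimal Python/NumPy sketch of the control flow. The names encoder_layer, decoder_layer, and transformer_forward are my own stand-ins (the real layers contain attention and feed-forward sub-layers, which we'll break down below); only the looping over N=6 layers and the handing of the encoder output to the decoder reflect the paper.

```python
import numpy as np

N_LAYERS = 6   # the paper stacks 6 encoder and 6 decoder layers
D_MODEL = 512  # model/embedding dimension used in the original paper

def encoder_layer(x):
    """Stand-in for one encoder layer (self-attention + feed-forward).
    Identity here, just so the sketch runs end to end."""
    return x

def decoder_layer(y, memory):
    """Stand-in for one decoder layer (masked self-attention,
    cross-attention over the encoder output, feed-forward)."""
    return y

def transformer_forward(src_embeddings, tgt_embeddings):
    # Encoder: each layer refines the representation of the source tokens.
    memory = src_embeddings
    for _ in range(N_LAYERS):
        memory = encoder_layer(memory)

    # Decoder: each layer attends to the target so far AND to the encoder output.
    out = tgt_embeddings
    for _ in range(N_LAYERS):
        out = decoder_layer(out, memory)
    return out

# Toy shapes: 3 source tokens ("I like cats"), 4 target tokens generated so far.
src = np.random.randn(3, D_MODEL)
tgt = np.random.randn(4, D_MODEL)
print(transformer_forward(src, tgt).shape)  # (4, 512)
```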

Attention

Now we've reached the mechanism that's been referenced throughout this article, and the one that powers the core of the Transformer architecture. Introducing...Attention.

Before we talk about attention, here's what you need to keep in mind:

  • You now understand how the Transformer turns input sequences into output sequences using encoder and decoder stacks
  • Each token in the input becomes a vector (thanks to embeddings), and each layer refines those vectors. (Note: If you aren't familiar with the terms "tokenization" or "word embeddings" then they are explained in an Aside under "Self-Attention" in the article!)
  • Unlike older models (as illustrated in the evolution of language models chart), transformers do not rely on memory flowing through time. Instead, they rely on direct connections between all tokens via attention!

Hence, Attention is a mechanism that determines the importance of each component in a sequence relative to every other component in that sequence.

With that in mind, we'll now uncover attention in more detail (it is discussed as 3 different variants in the paper: Self-Attention, Multi-Head Attention, and Cross-Attention).

1) Self-Attention

Say you're given the following sentence: "he swung the bat with incredible force". On its own, the word "bat" could mean either an animal, or a baseball bat. However, the CONTEXT of the sentence is what tells us (and a computer) that it's the baseball bat. Attention helps minimize this ambiguity, by not treating the word (eg. bat) in isolation but by "paying attention" to other words in the sequence.

Attention uses context to resolve ambiguity. Here "bat" means baseball, not the animal.

Before Self-Attention, your input sequence must go through 2 steps: tokenization and token embedding.

Aside on Tokenization & Word Embeddings

1) Tokenization

Computers don't understand words, they understand numbers. So, the first step is to break down a sentence into smaller units called tokens. There are different ways to tokenize text:

Different types of tokenization: breaking text into words, subwords or characters
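
Here's a rough sketch of those strategies in plain Python. The word-level and character-level splits are real; the subword split is just a hand-picked illustration, since real subword tokenizers (e.g. BPE or WordPiece) learn their vocabulary from data.

```python
sentence = "he swung the bat with incredible force"

# Word-level: split on whitespace.
word_tokens = sentence.split()
# ['he', 'swung', 'the', 'bat', 'with', 'incredible', 'force']

# Character-level: every character (including spaces) is a token.
char_tokens = list(sentence)
# ['h', 'e', ' ', 's', 'w', 'u', 'n', 'g', ...]

# Subword-level: real tokenizers learn these splits from data;
# this is only a hand-picked example of what one might produce.
subword_tokens = ["he", "sw", "ung", "the", "bat", "with", "incred", "ible", "force"]

print(word_tokens)
print(char_tokens[:8])
print(subword_tokens)
```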

2) Token Embeddings

Then each of your tokens is mapped to a unique token ID from a pre-defined vocabulary. However, transformers cannot process integers directly; they need vectors. That's where token embeddings come in. An embedding matrix is a giant V × d table:

  • V: vocabulary size (the total number of unique tokens a model knows about, eg. 50,000)
  • d: embedding dimension (the length of the vector used to represent each token, i.e. the number of "features" a token is represented by, e.g. 512). It's a hyperparameter of the model.

Turning text into embeddings the model can understand

Each row in this matrix is now a trainable vector representing a token: (example: Transform → Token ID: 1231 → [0.1, 0.3, 0.4, 0.9, 0.8])

Once each input ID is mapped to its embedding, the entire input sentence becomes a 2D tensor!

The key takeaway is that once text is tokenized and mapped into embeddings, each token is now represented as a vector that carries semantic meaning. Similar words or concepts will have embeddings that are close together in this high-dimensional space (eg. bat, swung, hit, ball). These embeddings are what the Transformer operates on.
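
Here's a minimal NumPy sketch of that lookup. The token IDs are made up and the embedding matrix is random, purely to show the shapes; in a real model the matrix is a learned parameter.

```python
import numpy as np

V = 50_000   # vocabulary size (number of unique tokens the model knows)
d = 512      # embedding dimension (length of each token's vector)

# Learned during training in a real model; random here just to show shapes.
embedding_matrix = np.random.randn(V, d).astype(np.float32)

# Token IDs produced by the tokenizer for "I like cats" (made-up IDs).
token_ids = [101, 2057, 8734]

# The embedding "lookup" is just row indexing into the V x d table.
X = embedding_matrix[token_ids]

print(X.shape)  # (3, 512) -> the whole sentence is now a 2D tensor
```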

Attention relies on 3 key inputs for each token (i.e. word in a sequence):

  • A Query Vector (Q): what this token wants to find out
  • A Key Vector (K): what this token offers to others
  • A Value Vector (V): the actual content this token can contribute

Here's an analogy to intuitively understand Q, K and V vectors:

  • Picture doing a simple Google search. The Query is what you type into the search bar (i.e. "how do transformers work?"), the Keys are the titles of the webpages indexed by the search engine, and the Values are the actual contents of those pages.

In self-attention, the Query, Key and Value matrices are all derived from the same input. They are calculated by multiplying the input matrix (X) with a learned weight matrix (W) for each of Q, K and V. These weight matrices are learned during training, via backpropagation on the transformer's loss function.

How Query, Key and Value matrices are derived
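
Here's a minimal NumPy sketch of that projection, with random matrices standing in for the learned weights W_Q, W_K, W_V (d_model = 512 and d_k = 64 follow the paper's defaults):

```python
import numpy as np

d_model = 512  # dimension of the input embeddings
d_k = 64       # dimension of each query/key/value vector (per head, as in the paper)

np.random.seed(0)
X = np.random.randn(3, d_model)        # 3 tokens, each a d_model-dim embedding

# Learned projection matrices (random here; trained via backpropagation in practice).
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

# Each token's embedding is projected into its query, key, and value vectors.
Q = X @ W_Q   # (3, 64)
K = X @ W_K   # (3, 64)
V = X @ W_V   # (3, 64)

print(Q.shape, K.shape, V.shape)
```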

After you type something into Google, you want the search engine to match your query against page titles and keywords to find the most relevant results. In a transformer, this is exactly what happens when your query vector is compared against the key vectors...the model needs a way to measure which ones are most similar or relevant. This is done using a compatibility function, which is nothing other than the dot product (a simple yet powerful way to check how aligned two vectors are). The results of these comparisons are laid out in a compatibility matrix, where higher dot-product values indicate a stronger match between a query and a key.

Compatibility matrix showing how aligned query and key vectors are. Every cell marked "high" is likely to score high in compatibility (i.e. a higher dot-product value)
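
As a minimal sketch, the compatibility matrix is just the matrix product of Q with K transposed, so entry (i, j) is the dot product between token i's query and token j's key (random vectors here, only to show the shapes):

```python
import numpy as np

d_k = 64
np.random.seed(0)
Q = np.random.randn(3, d_k)   # query vectors for 3 tokens
K = np.random.randn(3, d_k)   # key vectors for the same 3 tokens

# Entry (i, j): how well token j's key matches token i's query.
# Larger dot products mean a stronger query-key match.
compatibility = Q @ K.T       # shape (3, 3)
print(compatibility)
```

(In the paper, these raw scores are then scaled by 1/sqrt(d_k) and passed through a softmax to turn them into attention weights.)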

Residual Connections

Residual connections, also known as skip connections, are a key feature of transformer architectures. They help mitigate the vanishing gradient problem by allowing gradients to flow more easily through the network. Essentially, each sub-layer's input is added back onto its own output (the "Add" in the diagram's Add & Norm blocks), so information and gradients can bypass the transformation instead of having to pass through it.
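
Here's a minimal sketch of that Add & Norm step, with a stand-in sub-layer (tanh) and a simple layer norm; only the "add the input back, then normalize" pattern reflects the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x):
    """Stand-in for a self-attention or feed-forward sub-layer."""
    return np.tanh(x)

x = np.random.randn(3, 512)        # 3 tokens, 512-dim each

# Residual (skip) connection: the input is added back to the sub-layer output,
# giving gradients a direct path around the transformation.
out = layer_norm(x + sublayer(x))
print(out.shape)                   # (3, 512)
```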

Rest of the article TO BE CONTINUED...

This is just the beginning of our explorations into transformers. Stay tuned for parts 2 and 3 :)