Understanding Attention from Scratch
The attention mechanism is the core innovation behind transformers - the architecture powering GPT, BERT, and modern AI. This guide breaks down the math step by step with interactive visualizations.
We'll use a simple sentence to illustrate: "Crio helps you upskill in tech"
Step 1: Tokenization
First, we split the sentence into tokens. In simple cases, each word becomes a token. Real tokenizers (like BPE) may split words into subwords.
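For our toy sentence, a whitespace split is all the tokenizer we need (a real BPE tokenizer would be more involved); a minimal sketch:

```python
# Toy tokenizer: split on whitespace. Real tokenizers (like BPE)
# would split rare words into subword units instead.
sentence = "Crio helps you upskill in tech"
tokens = sentence.split()
print(tokens)  # ['Crio', 'helps', 'you', 'upskill', 'in', 'tech']
```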
Step 2: Token Embeddings
Each token is converted into a vector of numbers (embedding). In real models, these are hundreds or thousands of dimensions. We'll use 4 dimensions for visualization.
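A minimal sketch of an embedding lookup. The "Crio" vector matches the worked example later in this guide; the other vectors are made-up illustrations, not values from a trained model:

```python
# Hypothetical 4-dimensional embeddings, one vector per token.
# Only "Crio" is taken from the worked example; the rest are illustrative.
embeddings = {
    "Crio":    [0.8, 0.2, 0.5, 0.9],
    "helps":   [0.3, 0.7, 0.1, 0.4],
    "you":     [0.6, 0.1, 0.8, 0.2],
    "upskill": [0.5, 0.9, 0.3, 0.7],
    "in":      [0.1, 0.3, 0.6, 0.2],
    "tech":    [0.9, 0.4, 0.2, 0.6],
}
e_crio = embeddings["Crio"]  # each token is now a vector of 4 numbers
```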
Step 3: The Problem - Why Do We Need Q, K, V?
We now have one embedding per token: [e1, e2, e3, e4, e5, e6]. The goal is to enrich each embedding by letting it gather information from the other words in the sentence.
Consider the word "you" in our sentence. To understand "you" better, we need to figure out:
- WHO should "you" look at? (Crio? helps? upskill?)
- WHAT info should "you" grab from that word?
This is where Q, K, V come in - they split this into three roles:
| Vector | Role | Example for "you" |
|---|---|---|
| Q (Query) | "What am I looking for?" | "I need to know WHO is helping me" |
| K (Key) | "What do I have to offer?" | "Crio says: I'm a company/helper!" |
| V (Value) | "Here's my actual content" | The actual information to grab once matched |
The key insight: Q and K find the match, V provides the content.
Step 4: Computing Q, K, V (Matrix Multiplication)
Each embedding gets transformed into Q, K, and V vectors using learned weight matrices. The operation is simple: matrix multiplication.
Q = embedding × W_Q
K = embedding × W_K
V = embedding × W_V
(the embedding acts as a row vector multiplied into each weight matrix)
Concrete example for "Crio":
embedding("Crio") = [0.8, 0.2, 0.5, 0.9]
W_Q = [[0.5, 0.3, 0.2, 0.4], [0.2, 0.6, 0.3, 0.1], [0.4, 0.2, 0.5, 0.3], [0.3, 0.4, 0.2, 0.6]]
The math (each output = weighted sum of inputs):
Q[0] = 0.8×0.5 + 0.2×0.2 + 0.5×0.4 + 0.9×0.3 = 0.91
Q[1] = 0.8×0.3 + 0.2×0.6 + 0.5×0.2 + 0.9×0.4 = 0.82
Q[2] = 0.8×0.2 + 0.2×0.3 + 0.5×0.5 + 0.9×0.2 = 0.65
Q[3] = 0.8×0.4 + 0.2×0.1 + 0.5×0.3 + 0.9×0.6 = 1.03
Q("Crio") = [0.91, 0.82, 0.65, 1.03]
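The same computation in NumPy, treating the embedding as a row vector multiplied into W_Q, exactly as in the arithmetic above:

```python
import numpy as np

e_crio = np.array([0.8, 0.2, 0.5, 0.9])   # embedding("Crio")
W_Q = np.array([[0.5, 0.3, 0.2, 0.4],
                [0.2, 0.6, 0.3, 0.1],
                [0.4, 0.2, 0.5, 0.3],
                [0.3, 0.4, 0.2, 0.6]])

Q_crio = e_crio @ W_Q                      # row vector × matrix
print(np.round(Q_crio, 2))                 # [0.91 0.82 0.65 1.03]
```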
Step 5: The Key Insight - It's All Just W × E
Notice something? Q, K, and V are computed with the exact same operation:
Q = E × W_Q | K = E × W_K | V = E × W_V
Same math. Different weight matrices. The "magic" is in the learned weights.
| What | Learned? |
|---|---|
| The operation (matrix multiply) | No, fixed |
| The weight matrices (W_Q, W_K, W_V) | Yes! These are learned during training |
Training adjusts W_Q, W_K, W_V via backpropagation so that Q and K produce good matches, and V provides useful content.
Step 6: Computing Attention Scores
Now we find matches! For each word, we compute: how similar is my Q to every other word's K?
The similarity is measured using the dot product. If Q and K point in similar directions → high score → high attention.
score = Q · K / √d
(d is the dimension of the Q and K vectors; dividing by √d keeps the scores in a stable range, which keeps gradients stable during training)
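One scaled dot-product score, using Q("Crio") from Step 4 and a made-up K vector for illustration:

```python
import numpy as np

d = 4                                    # vector dimension
q = np.array([0.91, 0.82, 0.65, 1.03])   # Q("Crio") from Step 4
k = np.array([0.5, 0.4, 0.7, 0.3])       # a hypothetical K vector
score = np.dot(q, k) / np.sqrt(d)        # higher score = better Q/K match
```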
Step 7: Softmax Normalization
Raw scores can be any number. We use softmax to convert them into a probability distribution: non-negative weights that sum to 1.
This answers: "How should I distribute my attention across all words?"
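A minimal softmax over a row of raw scores, showing that the result always sums to 1:

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the result.
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])  # hypothetical raw scores
weights = softmax(scores)
print(weights.sum())  # sums to 1 (up to floating point)
```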
Step 8: Full Attention Matrix
When we compute attention for all tokens, we get a matrix showing how every token attends to every other token.
Step 9: Computing the Output (Weighted Sum of V)
Finally! We use the attention weights to compute a weighted sum of Value vectors.
output = attention_weights × V
This creates a new, contextualized embedding - the original embedding enriched with information from other words it attended to.
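A sketch of that weighted sum for one token, with made-up attention weights and Value vectors:

```python
import numpy as np

# Hypothetical attention weights for "you" over three tokens,
# and one 4-dimensional Value vector per token (illustrative numbers).
weights = np.array([0.7, 0.2, 0.1])      # from softmax, sums to 1
V = np.array([[1.0, 0.0, 0.5, 0.2],      # V("Crio")
              [0.3, 0.8, 0.1, 0.6],      # V("helps")
              [0.2, 0.4, 0.9, 0.1]])     # V("you")

output = weights @ V   # new, contextualized embedding for "you"
```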
Step 10: The Complete Attention Flow
Let's see the entire attention mechanism as a step-by-step animation.
Step 11: From Attention to LLM
Attention transforms embeddings E → E' (contextualized). But how does this become an LLM that can answer questions?
The forward pass:
# Input: "Crio helps you upskill"
E = [e1, e2, e3, e4] # embeddings
Q = E × W_Q
K = E × W_K
V = E × W_V
scores = Q × K^T
weights = softmax(scores / √d)
E' = weights × V # contextualized!
prediction = predict(E') # predict next token: "in"
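The forward pass written out as runnable NumPy. Embeddings and weights are random stand-ins (the real values are learned), and the prediction head is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 4                          # 4 tokens, 4 dimensions
E = rng.random((n, d))               # toy embeddings (random stand-ins)
W_Q = rng.random((d, d))             # learned in a real model
W_K = rng.random((d, d))
W_V = rng.random((d, d))

Q, K, V = E @ W_Q, E @ W_K, E @ W_V  # project into Q, K, V
scores = Q @ K.T / np.sqrt(d)        # scaled pairwise similarities
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
E_ctx = weights @ V                  # contextualized embeddings
```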
A real LLM adds:
- Stack many layers - Repeat this attention block dozens of times (GPT-3 uses 96 layers)
- Multi-head attention - Run 8-96 different Q,K,V sets in parallel
- Feed-forward network - After attention, apply a neural net to each position
- Residual connections - E' = E + Attention(E)
- Positional encoding - Tell the model word ORDER
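A sketch of how the residual connections and feed-forward network compose into one block. Here `attention` and `ffn` are placeholder functions standing in for the real sublayers, and layer normalization is omitted:

```python
import numpy as np

def transformer_block(E, attention, ffn):
    """One simplified transformer block: attention and a feed-forward
    network, each wrapped in a residual connection (layer norm omitted)."""
    E = E + attention(E)  # residual: E' = E + Attention(E)
    E = E + ffn(E)        # residual around the position-wise FFN
    return E

# Zero-output stand-ins make the residual path visible: with no-op
# sublayers, the input flows through the block unchanged.
E = np.ones((4, 4))
out = transformer_block(E, attention=lambda x: 0 * x, ffn=lambda x: 0 * x)
```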
Step 12: How Does Q&A and Reasoning Work?
LLMs don't "understand" questions. They predict the next token based on patterns in training data.
Example:
Your prompt: "What does Crio help you do? Answer:"
Model predicts: "upskill" (because that pattern appeared in training data)
What about reasoning? Chain-of-thought helps:
Direct: Question → Answer (hard jump)
With CoT: Question → Step 1 → Step 2 → Answer (each step in context)
Each intermediate step becomes context for the next prediction. Easier than one giant leap!
📐 The Complete Attention Formula
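Steps 4-9 compress into a single expression, in the same notation used throughout:

Attention(Q, K, V) = softmax(Q × K^T / √d) × V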
TL;DR: Q and K find matches (who should attend to whom), V provides the content to gather. The weight matrices W_Q, W_K, W_V are learned during training. Stack this 96 times with multi-head attention, and you have a transformer!