Understanding Attention from Scratch
The attention mechanism is the core innovation behind transformers - the architecture powering GPT, BERT, and modern AI. This guide breaks down the math step by step with interactive visualizations.
We'll use a simple sentence to illustrate: "Crio helps you upskill in tech"
Step 1: Tokenization
First, we split the sentence into tokens. In simple cases, each word becomes a token. Real tokenizers (like BPE) may split words into subwords.
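For our toy sentence, a whitespace split is all the tokenizer we need (a real BPE tokenizer would be more involved); a minimal sketch:

```python
# Toy tokenizer: split on whitespace. Real tokenizers (like BPE)
# would split rare words into subword units instead.
sentence = "Crio helps you upskill in tech"
tokens = sentence.split()
print(tokens)  # ['Crio', 'helps', 'you', 'upskill', 'in', 'tech']
```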
Step 2: Token Embeddings
Each token is converted into a vector of numbers (embedding). In real models, these are hundreds or thousands of dimensions. We'll use 4 dimensions for visualization.
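A minimal sketch of an embedding lookup. The "Crio" vector matches the worked example later in this guide; the other vectors are made-up illustrations, not values from a trained model:

```python
# Hypothetical 4-dimensional embeddings, one vector per token.
# Only "Crio" is taken from the worked example; the rest are illustrative.
embeddings = {
    "Crio":    [0.8, 0.2, 0.5, 0.9],
    "helps":   [0.3, 0.7, 0.1, 0.4],
    "you":     [0.6, 0.1, 0.8, 0.2],
    "upskill": [0.5, 0.9, 0.3, 0.7],
    "in":      [0.1, 0.3, 0.6, 0.2],
    "tech":    [0.9, 0.4, 0.2, 0.6],
}
e_crio = embeddings["Crio"]  # each token is now a vector of 4 numbers
```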
Step 3: The Problem - Why Do We Need Q, K, V?
We now have one embedding per token: [e1, e2, e3, e4, e5, e6]. The goal is to enrich each embedding by letting it gather information from the other words in the sentence.
Consider the word "you" in our sentence. To understand "you" better, we need to figure out:
- WHO should "you" look at? (Crio? helps? upskill?)
- WHAT info should "you" grab from that word?
This is where Q, K, V come in - they split this into three roles:
| Vector | Role | Example for "you" |
|---|---|---|
| Q (Query) | "What am I looking for?" | "I need to know WHO is helping me" |
| K (Key) | "What do I have to offer?" | "Crio says: I'm a company/helper!" |
| V (Value) | "Here's my actual content" | The actual information to grab once matched |
The key insight: Q and K find the match, V provides the content.
Step 4: Computing Q, K, V (Matrix Multiplication)
Each embedding gets transformed into Q, K, and V vectors using learned weight matrices. The operation is simple: matrix multiplication.
Q = embedding × W_Q
K = embedding × W_K
V = embedding × W_V
(the embedding acts as a row vector multiplied into each weight matrix)
Concrete example for "Crio":
embedding("Crio") = [0.8, 0.2, 0.5, 0.9]
W_Q = [[0.5, 0.3, 0.2, 0.4], [0.2, 0.6, 0.3, 0.1], [0.4, 0.2, 0.5, 0.3], [0.3, 0.4, 0.2, 0.6]]
The math (each output = weighted sum of inputs):
Q[0] = 0.8×0.5 + 0.2×0.2 + 0.5×0.4 + 0.9×0.3 = 0.91
Q[1] = 0.8×0.3 + 0.2×0.6 + 0.5×0.2 + 0.9×0.4 = 0.82
Q[2] = 0.8×0.2 + 0.2×0.3 + 0.5×0.5 + 0.9×0.2 = 0.65
Q[3] = 0.8×0.4 + 0.2×0.1 + 0.5×0.3 + 0.9×0.6 = 1.03
Q("Crio") = [0.91, 0.82, 0.65, 1.03]
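The same computation in NumPy, treating the embedding as a row vector multiplied into W_Q, exactly as in the arithmetic above:

```python
import numpy as np

e_crio = np.array([0.8, 0.2, 0.5, 0.9])   # embedding("Crio")
W_Q = np.array([[0.5, 0.3, 0.2, 0.4],
                [0.2, 0.6, 0.3, 0.1],
                [0.4, 0.2, 0.5, 0.3],
                [0.3, 0.4, 0.2, 0.6]])

Q_crio = e_crio @ W_Q                      # row vector × matrix
print(np.round(Q_crio, 2))                 # [0.91 0.82 0.65 1.03]
```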
Step 5: The Key Insight - It's All Just W × E
Notice something? Q, K, and V are computed with the exact same operation:
Q = E × W_Q | K = E × W_K | V = E × W_V
Same math. Different weight matrices. The "magic" is in the learned weights.
| What | Learned? |
|---|---|
| The operation (matrix multiply) | No, fixed |
| The weight matrices (W_Q, W_K, W_V) | Yes! These are learned during training |
Training adjusts W_Q, W_K, W_V via backpropagation so that Q and K produce good matches, and V provides useful content.
Step 6: Computing Attention Scores
Now we find matches! For each word, we compute: how similar is my Q to every other word's K?
The similarity is measured using the dot product. If Q and K point in similar directions → high score → high attention.
score = Q · K / √d
(d is the dimension of the Q and K vectors; dividing by √d keeps the scores in a stable range, which keeps gradients stable during training)
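One scaled dot-product score, using Q("Crio") from Step 4 and a made-up K vector for illustration:

```python
import numpy as np

d = 4                                    # vector dimension
q = np.array([0.91, 0.82, 0.65, 1.03])   # Q("Crio") from Step 4
k = np.array([0.5, 0.4, 0.7, 0.3])       # a hypothetical K vector
score = np.dot(q, k) / np.sqrt(d)        # higher score = better Q/K match
```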
Step 7: Softmax Normalization
Raw scores can be any number. We use softmax to convert them into a probability distribution: non-negative weights that sum to 1.
This answers: "How should I distribute my attention across all words?"
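A minimal softmax over a row of raw scores, showing that the result always sums to 1:

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the result.
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])  # hypothetical raw scores
weights = softmax(scores)
print(weights.sum())  # sums to 1 (up to floating point)
```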
Step 8: Full Attention Matrix
When we compute attention for all tokens, we get a matrix showing how every token attends to every other token.
Step 9: Computing the Output (Weighted Sum of V)
Finally! We use the attention weights to compute a weighted sum of Value vectors.
output = attention_weights × V
This creates a new, contextualized embedding - the original embedding enriched with information from other words it attended to.
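A sketch of that weighted sum for one token, with made-up attention weights and Value vectors:

```python
import numpy as np

# Hypothetical attention weights for "you" over three tokens,
# and one 4-dimensional Value vector per token (illustrative numbers).
weights = np.array([0.7, 0.2, 0.1])      # from softmax, sums to 1
V = np.array([[1.0, 0.0, 0.5, 0.2],      # V("Crio")
              [0.3, 0.8, 0.1, 0.6],      # V("helps")
              [0.2, 0.4, 0.9, 0.1]])     # V("you")

output = weights @ V   # new, contextualized embedding for "you"
```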
Step 10: The Complete Attention Flow
Let's see the entire attention mechanism as a step-by-step animation.
Step 11: From Attention to LLM
Attention transforms embeddings E → E' (contextualized). But how does this become an LLM that can answer questions?
The forward pass:
# Input: "Crio helps you upskill"
E = [e1, e2, e3, e4] # embeddings
Q = E × W_Q
K = E × W_K
V = E × W_V
scores = Q × K^T
weights = softmax(scores / √d)
E' = weights × V # contextualized!
prediction = predict(E') # predict next token: "in"
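The forward pass written out as runnable NumPy. Embeddings and weights are random stand-ins (the real values are learned), and the prediction head is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 4                          # 4 tokens, 4 dimensions
E = rng.random((n, d))               # toy embeddings (random stand-ins)
W_Q = rng.random((d, d))             # learned in a real model
W_K = rng.random((d, d))
W_V = rng.random((d, d))

Q, K, V = E @ W_Q, E @ W_K, E @ W_V  # project into Q, K, V
scores = Q @ K.T / np.sqrt(d)        # scaled pairwise similarities
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
E_ctx = weights @ V                  # contextualized embeddings
```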
A real LLM adds:
- Stack many layers - Repeat this attention block dozens of times (GPT-3 uses 96 layers)
- Multi-head attention - Run 8-96 different Q,K,V sets in parallel
- Feed-forward network - After attention, apply a neural net to each position
- Residual connections - E' = E + Attention(E)
- Positional encoding - Tell the model word ORDER
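A sketch of how the residual connections and feed-forward network compose into one block. Here `attention` and `ffn` are placeholder functions standing in for the real sublayers, and layer normalization is omitted:

```python
import numpy as np

def transformer_block(E, attention, ffn):
    """One simplified transformer block: attention and a feed-forward
    network, each wrapped in a residual connection (layer norm omitted)."""
    E = E + attention(E)  # residual: E' = E + Attention(E)
    E = E + ffn(E)        # residual around the position-wise FFN
    return E

# Zero-output stand-ins make the residual path visible: with no-op
# sublayers, the input flows through the block unchanged.
E = np.ones((4, 4))
out = transformer_block(E, attention=lambda x: 0 * x, ffn=lambda x: 0 * x)
```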
Step 12: How Does Q&A and Reasoning Work?
LLMs don't "understand" questions. They predict the next token based on patterns in training data.
Example:
Your prompt: "What does Crio help you do? Answer:"
Model predicts: "upskill" (because that pattern appeared in training data)
What about reasoning? Chain-of-thought helps:
Direct: Question → Answer (hard jump)
With CoT: Question → Step 1 → Step 2 → Answer (each step in context)
Each intermediate step becomes context for the next prediction. Easier than one giant leap!
📐 The Complete Attention Formula
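Steps 4-9 compress into a single expression, in the same notation used throughout:

Attention(Q, K, V) = softmax(Q × K^T / √d) × V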
TL;DR: Q and K find matches (who should attend to whom), V provides the content to gather. The weight matrices W_Q, W_K, W_V are learned during training. Stack this 96 times with multi-head attention, and you have a transformer!