The Problem
Tokens need context from other tokens. The word "cat" needs to know about "sat" to understand its role in the sentence.
Without attention: each token processed independently. Meaning lost.
With attention: each token looks at all others, takes what it needs.
The Pattern: Q, K, V
Scaled dot-product attention uses three matrices:
- Q (Query): "What pattern am I looking for?"
- K (Key): "What pattern do I match?"
- V (Value): "What information do I carry?"
Think: database lookup. Query searches Keys, retrieves Values.
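The analogy can be made concrete: a dict is a hard lookup (one key wins outright), while attention is a soft lookup (every value contributes, weighted by how well its key matches). A toy sketch with made-up weights:

```python
# hard lookup: the query matches exactly one key, returns exactly one value
table = {"cat": [2, 3], "sat": [5, 7]}
hard = table["cat"]                        # -> [2, 3]

# soft lookup: every value contributes, weighted by key match strength
match = {"cat": 0.73, "sat": 0.27}         # illustrative attention weights
soft = [sum(w * v[i] for w, v in zip(match.values(), table.values()))
        for i in range(2)]                 # -> [2.81, 4.08]
```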
Where Do Q, K, V Come From?
Q, K, V aren't given - they're computed from input embeddings using learned weight matrices.
The Setup
Given input token embeddings X (shape: seq_len × d_model):
Q = X·W^Q
K = X·W^K
V = X·W^V
Where:
- X: Token embeddings (seq_len × d_model)
- Each row = one token's vector representation
- Example: "The cat sat" → 3 rows of 512-dimensional vectors
- W matrices: Linear projection matrices (the core learnable parameters)
- W^Q: d_model × d_k (e.g., 512 × 64)
- W^K: d_model × d_k (e.g., 512 × 64)
- W^V: d_model × d_v (e.g., 512 × 64)
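In code, the three projections are just matrix multiplies (a minimal NumPy sketch using the shapes above; the embedding values and weight init are random placeholders):

```python
import numpy as np

seq_len, d_model, d_k, d_v = 3, 512, 64, 64   # "The cat sat" -> 3 tokens

X = np.random.randn(seq_len, d_model)          # token embeddings, one row per token
W_Q = np.random.randn(d_model, d_k) * 0.02     # placeholder init; learned in practice
W_K = np.random.randn(d_model, d_k) * 0.02
W_V = np.random.randn(d_model, d_v) * 0.02

Q = X @ W_Q   # (3, 64): one query per token
K = X @ W_K   # (3, 64): one key per token
V = X @ W_V   # (3, 64): one value per token
```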
How Are W Matrices "Designed"?
They're NOT designed - they're randomly initialized and learned through backpropagation.
# Initialization (He-style: std = sqrt(2 / fan_in)); learned from there via backprop
W_Q = np.random.normal(0, np.sqrt(2 / d_model), size=(d_model, d_k))
W_K = np.random.normal(0, np.sqrt(2 / d_model), size=(d_model, d_k))
W_V = np.random.normal(0, np.sqrt(2 / d_model), size=(d_model, d_v))
During training, a particular attention head's Q/K/V learns to capture specific patterns:
- Q: queries for verbs given a subject
- K: advertises verb-ness
- V: propagates verb information
Multi-Head Attention
Different heads learn different patterns. Multi-head attention = multiple sets of Q, K, V matrices.
Single head:
- One W^Q, one W^K, one W^V
- Learns one pattern (e.g., subject-verb relations)
Multi-head (h=8 in the original Transformer paper):
- Head 1: W₁^Q, W₁^K, W₁^V → might learn subject-verb
- Head 2: W₂^Q, W₂^K, W₂^V → might learn positional patterns
- Head 3: W₃^Q, W₃^K, W₃^V → might learn semantic similarity
- ... (8 heads total)
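A shape-level sketch of the multiple weight sets (NumPy; dimensions illustrative, and the concatenation and output projection that follow are left for Part 3):

```python
import numpy as np

d_model, h = 512, 8
d_k = d_model // h                 # 64 dims per head, as in the paper
seq_len = 3                        # "The cat sat"

X = np.random.randn(seq_len, d_model)

# one independent (W^Q, W^K, W^V) triple per head
heads = [
    tuple(np.random.randn(d_model, d_k) * np.sqrt(2 / d_model)
          for _ in range(3))
    for _ in range(h)
]

# each head projects the same X into its own 64-dim subspace
per_head_qkv = [(X @ W_Q, X @ W_K, X @ W_V) for (W_Q, W_K, W_V) in heads]
```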
The Algorithm
Four steps to compute attention:
- scores = Q·K^T → Compute similarity between queries and keys
- scaled = scores / sqrt(d_k) → Prevent large values from dominating
- weights = softmax(scaled) → Convert to probabilities (per row)
- output = weights·V → Weighted mix of values
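The four steps above fit in one small function (a minimal NumPy sketch, not a library implementation):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q·K^T / sqrt(d_k)) · V, applied row-wise over queries."""
    d_k = Q.shape[-1]
    scores = Q @ K.T                                  # 1. query-key similarity
    scaled = scores / np.sqrt(d_k)                    # 2. keep magnitudes moderate
    exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)   # 3. per-row softmax
    return weights @ V                                # 4. weighted mix of values
```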
Why scale by sqrt(d_k)?
Dot products of random vectors grow with dimension - their typical magnitude scales with sqrt(d_k). For d_k = 512:
- Unscaled: dot products routinely reach magnitudes of 20-100+
- Softmax saturates: weights like [0.0001, 0.9999] → vanishing gradients
- Scaled: divide by sqrt(512) ≈ 22.6 → values back near unit scale
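The saturation effect is easy to demonstrate (a minimal sketch; the score of 30 stands in for a typical unscaled dot-product magnitude at d_k = 512):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

d_k = 512
scores = np.array([30.0, 0.0])           # a plausible unscaled score gap

print(softmax(scores))                   # ~[1.0, 0.0]: saturated, gradients vanish
print(softmax(scores / np.sqrt(d_k)))    # ~[0.79, 0.21]: still a soft, trainable mix
```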
Example: "cat sat"
Two tokens, d_k = 4 (tiny for clarity).
Step 0: Setup
Tokens: ["cat", "sat"]
Q = [[1, 0, 1, 0],   ← cat's query
     [0, 1, 0, 1]]   ← sat's query

K = [[1, 0, 1, 0],   ← cat's key
     [0, 1, 0, 1]]   ← sat's key

V = [[2, 3],   ← cat's value
     [5, 7]]   ← sat's value
Step 1: scores = Q·K^T
Q·K^T = [[1, 0, 1, 0],   [[1, 0],
         [0, 1, 0, 1]] ·  [0, 1],
                          [1, 0],
                          [0, 1]]

       = [[1·1 + 0·0 + 1·1 + 0·0,  1·0 + 0·1 + 1·0 + 0·1],
          [0·1 + 1·0 + 0·1 + 1·0,  0·0 + 1·1 + 0·0 + 1·1]]

scores = [[2, 0],   ← cat: matches cat strongly, sat not at all
          [0, 2]]   ← sat: matches sat strongly, cat not at all
Step 2: scaled = scores / sqrt(4) = scores / 2
scaled = [[1.0, 0.0],
          [0.0, 1.0]]
Step 3: weights = softmax(scaled)
For row 0: [1.0, 0.0]
  e^1.0 = 2.72, e^0.0 = 1.00
  sum = 3.72
  weights[0] = [2.72/3.72, 1.00/3.72] = [0.73, 0.27]

For row 1: [0.0, 1.0] (by symmetry)
  weights[1] = [0.27, 0.73]

weights = [[0.73, 0.27],   ← cat: 73% self, 27% sat
           [0.27, 0.73]]   ← sat: 27% cat, 73% self
Step 4: output = weights·V
output = [[0.73, 0.27],   [[2, 3],
          [0.27, 0.73]] ·  [5, 7]]

       = [[0.73·2 + 0.27·5,  0.73·3 + 0.27·7],
          [0.27·2 + 0.73·5,  0.27·3 + 0.73·7]]

output = [[2.81, 4.08],   ← cat enriched with sat's context
          [4.19, 5.92]]   ← sat enriched with cat's context
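The worked example can be checked end to end in a few lines of NumPy, matching the numbers above up to rounding:

```python
import numpy as np

Q = np.array([[1., 0., 1., 0.],    # cat's query
              [0., 1., 0., 1.]])   # sat's query
K = Q.copy()                       # keys equal queries in this toy setup
V = np.array([[2., 3.],            # cat's value
              [5., 7.]])           # sat's value

scores = Q @ K.T                                   # [[2, 0], [0, 2]]
scaled = scores / np.sqrt(Q.shape[-1])             # divide by sqrt(4) = 2
exp = np.exp(scaled)
weights = exp / exp.sum(axis=-1, keepdims=True)    # ~[[0.73, 0.27], [0.27, 0.73]]
output = weights @ V                               # rounds to [[2.81, 4.08], [4.19, 5.92]]
```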
The Result
Each token gets a context-enriched representation:
- Cat's output: mostly its own value [2,3], mixed with sat's [5,7]
- Sat's output: mostly its own value [5,7], mixed with cat's [2,3]
The W matrices learned during training determine what context matters; the attention weights themselves are recomputed fresh for every input.
Key Takeaways
- Q, K, V: query what you need, match keys, retrieve values
- Dot product measures similarity
- Scaling prevents softmax saturation
- Output: weighted combination of all token values
- Each token decides what context to pull from others
Next: Part 3
Multi-head attention:
- Why one attention head isn't enough
- Parallel attention subspaces
- How heads learn different patterns