The Problem
Tokens need context from other tokens. The word "cat" needs to know about "sat" to understand its role in the sentence.
Without attention: each token processed independently. Meaning lost.
With attention: each token looks at all others, takes what it needs.
The Pattern: Q, K, V
Scaled dot-product attention uses three matrices:
- Q (Query): "What pattern am I looking for?"
- K (Key): "What pattern do I match?"
- V (Value): "What information do I carry?"
Think: database lookup. Query searches Keys, retrieves Values.
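The analogy can be made concrete: a dict is a hard lookup (one key wins outright), while attention is a soft lookup (every value contributes, weighted by how well its key matches). A toy sketch with made-up weights:

```python
# hard lookup: the query matches exactly one key, returns exactly one value
table = {"cat": [2, 3], "sat": [5, 7]}
hard = table["cat"]                        # -> [2, 3]

# soft lookup: every value contributes, weighted by key match strength
match = {"cat": 0.73, "sat": 0.27}         # illustrative attention weights
soft = [sum(w * v[i] for w, v in zip(match.values(), table.values()))
        for i in range(2)]                 # -> [2.81, 4.08]
```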
Where Do Q, K, V Come From?
Q, K, V aren't given - they're computed from input embeddings using learned weight matrices.
The Setup
Given input token embeddings X (shape: seq_len × d_model):
Q = X·W^Q
K = X·W^K
V = X·W^V
Where:
- X: Token embeddings (seq_len × d_model)
- Each row = one token's vector representation
- Example: "The cat sat" → 3 rows of 512-dimensional vectors
- W matrices: Linear projection matrices (the core learnable parameters)
- W^Q: d_model × d_k (e.g., 512 × 64)
- W^K: d_model × d_k (e.g., 512 × 64)
- W^V: d_model × d_v (e.g., 512 × 64)
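In code, the three projections are just matrix multiplies (a minimal NumPy sketch using the shapes above; the embedding values and weight init are random placeholders):

```python
import numpy as np

seq_len, d_model, d_k, d_v = 3, 512, 64, 64   # "The cat sat" -> 3 tokens

X = np.random.randn(seq_len, d_model)          # token embeddings, one row per token
W_Q = np.random.randn(d_model, d_k) * 0.02     # placeholder init; learned in practice
W_K = np.random.randn(d_model, d_k) * 0.02
W_V = np.random.randn(d_model, d_v) * 0.02

Q = X @ W_Q   # (3, 64): one query per token
K = X @ W_K   # (3, 64): one key per token
V = X @ W_V   # (3, 64): one value per token
```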
How Are W Matrices "Designed"?
They're NOT designed - they're randomly initialized and learned through backpropagation.
# Initialization (He-style: std = sqrt(2 / fan_in)); learned from there via backprop
W_Q = np.random.normal(0, np.sqrt(2 / d_model), size=(d_model, d_k))
W_K = np.random.normal(0, np.sqrt(2 / d_model), size=(d_model, d_k))
W_V = np.random.normal(0, np.sqrt(2 / d_model), size=(d_model, d_v))
During training, a particular attention head's Q/K/V learns to capture specific patterns:
- Q: queries for verbs given a subject
- K: advertises verb-ness
- V: propagates verb information
Multi-Head Attention
Different heads learn different patterns. Multi-head attention = multiple sets of Q, K, V matrices.
Single head:
- One W^Q, one W^K, one W^V
- Learns one pattern (e.g., subject-verb relations)
Multi-head (h=8 in the original Transformer paper):
- Head 1: W₁^Q, W₁^K, W₁^V → might learn subject-verb
- Head 2: W₂^Q, W₂^K, W₂^V → might learn positional patterns
- Head 3: W₃^Q, W₃^K, W₃^V → might learn semantic similarity
- ... (8 heads total)
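A shape-level sketch of the multiple weight sets (NumPy; dimensions illustrative, and the concatenation and output projection that follow are left for Part 3):

```python
import numpy as np

d_model, h = 512, 8
d_k = d_model // h                 # 64 dims per head, as in the paper
seq_len = 3                        # "The cat sat"

X = np.random.randn(seq_len, d_model)

# one independent (W^Q, W^K, W^V) triple per head
heads = [
    tuple(np.random.randn(d_model, d_k) * np.sqrt(2 / d_model)
          for _ in range(3))
    for _ in range(h)
]

# each head projects the same X into its own 64-dim subspace
per_head_qkv = [(X @ W_Q, X @ W_K, X @ W_V) for (W_Q, W_K, W_V) in heads]
```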
The Algorithm
Four steps to compute attention:
- scores = Q·K^T → Compute similarity between queries and keys
- scaled = scores / sqrt(d_k) → Prevent large values from dominating
- weights = softmax(scaled) → Convert to probabilities (per row)
- output = weights·V → Weighted mix of values
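The four steps above fit in one small function (a minimal NumPy sketch, not a library implementation):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q·K^T / sqrt(d_k)) · V, applied row-wise over queries."""
    d_k = Q.shape[-1]
    scores = Q @ K.T                                  # 1. query-key similarity
    scaled = scores / np.sqrt(d_k)                    # 2. keep magnitudes moderate
    exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)   # 3. per-row softmax
    return weights @ V                                # 4. weighted mix of values
```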
Why scale by sqrt(d_k)?
Dot products of random vectors grow with dimension - their typical magnitude scales with sqrt(d_k). For d_k = 512:
- Unscaled: dot products routinely reach magnitudes of 20-100+
- Softmax saturates: weights like [0.0001, 0.9999] → vanishing gradients
- Scaled: divide by sqrt(512) ≈ 22.6 → values back near unit scale
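The saturation effect is easy to demonstrate (a minimal sketch; the score of 30 stands in for a typical unscaled dot-product magnitude at d_k = 512):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

d_k = 512
scores = np.array([30.0, 0.0])           # a plausible unscaled score gap

print(softmax(scores))                   # ~[1.0, 0.0]: saturated, gradients vanish
print(softmax(scores / np.sqrt(d_k)))    # ~[0.79, 0.21]: still a soft, trainable mix
```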
Example: "cat sat"
Two tokens, d_k = 4 (tiny for clarity).
Step 0: Setup
Tokens: ["cat", "sat"]
Q = [[1, 0, 1, 0],   ← cat's query
     [0, 1, 0, 1]]   ← sat's query

K = [[1, 0, 1, 0],   ← cat's key
     [0, 1, 0, 1]]   ← sat's key

V = [[2, 3],   ← cat's value
     [5, 7]]   ← sat's value
Step 1: scores = Q·K^T
Q·K^T = [[1, 0, 1, 0],   [[1, 0],
         [0, 1, 0, 1]] ·  [0, 1],
                          [1, 0],
                          [0, 1]]

       = [[1·1 + 0·0 + 1·1 + 0·0,  1·0 + 0·1 + 1·0 + 0·1],
          [0·1 + 1·0 + 0·1 + 1·0,  0·0 + 1·1 + 0·0 + 1·1]]

scores = [[2, 0],   ← cat: matches cat strongly, sat not at all
          [0, 2]]   ← sat: matches sat strongly, cat not at all
Step 2: scaled = scores / sqrt(4) = scores / 2
scaled = [[1.0, 0.0],
          [0.0, 1.0]]
Step 3: weights = softmax(scaled)
For row 0: [1.0, 0.0]
  e^1.0 = 2.72, e^0.0 = 1.00
  sum = 3.72
  weights[0] = [2.72/3.72, 1.00/3.72] = [0.73, 0.27]

For row 1: [0.0, 1.0] (by symmetry)
  weights[1] = [0.27, 0.73]

weights = [[0.73, 0.27],   ← cat: 73% self, 27% sat
           [0.27, 0.73]]   ← sat: 27% cat, 73% self
Step 4: output = weights·V
output = [[0.73, 0.27],   [[2, 3],
          [0.27, 0.73]] ·  [5, 7]]

       = [[0.73·2 + 0.27·5,  0.73·3 + 0.27·7],
          [0.27·2 + 0.73·5,  0.27·3 + 0.73·7]]

output = [[2.81, 4.08],   ← cat enriched with sat's context
          [4.19, 5.92]]   ← sat enriched with cat's context
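The worked example can be checked end to end in a few lines of NumPy, matching the numbers above up to rounding:

```python
import numpy as np

Q = np.array([[1., 0., 1., 0.],    # cat's query
              [0., 1., 0., 1.]])   # sat's query
K = Q.copy()                       # keys equal queries in this toy setup
V = np.array([[2., 3.],            # cat's value
              [5., 7.]])           # sat's value

scores = Q @ K.T                                   # [[2, 0], [0, 2]]
scaled = scores / np.sqrt(Q.shape[-1])             # divide by sqrt(4) = 2
exp = np.exp(scaled)
weights = exp / exp.sum(axis=-1, keepdims=True)    # ~[[0.73, 0.27], [0.27, 0.73]]
output = weights @ V                               # rounds to [[2.81, 4.08], [4.19, 5.92]]
```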
The Result
Each token gets a context-enriched representation:
- Cat's output: mostly its own value [2,3], mixed with sat's [5,7]
- Sat's output: mostly its own value [5,7], mixed with cat's [2,3]
The W matrices learned during training determine what context matters; the attention weights themselves are recomputed fresh for every input.
Key Takeaways
- Q, K, V: query what you need, match keys, retrieve values
- Dot product measures similarity
- Scaling prevents softmax saturation
- Output: weighted combination of all token values
- Each token decides what context to pull from others
Next: Part 3
Multi-head attention:
- Why one attention head isn't enough
- Parallel attention subspaces
- How heads learn different patterns