
AIAYN Part 2: Scaled Dot-Product Attention

The Problem

Tokens need context from other tokens. The word "cat" needs to know about "sat" to understand its role in the sentence.

Without attention: each token processed independently. Meaning lost.

With attention: each token looks at all others, takes what it needs.

The Pattern: Q, K, V

Scaled dot-product attention uses three matrices:

  Q (Query): what the current token is looking for
  K (Key): what each token offers to be matched against
  V (Value): the information carried forward once a match is found

Think: database lookup. Query searches Keys, retrieves Values.

Where Do Q, K, V Come From?

Q, K, V aren't given - they're computed from input embeddings using learned weight matrices.

The Setup

Given input token embeddings X (shape: seq_len × d_model):

Q = X·W^Q
K = X·W^K
V = X·W^V

Where:

  X: the input embeddings, shape seq_len × d_model
  W^Q, W^K: learned weight matrices, shape d_model × d_k
  W^V: learned weight matrix, shape d_model × d_v
  (In the paper, d_model = 512 and d_k = d_v = 64 per head.)
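
In code, the projections are just three matrix multiplies. A minimal NumPy sketch, with random weights standing in for learned ones and dimensions borrowed from the paper:

import numpy as np

seq_len, d_model, d_k = 2, 512, 64
X = np.random.randn(seq_len, d_model)        # input token embeddings
W_Q = np.random.randn(d_model, d_k) * 0.02   # stand-ins for learned weights
W_K = np.random.randn(d_model, d_k) * 0.02
W_V = np.random.randn(d_model, d_k) * 0.02

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # each: (seq_len, d_k)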

How Are W Matrices "Designed"?

They're NOT designed - they're randomly initialized and learned through backpropagation.

import numpy as np

d_model, d_k = 512, 64   # paper dimensions
# He-style initialization: small random values, refined by backpropagation
W_Q = np.random.normal(0, np.sqrt(2 / d_model), size=(d_model, d_k))
W_K = np.random.normal(0, np.sqrt(2 / d_model), size=(d_model, d_k))
W_V = np.random.normal(0, np.sqrt(2 / d_model), size=(d_model, d_k))

During training, each attention head's Q/K/V projections learn to capture specific patterns, such as syntactic dependencies or relationships between distant tokens.

Multi-Head Attention

Different heads learn different patterns. Multi-head attention = multiple sets of Q, K, V matrices.

Single head: one set of W^Q, W^K, W^V; attention is computed once over the full representation.

Multi-head (h = 8 in the paper): eight independent sets of projections, each working in a smaller d_k = d_v = d_model/h = 64 subspace; the eight outputs are concatenated and projected back to d_model.
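
A shape-level sketch of that idea in NumPy (my own illustration with assumed names; Part 3 covers multi-head attention properly):

import numpy as np

h, d_model, seq_len = 8, 512, 10
d_k = d_model // h                                  # 64 dims per head
X = np.random.randn(seq_len, d_model)

head_outputs = []
for _ in range(h):                                  # one projection triple per head
    W_Q, W_K, W_V = [np.random.randn(d_model, d_k) * 0.02 for _ in range(3)]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    head_outputs.append(weights @ V)                # (seq_len, 64)

W_O = np.random.randn(d_model, d_model) * 0.02      # final output projection
out = np.concatenate(head_outputs, axis=-1) @ W_O   # back to (seq_len, 512)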

The Algorithm

Four steps to compute attention:

Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V
  1. scores = Q·K^T → Compute similarity between queries and keys
  2. scaled = scores / sqrt(d_k) → Prevent large values from dominating
  3. weights = softmax(scaled) → Convert to probabilities (per row)
  4. output = weights·V → Weighted mix of values
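
Put together, a minimal NumPy version of these four steps (function and variable names are mine, not the paper's):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # steps 1-2: similarity, scaled
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # step 3: row-wise softmax
    return weights @ V                              # step 4: weighted mix of values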

Why scale by sqrt(d_k)?

Dot products grow with dimension. If the components of q and k are independent with mean 0 and variance 1, the dot product q·k has variance d_k. For d_k = 512 that means a standard deviation of about 22.6, which pushes softmax into a saturated region where gradients are vanishingly small. Dividing by sqrt(d_k) brings the variance back to 1.
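
A quick simulation (my own check, not code from the paper) makes the effect visible:

import numpy as np

d_k = 512
q = np.random.randn(100_000, d_k)       # components: mean 0, variance 1
k = np.random.randn(100_000, d_k)
dots = (q * k).sum(axis=1)

print(dots.std())                       # ≈ sqrt(512) ≈ 22.6
print((dots / np.sqrt(d_k)).std())      # ≈ 1.0 after scaling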

Example: "cat sat"

Two tokens, d_k = 4 and d_v = 2 (tiny for clarity).

Step 0: Setup

Tokens: ["cat", "sat"]

Q = [[1, 0, 1, 0],    ← cat's query
     [0, 1, 0, 1]]    ← sat's query

K = [[1, 0, 1, 0],    ← cat's key
     [0, 1, 0, 1]]    ← sat's key

V = [[2, 3],          ← cat's value
     [5, 7]]          ← sat's value

Step 1: scores = Q·K^T

Q·K^T = [[1,0,1,0],   [[1,0],
         [0,1,0,1]] ·  [0,1],
                       [1,0],
                       [0,1]]

     = [[1·1 + 0·0 + 1·1 + 0·0,  1·0 + 0·1 + 1·0 + 0·1],
        [0·1 + 1·0 + 0·1 + 1·0,  0·0 + 1·1 + 0·0 + 1·1]]

scores = [[2, 0],     ← cat: matches cat strongly, sat not at all
          [0, 2]]     ← sat: matches sat strongly, cat not at all

Step 2: scaled = scores / sqrt(4) = scores / 2

scaled = [[1.0, 0.0],
          [0.0, 1.0]]

Step 3: weights = softmax(scaled)

For row 0: [1.0, 0.0]
  e^1.0 = 2.72, e^0.0 = 1.00
  sum = 3.72
  weights[0] = [2.72/3.72, 1.00/3.72] = [0.73, 0.27]

For row 1: [0.0, 1.0]
  weights[1] = [0.27, 0.73]

weights = [[0.73, 0.27],    ← cat: 73% self, 27% sat
           [0.27, 0.73]]    ← sat: 27% cat, 73% self

Step 4: output = weights·V

output = [[0.73, 0.27],   [[2, 3],
          [0.27, 0.73]] ·  [5, 7]]

       = [[0.73·2 + 0.27·5,  0.73·3 + 0.27·7],
          [0.27·2 + 0.73·5,  0.27·3 + 0.73·7]]

output = [[2.81, 4.08],    ← cat enriched with sat's context
          [4.19, 5.92]]    ← sat enriched with cat's context
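
For anyone who wants to double-check the arithmetic, the same numbers fall out of a few lines of NumPy:

import numpy as np

Q = K = np.array([[1., 0., 1., 0.],
                  [0., 1., 0., 1.]])
V = np.array([[2., 3.],
              [5., 7.]])

scores = Q @ K.T / np.sqrt(Q.shape[-1])
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

print(weights.round(2))        # [[0.73 0.27]
                               #  [0.27 0.73]]
print((weights @ V).round(2))  # [[2.81 4.08]
                               #  [4.19 5.92]]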

The Result

Each token gets a context-enriched representation: cat's output [2.81, 4.08] blends its own value with a share of sat's, and sat's output [4.19, 5.92] blends in a share of cat's.

The attention weights themselves are computed fresh for every input; it's the learned W^Q, W^K, W^V projections that determine what context matters.

Key Takeaways

  Q, K, V aren't inputs; they're computed from the embeddings with learned weight matrices.
  Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V: similarity, scaling, softmax, weighted mix of values.
  Scaling by sqrt(d_k) keeps dot products small enough that softmax doesn't saturate.
  Each output row is the token's own value blended with context from every other token.

Next up in Part 3: multi-head attention, where several heads run in parallel, each with its own Q, K, V projections.
