"Attention Is All You Need" - the paper title wasn't hype, it was accurate. RNNs are done.
The Sequential Trap
Seq2seq modeling ran on recurrent networks - LSTMs, GRUs - with the occasional convolutional variant. The recurrent core was strictly sequential: each token waited for the previous one, h_t depended on h_{t-1}, and you couldn't parallelize within a sequence.
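Here's a minimal sketch of that recurrence, in plain NumPy with a toy tanh cell (names and sizes are illustrative, not from any particular paper). The point is the loop: every step reads the hidden state produced by the one before it, so there is nothing to parallelize across time.

```python
import numpy as np

def rnn_forward(x, W_h, W_x, b):
    """Vanilla tanh RNN over a sequence, one step at a time.

    x: (seq_len, d_in) inputs; W_h: (d_h, d_h); W_x: (d_in, d_h); b: (d_h,)
    """
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:                                 # strictly sequential loop
        h = np.tanh(h @ W_h + x_t @ W_x + b)      # h_t depends on h_{t-1}
        states.append(h)
    return np.stack(states)                       # (seq_len, d_h)
```

Adding more GPUs doesn't help here: step t can't start until step t-1 finishes.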
Encoder-decoder architectures used RNNs for everything: building representations (encoding) and modeling dependencies (decoding). Attention was just a helper mechanism bolted on top to connect the two.
The Bottleneck
Long sequences killed performance. For token 1 to influence token 100, its signal had to survive 99 recurrent hops, degrading at each one. And a 512-token sequence meant 512 sequential steps, no matter how much hardware you threw at it.
Attention helped - let decoders peek at encoder states directly - but RNNs still did the heavy lifting. The sequential chain remained.
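Roughly what that helper looked like, sketched with dot-product scores for brevity (the early mechanisms used a small learned scoring network, but the shape of the idea is the same): one decoder state gets a direct, one-hop view of every encoder state. The encoder states themselves, though, still came out of the sequential loop above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Weighted sum of encoder states, weighted by similarity to the decoder state.

    decoder_state: (d,); encoder_states: (src_len, d)
    """
    scores = encoder_states @ decoder_state    # one score per source token
    weights = softmax(scores)                  # attention distribution
    return weights @ encoder_states            # (d,) context vector
```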
The Insight
What if attention wasn't a helper? What if it was everything?
Self-attention: all tokens attend to all tokens in parallel. No sequential dependency chain. Full GPU utilization. Transformers handle both representation building and dependency modeling without a single RNN.
How Self-Attention Works
At the core of self-attention are learned weight matrices that transform input tokens:
- X = input token representations (d_model-dimensional)
- W_q, W_k, W_v = learned weight matrices
These matrices create three representations:
- Q = XW_q (queries: what I'm looking for)
- K = XW_k (keys: what I contain)
- V = XW_v (values: what I'll pass forward)
What's d_model? It's the width of the embedding space - the size of the representation carried through the transformer. Every token, at every layer, keeps this same dimensionality. In the original paper, d_model = 512.
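Putting the pieces together: the paper combines them as scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, where d_k is the key dimension (64 per head in the original setup). Below is a minimal single-head NumPy sketch with toy sizes and random matrices standing in for learned weights; the real model adds multiple heads, masking, residuals, and output projections.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: every token attends to every token in one shot.

    X: (seq_len, d_model) token representations
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len): all pairs at once
    weights = softmax(scores, axis=-1)         # row i: where token i looks
    return weights @ V                         # (seq_len, d_k) updated representations

# Toy usage: 10 tokens, d_model = 512, per-head d_k = 64 as in the paper
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 512))
W_q, W_k, W_v = (0.02 * rng.standard_normal((512, 64)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)         # no loop over time anywhere
```

Notice there's no loop over positions: the whole sequence is handled by a few matrix multiplications, which is exactly what GPUs are good at.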
The Result
Transformers made RNNs mostly irrelevant for NLP. GPT, BERT, every LLM you use today - all transformers. Faster training, full parallelism within sequences, and a single attention hop between any two tokens instead of distance decay.
The sequential bottleneck was dead. Attention was all you needed.