"Attention Is All You Need" - the paper title wasn't hype, it was accurate. RNNs are done.
The Sequential Trap
Seq2seq modeling ran on recurrent networks - LSTMs, GRUs - with the occasional convolutional variant. The recurrent core was strictly sequential: each token waited for the previous one, h_t depended on h_{t-1}, and you couldn't parallelize within a sequence.
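Here's a minimal sketch of that recurrence, in plain NumPy with a toy tanh cell (names and sizes are illustrative, not from any particular paper). The point is the loop: every step reads the hidden state produced by the one before it, so there is nothing to parallelize across time.

```python
import numpy as np

def rnn_forward(x, W_h, W_x, b):
    """Vanilla tanh RNN over a sequence, one step at a time.

    x: (seq_len, d_in) inputs; W_h: (d_h, d_h); W_x: (d_in, d_h); b: (d_h,)
    """
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:                                 # strictly sequential loop
        h = np.tanh(h @ W_h + x_t @ W_x + b)      # h_t depends on h_{t-1}
        states.append(h)
    return np.stack(states)                       # (seq_len, d_h)
```

Adding more GPUs doesn't help here: step t can't start until step t-1 finishes.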
Encoder-decoder architectures used RNNs for everything: building representations (encoding) and modeling dependencies (decoding). Attention was just a helper mechanism bolted on top to connect the two.
The Bottleneck
Long sequences killed performance. For token 1 to influence token 100, its signal had to survive 99 recurrent hops, degrading at each one. And a 512-token sequence meant 512 sequential steps, no matter how much hardware you threw at it.
Attention helped - let decoders peek at encoder states directly - but RNNs still did the heavy lifting. The sequential chain remained.
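Roughly what that helper looked like, sketched with dot-product scores for brevity (the early mechanisms used a small learned scoring network, but the shape of the idea is the same): one decoder state gets a direct, one-hop view of every encoder state. The encoder states themselves, though, still came out of the sequential loop above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Weighted sum of encoder states, weighted by similarity to the decoder state.

    decoder_state: (d,); encoder_states: (src_len, d)
    """
    scores = encoder_states @ decoder_state    # one score per source token
    weights = softmax(scores)                  # attention distribution
    return weights @ encoder_states            # (d,) context vector
```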
The Insight
What if attention wasn't a helper? What if it was everything?
Self-attention: all tokens attend to all tokens in parallel. No sequential dependency chain. Full GPU utilization. Transformers handle both representation building and dependency modeling without a single RNN.
How Self-Attention Works
At the core of self-attention are learned weight matrices that transform input tokens:
- X = input token representations (d_model-dimensional)
- W_q, W_k, W_v = learned weight matrices
These matrices create three representations:
- Q = XW_q (queries: what I'm looking for)
- K = XW_k (keys: what I contain)
- V = XW_v (values: what I'll pass forward)
What's d_model? It's the width of the embedding space - the size of the representation carried through the transformer. Every token, at every layer, keeps this same dimensionality. In the original paper, d_model = 512.
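Putting the pieces together: the paper combines them as scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, where d_k is the key dimension (64 per head in the original setup). Below is a minimal single-head NumPy sketch with toy sizes and random matrices standing in for learned weights; the real model adds multiple heads, masking, residuals, and output projections.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: every token attends to every token in one shot.

    X: (seq_len, d_model) token representations
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len): all pairs at once
    weights = softmax(scores, axis=-1)         # row i: where token i looks
    return weights @ V                         # (seq_len, d_k) updated representations

# Toy usage: 10 tokens, d_model = 512, per-head d_k = 64 as in the paper
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 512))
W_q, W_k, W_v = (0.02 * rng.standard_normal((512, 64)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)         # no loop over time anywhere
```

Notice there's no loop over positions: the whole sequence is handled by a few matrix multiplications, which is exactly what GPUs are good at.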
The Result
Transformers made RNNs mostly irrelevant for NLP. GPT, BERT, every LLM you use today - all transformers. Faster training, full parallelism within sequences, and a single attention hop between any two tokens instead of distance decay.
The sequential bottleneck was dead. Attention was all you needed.