
Why Transformers Killed RNNs

"Attention Is All You Need" - the paper title wasn't hype, it was accurate. RNNs are done.

The Sequential Trap

Seq2seq modeling ran on RNNs - LSTMs and GRUs, with convolutional models as the occasional alternative. The RNN versions were strictly sequential. Each token waited for the previous one: h_t depended on h_{t-1}. You couldn't parallelize within a sequence.
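
A minimal sketch of that recurrence, in plain numpy with made-up sizes (the variable names and shapes here are illustrative, not from the paper). The point is the loop: step t can't start until step t-1 finishes.

```python
import numpy as np

# Hypothetical sizes, for illustration only.
seq_len, d_input, d_hidden = 100, 64, 128

rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_input))              # input tokens
W_xh = rng.normal(size=(d_input, d_hidden)) * 0.01   # input-to-hidden weights
W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.01  # hidden-to-hidden weights
h = np.zeros(d_hidden)

# The defining constraint: each step reads the previous hidden state,
# so the time loop cannot be parallelized across positions.
for t in range(seq_len):
    h = np.tanh(x[t] @ W_xh + h @ W_hh)   # h_t depends on h_{t-1}
```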

Encoder-decoder architectures used RNNs for everything: building representations (encoding) and modeling dependencies (decoding). Attention was just a helper mechanism bolted on top to connect the two.

The Bottleneck

Long sequences killed performance. Token 1 reaching token 100 required 99 intermediate steps. Information degraded through multiple hops. A 512-token sequence needed 512 sequential operations.

Attention helped - let decoders peek at encoder states directly - but RNNs still did the heavy lifting. The sequential chain remained.

The Insight

What if attention wasn't a helper? What if it was everything?

Self-attention: all tokens attend to all tokens in parallel. No sequential dependency chain. Full GPU utilization. Transformers handle both representation building and dependency modeling without a single RNN.
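
A minimal sketch of that parallel computation, again in numpy with toy shapes. To keep it short, the tokens attend to each other directly here; the learned projections are covered in the next section.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a whole sequence at once.

    X: (seq_len, d) token representations. There is no recurrence:
    every pairwise score comes out of a single matrix multiply.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X                              # weighted mix of the tokens

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))    # 100 tokens, toy dimension
out = self_attention(X)           # (100, 64), computed in one shot
```

Every token-to-token score is produced by one matrix multiply, so a GPU handles all pairs at once - distance between positions no longer adds steps.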

How Self-Attention Works

At the core of self-attention are three learned weight matrices - W_Q, W_K, and W_V - that multiply each token's d_model-dimensional embedding.

These matrices create three representations of each token: queries (what a token is looking for), keys (what a token offers to be matched against), and values (the content that actually gets passed along). Comparing a query against every key produces the attention weights that decide how much of each value flows into the output.

What's d_model? It's the shared embedding size - the width of every token's representation throughout the transformer. Every token, at every layer, keeps this same dimensionality. In the original paper, d_model = 512.
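
A sketch of those projections, assuming the paper's d_model = 512 but using random numpy matrices in place of learned weights and a single head for simplicity:

```python
import numpy as np

d_model = 512                     # representation size from the paper
seq_len = 10                      # toy sequence length

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))        # token embeddings

# Learned projection matrices (random here; trained in a real model).
W_Q = rng.normal(size=(d_model, d_model)) * 0.02
W_K = rng.normal(size=(d_model, d_model)) * 0.02
W_V = rng.normal(size=(d_model, d_model)) * 0.02

Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # queries, keys, values

# Scaled dot-product attention over the projected representations.
scores = Q @ K.T / np.sqrt(d_model)            # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
output = weights @ V                            # (seq_len, d_model)
```

The division by the square root of the dimension keeps the dot products from growing with d_model, which would otherwise push the softmax into regions with vanishingly small gradients.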

The Result

Transformers made RNNs mostly irrelevant for NLP. GPT, BERT, every LLM you use today - all transformers. Faster training, full parallelism within sequences, no distance decay.

The sequential bottleneck was dead. Attention was all you needed.
