Attention Is All You Need

papers.baulab.info

Gist

1.
Eight Google researchers trained a model on 8 GPUs for 12 hours and beat every previously published model and ensemble on WMT 2014 English-German translation, at a fraction of the training cost of competing systems. The architecture they invented, the Transformer, now underlies ChatGPT, Claude, and Gemini. Its secret: drop recurrence and convolution entirely and build the whole model around a single mechanism called attention.

Logic

2.
Recurrent networks read one word at a time — and can't be sped up

RNNs generate each hidden state h_t as a function of the previous state h_{t−1} and the input at position t, creating a chain that must execute sequentially: step 50 cannot be computed before step 49
This sequential bottleneck "becomes critical at longer sequence lengths, as memory constraints limit batching across examples"
Prior work on factorization tricks and conditional computation improved efficiency but "the fundamental constraint of sequential computation, however, remains"
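
The sequential chain described above can be sketched in a few lines. This is a minimal illustration assuming a plain tanh recurrence and NumPy; the names `rnn_forward`, `W_x`, `W_h`, and `b` are hypothetical and do not come from the paper.

```python
import numpy as np

def rnn_forward(x, W_x, W_h, b):
    """Run a plain tanh RNN over a sequence x of shape (n, d_in)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for t in range(x.shape[0]):
        # h_t needs h_{t-1}, so iteration t cannot start before t-1 finishes.
        h = np.tanh(x[t] @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
n, d_in, d_h = 50, 16, 32
x = rng.normal(size=(n, d_in))
W_x = rng.normal(size=(d_in, d_h)) * 0.1
W_h = rng.normal(size=(d_h, d_h)) * 0.1
b = np.zeros(d_h)

H = rnn_forward(x, W_x, W_h, b)
print(H.shape)  # (50, 32)
```

The loop over `t` is the bottleneck: batching helps across sequences, but nothing inside one sequence can be computed out of order.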

3.
Self-attention wins on all three metrics that matter: speed, parallelism, and reach

Computational complexity: self-attention is O(n²·d) per layer versus O(n·d²) for recurrence — faster whenever sequence length n < representation dimension d, "which is most often the case" in practice
Parallelism: self-attention requires O(1) sequential operations per layer; recurrence requires O(n), one step per token in the sequence
Long-range dependencies: self-attention connects any two positions in O(1) path length; recurrence requires O(n) hops and dilated convolution O(log_k(n)), and shorter paths make distant dependencies easier to learn
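
To make the n < d condition concrete, the leading-order terms can be compared directly. This is a rough sketch that drops all constant factors; the function name `per_layer_ops` and the choice d = 512 are assumptions for illustration.

```python
def per_layer_ops(n, d):
    """Leading-order per-layer operation counts, constants dropped."""
    return {
        "self_attention": n * n * d,  # O(n^2 * d)
        "recurrent": n * d * d,       # O(n * d^2)
    }

d = 512  # assumed representation dimension
for n in (50, 512, 4096):
    ops = per_layer_ops(n, d)
    ratio = ops["self_attention"] / ops["recurrent"]  # simplifies to n / d
    print(f"n={n:5d}  attention/recurrent = {ratio:.3f}")
```

The ratio is simply n/d: self-attention is cheaper for sentence-length inputs, breaks even at n = d, and becomes the more expensive layer beyond that.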

4.
The Transformer is a complete architecture, not just an attention trick

Encoder: 6 identical layers, each combining multi-head self-attention with a position-wise feed-forward network (FFN(x) = max(0, xW₁+b₁)W₂+b₂), wrapped in residual connections and layer normalization
Decoder: 6 matching layers adding a third sub-layer for encoder-decoder attention, with future positions masked by setting softmax inputs to −∞ to preserve auto-regressive generation
Multi-head attention splits queries, keys, and values into h=8 parallel heads of dimension 64, letting the model "jointly attend to information from different representation subspaces at different positions"
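Sinusoidal positional encodings inject word order without recurrence: wavelengths form a geometric progression from 2π to 10000·2π, chosen on the hypothesis that the model can then attend by relative position, since the encoding at pos+k is a linear function of the encoding at pos for any fixed offset k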
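
The pieces listed above can be sketched together in NumPy. This is a minimal single-layer illustration, not the authors' implementation: it assumes d_model = 512 (8 heads of 64), uses the paper's 1/sqrt(d_k) score scaling, omits the feed-forward sub-layer, residuals, and layer normalization, and uses random weights. All function and variable names are hypothetical.

```python
import numpy as np

def sinusoidal_positions(n, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(n)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, optionally masking future positions."""
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(Q.shape[-1])
    if causal:
        n = scores.shape[-1]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # -inf before softmax
    weights = softmax(scores)
    return weights @ V, weights

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, h, causal=False):
    n, d_model = x.shape
    d_k = d_model // h

    def split(m):  # (n, d_model) -> (h, n, d_k)
        return m.reshape(n, h, d_k).transpose(1, 0, 2)

    heads, weights = attention(split(x @ W_q), split(x @ W_k), split(x @ W_v), causal)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # rejoin the heads
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, h = 10, 512, 8
pe = sinusoidal_positions(n, d_model)
x = rng.normal(size=(n, d_model)) + pe  # stand-in embeddings plus positions
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
out, weights = multi_head_self_attention(x, W_q, W_k, W_v, W_o, h, causal=True)
print(out.shape, weights.shape)  # (10, 512) (8, 10, 10)
```

With `causal=True`, every attention weight above the diagonal is exactly zero, which is how the decoder is kept from looking at positions it has not generated yet.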

5.
28.4 BLEU on English-German: better than every model and every ensemble, at a fraction of the cost

Transformer (big) scored 28.4 BLEU on WMT 2014 EN-DE, beating the best previous result, an ensemble (ConvS2S at 26.36), by more than 2 BLEU points
On EN-FR, it hit 41.8 BLEU, surpassing all published single models "at less than 1/4 the training cost of the previous state-of-the-art"
The base model alone (65M parameters, 12 hours on 8 P100 GPUs, 3.3×10¹⁸ FLOPs) scored 27.3 BLEU on EN-DE and "surpasses all previously published models and ensembles"
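
The 3.3×10¹⁸ figure for the base model can be reproduced with back-of-the-envelope arithmetic: training time × number of GPUs × sustained throughput per GPU. The sketch below assumes the 9.5 TFLOPS sustained single-precision figure the paper uses for a P100; that number comes from the paper, not from the summary above, and the function name is hypothetical.

```python
P100_SUSTAINED_FLOPS = 9.5e12  # assumed sustained single-precision FLOPS per P100

def training_flops(hours, gpus, flops_per_gpu=P100_SUSTAINED_FLOPS):
    """Estimate total training compute as time x devices x sustained throughput."""
    return hours * 3600 * gpus * flops_per_gpu

base = training_flops(hours=12, gpus=8)
print(f"{base:.2e}")  # 3.28e+18
```

The estimate rounds to the 3.3×10¹⁸ FLOPs quoted for the base model.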

6.
Same architecture, minimal tuning, near-SOTA on a completely different task

A 4-layer Transformer trained on just 40K Penn Treebank sentences scored 91.3 F1 on English constituency parsing — beating BerkeleyParser (90.4) and nearly matching the specialized Recurrent Neural Network Grammar (91.7)
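With semi-supervised data (~17M sentences), it reached 92.7 F1, ahead of every other semi-supervised result in the paper's comparison; only a multi-task model (93.0) and the generative Recurrent Neural Network Grammar (93.3) scored higher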
The authors "performed only a small number of experiments" to adapt hyperparameters — beam size, dropout, learning rate — leaving all other parameters unchanged from the translation model

7.
Every design choice survives ablation — the architecture earns its complexity

Single-head attention scores 24.9 BLEU on the newstest2013 development set, 0.9 below the 8-head base model (25.8); too many heads also degrades quality, with 32 heads falling to 25.4
Dropping dropout entirely (P_drop=0.0) costs 1.2 BLEU, from 25.8 to 24.6 — regularization is load-bearing, not optional
Reducing model depth from 6 layers to 2 drops BLEU from 25.8 to 23.7; scaling d_model to 1024 pushes it to 26.0, in line with the paper's observation that bigger models are better
Learned positional embeddings versus sinusoidal: 25.7 versus 25.8 BLEU, "nearly identical"; the authors kept the sinusoidal version because it may extrapolate to sequences longer than those seen in training

Counter-Argument

8.
The architecture that "replaces" sequential processing still generates one word at a time

The parallelization advantage is a training-time benefit. At inference, the decoder remains auto-regressive — each token depends on every token before it. The paper's own conclusion admits "making generation less sequential is another research goal."
Self-attention's O(n²·d) complexity means cost scales quadratically with sequence length. The authors acknowledge it's faster than recurrence only when n < d, and their proposed fix — restricted attention with neighborhood size r — is explicitly deferred: "We plan to investigate this approach further in future work."
The entire evaluation covers two European translation pairs and one English parsing task. No language modeling, no summarization, no question answering, no typologically distant languages. The "generalization" claim rests on a single non-translation experiment with a 4-layer model that didn't reach state of the art.
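
The first point above, that inference stays sequential, can be illustrated with a toy greedy decoding loop. This is only a sketch: `toy_next_token` is a hypothetical stand-in for a full decoder forward pass, not a real model.

```python
def toy_next_token(prefix):
    """Hypothetical stand-in for a decoder forward pass over the prefix."""
    return (sum(prefix) + len(prefix)) % 7

def greedy_decode(start, eos, max_len):
    tokens = [start]
    calls = 0
    while len(tokens) < max_len:
        nxt = toy_next_token(tokens)  # depends on every token generated so far
        calls += 1
        tokens.append(nxt)
        if nxt == eos:
            break
    return tokens, calls

tokens, calls = greedy_decode(start=1, eos=0, max_len=10)
print(len(tokens) - 1 == calls)  # one model call per generated token
```

However parallel training is, generation still needs one forward pass per output token, because each pass consumes the token the previous pass produced.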

Steelman

9.
The paper's real contribution isn't the architecture — it's the economics of intelligence

Both the original argument (attention beats recurrence on three desiderata) and the counter-argument (quadratic scaling, sequential inference) share a hidden assumption: that the layer type is what matters. It isn't. What matters is that the Transformer made training large language models economically viable for the first time.
The 12-hours-on-8-GPUs result didn't just beat a benchmark — it broke a cost barrier. Every model the reader has touched (GPT-4, Claude, Gemini) exists because this architecture made scale affordable. The quadratic wall is real; it's also irrelevant for the sequence lengths where language actually lives.
The paper's authors wrote "we are excited about the future of attention-based models" — the most understated sentence in the history of computer science. They didn't just propose a new layer type. They handed the field a printing press and called it a font.
