Attention Is All You Need

papers.baulab.info

Gist

1.

Eight Google researchers trained a model on 8 GPUs for 12 hours and beat every previously published translation model and ensemble on English-to-German, at a fraction of their training cost. The architecture they introduced, the Transformer, now underlies ChatGPT, Claude, and Gemini. Its core move: drop recurrence and convolution entirely and build the whole model around a single mechanism called attention.

Logic

2.

Recurrent networks read one word at a time, and that chain can't be parallelized

  • RNNs generate each hidden state h_t as a function of the previous state h_{t−1} and the input at position t, creating a chain that must execute sequentially: step 50 cannot be computed before step 49
  • This sequential bottleneck "becomes critical at longer sequence lengths, as memory constraints limit batching across examples"
  • Prior work on factorization tricks and conditional computation improved efficiency but "the fundamental constraint of sequential computation, however, remains"
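The sequential dependency described above can be sketched with a toy scalar recurrence. This is a minimal illustration, not the recurrent models the paper compares against: the function name `rnn_forward`, the scalar weights `w_h` and `w_x`, and the `tanh` nonlinearity are all assumptions chosen for brevity.

```python
import math

def rnn_forward(inputs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN (hypothetical weights): h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = 0.0
    states = []
    for x in inputs:
        # Each iteration reads the h produced by the previous iteration,
        # so the loop cannot be split across positions and run in parallel.
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

states = rnn_forward([1.0, 0.0, 0.0])
print(states)
```

Changing only the first input changes every later state, which is the chain the bullets describe: position t cannot be evaluated until position t−1 is done.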

3.

Self-attention wins on all three metrics that matter: speed, parallelism, and reach

  • Computational complexity: self-attention is O(n²·d) per layer versus O(n·d²) for recurrence — faster whenever sequence length n < representation dimension d, "which is most often the case" in practice
  • Parallelism: self-attention requires O(1) sequential operations per layer; recurrence requires O(n), one step per token in the sequence
  • Long-range dependencies: self-attention connects any two positions in O(1) path length; recurrence requires O(n) hops, and convolution requires O(n/k) layers with contiguous kernels or O(log_k(n)) with dilated ones. Shorter paths make distant dependencies easier to learn
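To make the n < d condition concrete, the leading-order costs can be compared directly. This is a rough sketch that drops all constant factors; the sentence length n = 50 is an illustrative assumption, and d = 512 corresponds to 8 heads of dimension 64. The helper name `per_layer_ops` is hypothetical.

```python
def per_layer_ops(n, d):
    """Leading-order per-layer operation counts, constant factors dropped."""
    return {
        "self_attention": n * n * d,  # O(n^2 * d)
        "recurrent": n * d * d,       # O(n * d^2)
    }

# Assumed example: a 50-token sentence with representation dimension 512.
ops = per_layer_ops(n=50, d=512)
print(ops["self_attention"])                     # 1280000
print(ops["recurrent"])                          # 13107200
print(ops["recurrent"] / ops["self_attention"])  # 10.24, which is d / n
```

The ratio is always d / n, so the two costs meet at n = d and self-attention becomes the more expensive layer once sequences grow longer than the representation dimension.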

4.

The Transformer is a complete architecture, not just an attention trick

  • Encoder: 6 identical layers, each combining multi-head self-attention with a position-wise feed-forward network (FFN(x) = max(0, xW₁+b₁)W₂+b₂), wrapped in residual connections and layer normalization
  • Decoder: 6 matching layers adding a third sub-layer for encoder-decoder attention; in the decoder's self-attention, future positions are masked by setting softmax inputs to −∞ to preserve auto-regressive generation
  • Multi-head attention splits queries, keys, and values into h=8 parallel heads of dimension 64, letting the model "jointly attend to information from different representation subspaces at different positions"
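  • Sinusoidal positional encodings inject word order without recurrence. Wavelengths form a geometric progression from 2π to 10000·2π, a choice the authors hypothesized would let the model attend by relative position, since the encoding at a fixed offset is a linear function of the encoding at the current position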
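Two of the pieces above, masked attention and the sinusoidal encoding, can be sketched in plain Python. This is a single-head, loop-based illustration under stated assumptions, not the paper's batched implementation: the formulas softmax(QKᵀ/√d_k)V and PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(same angle) are taken from the paper rather than from this summary, and the function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Single-head scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if causal and j > i:
                scores.append(float("-inf"))  # mask future positions before softmax
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position; d_model is assumed to be even."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

X = [[1.0, 0.0], [0.0, 1.0]]
print(attention(X, X, X, causal=True)[0])  # [1.0, 0.0]: position 0 sees only itself
print(positional_encoding(0, 4))           # [0.0, 1.0, 0.0, 1.0]
```

With the mask on, the first output row equals the first value vector exactly, because every later position receives zero weight after the softmax.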

5.

28.4 BLEU on English-German: better than every model and every ensemble, at a fraction of the cost

  • Transformer (big) scored 28.4 BLEU on WMT 2014 EN-DE, beating the best previous result, an ensemble (ConvS2S at 26.36), by just over 2 BLEU points
  • On EN-FR, it hit 41.8 BLEU, surpassing all published single models "at less than 1/4 the training cost of the previous state-of-the-art"
  • The base model alone (65M parameters, 12 hours on 8 P100 GPUs, 3.3×10¹⁸ FLOPs) scores 27.3 BLEU and "surpasses all previously published models and ensembles" on EN-DE
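The 3.3×10¹⁸ figure can be reproduced with back-of-the-envelope arithmetic: training time, times number of GPUs, times sustained throughput per GPU. The sketch below assumes 9.5 TFLOPS of sustained single-precision throughput per P100, the value the paper's footnote uses for its estimates; that number does not appear in this summary, and the function name is hypothetical.

```python
def estimate_training_flops(hours, gpus, flops_per_gpu):
    """Estimate total training FLOPs as time * devices * sustained throughput."""
    return hours * 3600 * gpus * flops_per_gpu

# Assumption: 9.5e12 FLOPS sustained per P100 (taken from the paper's footnote).
base_model = estimate_training_flops(hours=12, gpus=8, flops_per_gpu=9.5e12)
print(f"{base_model:.1e}")  # 3.3e+18
```

This is an estimate of hardware capacity consumed rather than a count of operations the model actually executed, so it is best read as an upper-level cost comparison between systems.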

6.

Same architecture, minimal tuning, near-SOTA on a completely different task

  • A 4-layer Transformer trained on just 40K Penn Treebank sentences scored 91.3 F1 on English constituency parsing — beating BerkeleyParser (90.4) and nearly matching the specialized Recurrent Neural Network Grammar (91.7)
  • With semi-supervised data (~17M sentences), it reached 92.7 F1, exceeding all models except two task-specific architectures
  • The authors "performed only a small number of experiments" to adapt hyperparameters — beam size, dropout, learning rate — leaving all other parameters unchanged from the translation model

7.

Every design choice survives ablation — the architecture earns its complexity

  • Single-head attention costs 0.9 BLEU versus the best setting; too many heads also hurts, with 32 heads dropping to 25.4 BLEU against the base model's 25.8
  • Dropping dropout entirely (P_drop=0.0) costs 1.2 BLEU, from 25.8 to 24.6 — regularization is load-bearing, not optional
  • Reducing model depth from 6 layers to 2 drops BLEU from 25.8 to 23.7; scaling d_model to 1024 pushes it to 26.0. Bigger is generally better
  • Learned positional embeddings versus sinusoidal: 25.7 versus 25.8 BLEU, "nearly identical." The authors kept the sinusoidal version because it may extrapolate to sequences longer than those seen in training

Counter-Argument

8.

The architecture that "replaces" sequential processing still generates one word at a time

  • The parallelization advantage is a training-time benefit. At inference, the decoder remains auto-regressive — each token depends on every token before it. The paper's own conclusion admits "making generation less sequential is another research goal."
  • Self-attention's O(n²·d) complexity means cost scales quadratically with sequence length. The authors acknowledge it's faster than recurrence only when n < d, and their proposed fix — restricted attention with neighborhood size r — is explicitly deferred: "We plan to investigate this approach further in future work."
  • The entire evaluation covers two European translation pairs and one English parsing task. No language modeling, no summarization, no question answering, no typologically distant languages. The "generalization" claim rests on a single non-translation experiment with a 4-layer model that didn't reach state of the art.
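The first point above, that generation stays sequential, can be sketched as a greedy decoding loop. This is a toy illustration only: `toy_next_token` is a hypothetical stand-in for a trained decoder, and the integer tokens and `end` marker are assumptions made so the example runs without a model.

```python
def greedy_decode(next_token, start, end, max_len=10):
    """Generate tokens one at a time; each call needs the whole prefix so far."""
    tokens = [start]
    while len(tokens) < max_len:
        tok = next_token(tokens)  # cannot run until the previous token exists
        tokens.append(tok)
        if tok == end:
            break
    return tokens

def toy_next_token(prefix):
    """Hypothetical decoder stand-in: counts up to 3, then emits the end marker -1."""
    return prefix[-1] + 1 if prefix[-1] < 3 else -1

print(greedy_decode(toy_next_token, start=0, end=-1))  # [0, 1, 2, 3, -1]
```

During training the target sequence is known in advance, so all positions can be processed at once behind the mask; at inference each `next_token` call has to wait for the one before it.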

Steelman

9.

The paper's real contribution isn't the architecture — it's the economics of intelligence

  • Both the original argument (attention beats recurrence on three desiderata) and the counter-argument (quadratic scaling, sequential inference) share a hidden assumption: that the layer type is what matters. It isn't. What matters is that the Transformer made training large language models economically viable for the first time.
  • The 12-hours-on-8-GPUs result didn't just beat a benchmark — it broke a cost barrier. Every model the reader has touched (GPT-4, Claude, Gemini) exists because this architecture made scale affordable. The quadratic wall is real; it's also irrelevant for the sequence lengths where language actually lives.
  • The paper's authors wrote "we are excited about the future of attention-based models" — the most understated sentence in the history of computer science. They didn't just propose a new layer type. They handed the field a printing press and called it a font.
