A
Full transcript (Instant)

What if Neural Networks had SVDs?

A GPU with 4,000 cores can only use 100 of them when computing sequential inner products on 100-dimensional vectors — and that idle capacity is why the SVD reparameterization, theoretically faster tha

Unknown · proceedings.neurips.cc

Gist

1.

A GPU with 4,000 cores can only use 100 of them when computing sequential inner products on 100-dimensional vectors — and that idle capacity is why the SVD reparameterization, theoretically faster than computing the SVD itself, has never worked in practice. FastH fixes it, speeding up matrix inversion 2.7× and determinant 3.5× on a single RTX 2080 Ti.

Logic

2.

The SVD unlocks operations that standard methods can't

  • Matrix determinant, inverse, exponential, Cayley map, spectral normalization, and low-rank compression all become faster when you have the SVD of the weight matrix W = UΣVᵀ
  • Standard methods like TORCH.INVERSE and Padé approximation run in O(d³) time; SVD-based methods drop to O(d²)
  • On a d × d matrix, computing the SVD itself takes O(d³) — no faster than the operations it's supposed to accelerate

3.

The SVD reparameterization avoids computing the SVD — in theory

  • Zhang et al. (ICML 2018) decompose orthogonal matrices U and V into products of d Householder matrices H₁, ..., H_d, allowing gradient descent to preserve orthogonality without explicit SVD computation
  • Each Householder multiplication takes O(dm) time for a mini-batch of size m; d sequential multiplications yield O(d²m) total time — asymptotically faster than O(d³)
  • The reparameterization was originally developed to force singular values into [1 ± ε] for RNNs, not to speed up matrix operations

4.

Sequential inner products kill the speed-up on GPUs

  • Each Householder multiplication entails computing an inner product; d sequential multiplications require O(d) sequential inner products
  • A GPU with 4,000 cores computing inner products on 100-dimensional vectors can only utilize 100 cores simultaneously — 3,900 sit idle
  • The parallel algorithm from Zhang et al. takes O(d³) time, no faster than computing the SVD itself; the sequential algorithm crashes at d > 448

5.

FastH replaces sequential inner products with parallel matrix multiplications

  • The key is a 1987 result from Bischof and Van Loan: any product of m Householder matrices can be written as I − 2WYᵀ, where W and Y are d × m matrices computable in O(dm) time with m sequential Householder multiplications
  • FastH divides the d Householder matrices into d/m products Pᵢ, each of m matrices; all d/m products are computed in parallel
  • Forward pass: O(d²m) time, O(d/m + m) sequential matrix multiplications instead of O(d) sequential inner products
  • Backward pass: O(d²m) time, O(d/m + m) sequential matrix multiplications; gradients for each Householder vector are computed in parallel

6.

FastH is 27× faster than the sequential algorithm and 6.2× faster than the parallel one

  • At d = 448, FastH is roughly 25× faster than the sequential algorithm; at d = 3072, FastH is faster than the sequential algorithm at d = 448
  • At d > 500, relative improvement increases with d; at d = 3072, FastH is 6.2× faster than the parallel algorithm
  • On matrix operations at d = 768: 3.1× faster than Cayley map, 4.1× faster than matrix exponential, 2.7× faster than inverse, 3.5× faster than determinant — all measured as total time including forward pass, operation, and backward pass

7.

FastH computes the same output — no expressiveness trade-off

  • FastH computes H₁···H_d X, identical to the sequential algorithm; previous work often limited the number of Householder matrices to reduce computation, sacrificing expressiveness
  • The parameter k controls a time-parallelism trade-off: O(d²k + d²m) time, O(d/k + k) sequential operations, space O(d²m/k); k = O(√d) minimizes sequential work
  • Code is a single-line change from NN.LINEAR to LINEAR_SVD; implemented in CUDA because Python's parallelization was inadequate

Counter-Argument

8.

The paper's own numbers show FastH barely beats standard methods on the operations it was built to accelerate

  • At d = 768, FastH is 2.7× faster than TORCH.INVERSE — a factor of 2.7, not 27.1×; the 27.1× figure compares FastH to the sequential algorithm from Zhang et al., which the authors themselves call "not fast enough to speed up any matrix operation"
  • The Cayley map and matrix exponential approaches, which have desirable provable guarantees the Householder decomposition lacks, are only 3.1× and 4.1× slower — close enough that the theoretical gap in guarantees matters
  • The authors acknowledge the Householder decomposition's missing guarantees but offer no fix; the paper's central contribution is a faster way to do something that may be theoretically unsound

Steelman

9.

FastH's real value isn't speed — it's the single-line API that makes SVD-aware networks accessible to everyone

  • Both the paper and its critics argue about benchmarks on a single GPU; they share the assumption that FastH's contribution is a faster algorithm — but the deeper move is making SVD-preserving gradient descent a drop-in replacement for NN.LINEAR
  • Before FastH, using the SVD reparameterization meant implementing Householder decompositions, WY representations, and parallel CUDA kernels by hand; after FastH, a PhD student with one RTX 2080 Ti can experiment with SVD-aware architectures without writing a single line of linear algebra
  • The 27.1× figure is a distraction; the 2.7× figure is a floor — the ceiling is the number of researchers who will now build SVD-aware networks because the barrier to entry has collapsed

Original

Continue Reading

Full transcript (Deep)

What if Neural Networks had SVDs?

A GPU with 4,000 cores can only use 100 of them when computing sequential inner products on 100-dimensional vectors — and that idle capacity is why the SVD reparameterization, theoretically faster tha

Unknown · proceedings.neurips.cc

Gist

1.

Original

Continue Reading

Transcript

What if Neural Networks had SVDs?

A GPU with 4,000 cores can only use 100 of them when computing sequential inner products on 100-dimensional vectors — and that idle capacity is why the SVD reparameterization, theoretically faster tha

Unknown · proceedings.neurips.cc

Gist

1.

A GPU with 4,000 cores can only use 100 of them when computing sequential inner products on 100-dimensional vectors — and that idle capacity is why the SVD reparameterization, theoretically faster than computing the SVD itself, has never worked in practice. FastH fixes it, speeding up matrix inversion 2.7× and determinant 3.5× on a single RTX 2080 Ti.

Logic

2.

The SVD unlocks operations that standard methods can't

  • Matrix determinant, inverse, exponential, Cayley map, spectral normalization, and low-rank compression all become faster when you have the SVD of the weight matrix W = UΣVᵀ
  • Standard methods like TORCH.INVERSE and Padé approximation run in O(d³) time; SVD-based methods drop to O(d²)
  • On a d × d matrix, computing the SVD itself takes O(d³) — no faster than the operations it's supposed to accelerate

3.

The SVD reparameterization avoids computing the SVD — in theory

  • Zhang et al. (ICML 2018) decompose orthogonal matrices U and V into products of d Householder matrices H₁, ..., H_d, allowing gradient descent to preserve orthogonality without explicit SVD computation
  • Each Householder multiplication takes O(dm) time for a mini-batch of size m; d sequential multiplications yield O(d²m) total time — asymptotically faster than O(d³)
  • The reparameterization was originally developed to force singular values into [1 ± ε] for RNNs, not to speed up matrix operations

4.

Sequential inner products kill the speed-up on GPUs

  • Each Householder multiplication entails computing an inner product; d sequential multiplications require O(d) sequential inner products
  • A GPU with 4,000 cores computing inner products on 100-dimensional vectors can only utilize 100 cores simultaneously — 3,900 sit idle
  • The parallel algorithm from Zhang et al. takes O(d³) time, no faster than computing the SVD itself; the sequential algorithm crashes at d > 448

5.

FastH replaces sequential inner products with parallel matrix multiplications

  • The key is a 1987 result from Bischof and Van Loan: any product of m Householder matrices can be written as I − 2WYᵀ, where W and Y are d × m matrices computable in O(dm) time with m sequential Householder multiplications
  • FastH divides the d Householder matrices into d/m products Pᵢ, each of m matrices; all d/m products are computed in parallel
  • Forward pass: O(d²m) time, O(d/m + m) sequential matrix multiplications instead of O(d) sequential inner products
  • Backward pass: O(d²m) time, O(d/m + m) sequential matrix multiplications; gradients for each Householder vector are computed in parallel

6.

FastH is 27× faster than the sequential algorithm and 6.2× faster than the parallel one

  • At d = 448, FastH is roughly 25× faster than the sequential algorithm; at d = 3072, FastH is faster than the sequential algorithm at d = 448
  • At d > 500, relative improvement increases with d; at d = 3072, FastH is 6.2× faster than the parallel algorithm
  • On matrix operations at d = 768: 3.1× faster than Cayley map, 4.1× faster than matrix exponential, 2.7× faster than inverse, 3.5× faster than determinant — all measured as total time including forward pass, operation, and backward pass

7.

FastH computes the same output — no expressiveness trade-off

  • FastH computes H₁···H_d X, identical to the sequential algorithm; previous work often limited the number of Householder matrices to reduce computation, sacrificing expressiveness
  • The parameter k controls a time-parallelism trade-off: O(d²k + d²m) time, O(d/k + k) sequential operations, space O(d²m/k); k = O(√d) minimizes sequential work
  • Code is a single-line change from NN.LINEAR to LINEAR_SVD; implemented in CUDA because Python's parallelization was inadequate

Counter-Argument

8.

The paper's own numbers show FastH barely beats standard methods on the operations it was built to accelerate

  • At d = 768, FastH is 2.7× faster than TORCH.INVERSE — a factor of 2.7, not 27.1×; the 27.1× figure compares FastH to the sequential algorithm from Zhang et al., which the authors themselves call "not fast enough to speed up any matrix operation"
  • The Cayley map and matrix exponential approaches, which have desirable provable guarantees the Householder decomposition lacks, are only 3.1× and 4.1× slower — close enough that the theoretical gap in guarantees matters
  • The authors acknowledge the Householder decomposition's missing guarantees but offer no fix; the paper's central contribution is a faster way to do something that may be theoretically unsound

Steelman

9.

FastH's real value isn't speed — it's the single-line API that makes SVD-aware networks accessible to everyone

  • Both the paper and its critics argue about benchmarks on a single GPU; they share the assumption that FastH's contribution is a faster algorithm — but the deeper move is making SVD-preserving gradient descent a drop-in replacement for NN.LINEAR
  • Before FastH, using the SVD reparameterization meant implementing Householder decompositions, WY representations, and parallel CUDA kernels by hand; after FastH, a PhD student with one RTX 2080 Ti can experiment with SVD-aware architectures without writing a single line of linear algebra
  • The 27.1× figure is a distraction; the 2.7× figure is a floor — the ceiling is the number of researchers who will now build SVD-aware networks because the barrier to entry has collapsed

Original

Continue Reading