Full transcript (Instant)

What if Neural Networks had SVDs?

A GPU with 4,000 cores can only use 100 of them when computing sequential inner products on 100-dimensional vectors — and that idle capacity is why the SVD reparameterization, theoretically faster tha

Unknown · proceedings.neurips.cc

Gist

1.
A GPU with 4,000 cores can only use 100 of them when computing sequential inner products on 100-dimensional vectors — and that idle capacity is why the SVD reparameterization, theoretically faster than computing the SVD itself, has never worked in practice. FastH fixes it, speeding up matrix inversion 2.7× and determinant 3.5× on a single RTX 2080 Ti.

Logic

2.
The SVD unlocks operations that standard methods can't

Matrix determinant, inverse, exponential, Cayley map, spectral normalization, and low-rank compression all become faster when you have the SVD of the weight matrix W = UΣVᵀ
Standard methods like TORCH.INVERSE and Padé approximation run in O(d³) time; SVD-based methods drop to O(d²)
On a d × d matrix, computing the SVD itself takes O(d³) — no faster than the operations it's supposed to accelerate

3.
The SVD reparameterization avoids computing the SVD — in theory

Zhang et al. (ICML 2018) decompose orthogonal matrices U and V into products of d Householder matrices H₁, ..., H_d, allowing gradient descent to preserve orthogonality without explicit SVD computation
Each Householder multiplication takes O(dm) time for a mini-batch of size m; d sequential multiplications yield O(d²m) total time — asymptotically faster than O(d³)
The reparameterization was originally developed to force singular values into [1 ± ε] for RNNs, not to speed up matrix operations

4.
Sequential inner products kill the speed-up on GPUs

Each Householder multiplication entails computing an inner product; d sequential multiplications require O(d) sequential inner products
A GPU with 4,000 cores computing inner products on 100-dimensional vectors can only utilize 100 cores simultaneously — 3,900 sit idle
The parallel algorithm from Zhang et al. takes O(d³) time, no faster than computing the SVD itself; the sequential algorithm crashes at d > 448

5.
FastH replaces sequential inner products with parallel matrix multiplications

The key is a 1987 result from Bischof and Van Loan: any product of m Householder matrices can be written as I − 2WYᵀ, where W and Y are d × m matrices computable in O(dm) time with m sequential Householder multiplications
FastH divides the d Householder matrices into d/m products Pᵢ, each of m matrices; all d/m products are computed in parallel
Forward pass: O(d²m) time, O(d/m + m) sequential matrix multiplications instead of O(d) sequential inner products
Backward pass: O(d²m) time, O(d/m + m) sequential matrix multiplications; gradients for each Householder vector are computed in parallel

6.
FastH is 27× faster than the sequential algorithm and 6.2× faster than the parallel one

At d = 448, FastH is roughly 25× faster than the sequential algorithm; at d = 3072, FastH is faster than the sequential algorithm at d = 448
At d > 500, relative improvement increases with d; at d = 3072, FastH is 6.2× faster than the parallel algorithm
On matrix operations at d = 768: 3.1× faster than Cayley map, 4.1× faster than matrix exponential, 2.7× faster than inverse, 3.5× faster than determinant — all measured as total time including forward pass, operation, and backward pass

7.
FastH computes the same output — no expressiveness trade-off

FastH computes H₁···H_d X, identical to the sequential algorithm; previous work often limited the number of Householder matrices to reduce computation, sacrificing expressiveness
The parameter k controls a time-parallelism trade-off: O(d²k + d²m) time, O(d/k + k) sequential operations, space O(d²m/k); k = O(√d) minimizes sequential work
Code is a single-line change from NN.LINEAR to LINEAR_SVD; implemented in CUDA because Python's parallelization was inadequate

Counter-Argument

8.
The paper's own numbers show FastH barely beats standard methods on the operations it was built to accelerate

At d = 768, FastH is 2.7× faster than TORCH.INVERSE — a factor of 2.7, not 27.1×; the 27.1× figure compares FastH to the sequential algorithm from Zhang et al., which the authors themselves call "not fast enough to speed up any matrix operation"
The Cayley map and matrix exponential approaches, which have desirable provable guarantees the Householder decomposition lacks, are only 3.1× and 4.1× slower — close enough that the theoretical gap in guarantees matters
The authors acknowledge the Householder decomposition's missing guarantees but offer no fix; the paper's central contribution is a faster way to do something that may be theoretically unsound

Steelman

9.
FastH's real value isn't speed — it's the single-line API that makes SVD-aware networks accessible to everyone

Both the paper and its critics argue about benchmarks on a single GPU; they share the assumption that FastH's contribution is a faster algorithm — but the deeper move is making SVD-preserving gradient descent a drop-in replacement for NN.LINEAR
Before FastH, using the SVD reparameterization meant implementing Householder decompositions, WY representations, and parallel CUDA kernels by hand; after FastH, a PhD student with one RTX 2080 Ti can experiment with SVD-aware architectures without writing a single line of linear algebra
The 27.1× figure is a distraction; the 2.7× figure is a floor — the ceiling is the number of researchers who will now build SVD-aware networks because the barrier to entry has collapsed

Original

Transcript

What if Neural Networks had SVDs?

Unknown · proceedings.neurips.cc

Gist

1.
A GPU with 4,000 cores can only use 100 of them when computing sequential inner products on 100-dimensional vectors — and that idle capacity is why the SVD reparameterization, theoretically faster than computing the SVD itself, has never worked in practice. FastH fixes it, speeding up matrix inversion 2.7× and determinant 3.5× on a single RTX 2080 Ti.

Logic

2.
The SVD unlocks operations that standard methods can't

Matrix determinant, inverse, exponential, Cayley map, spectral normalization, and low-rank compression all become faster when you have the SVD of the weight matrix W = UΣVᵀ
Standard methods like TORCH.INVERSE and Padé approximation run in O(d³) time; SVD-based methods drop to O(d²)
On a d × d matrix, computing the SVD itself takes O(d³) — no faster than the operations it's supposed to accelerate

3.
The SVD reparameterization avoids computing the SVD — in theory

Zhang et al. (ICML 2018) decompose orthogonal matrices U and V into products of d Householder matrices H₁, ..., H_d, allowing gradient descent to preserve orthogonality without explicit SVD computation
Each Householder multiplication takes O(dm) time for a mini-batch of size m; d sequential multiplications yield O(d²m) total time — asymptotically faster than O(d³)
The reparameterization was originally developed to force singular values into [1 ± ε] for RNNs, not to speed up matrix operations

4.
Sequential inner products kill the speed-up on GPUs

Each Householder multiplication entails computing an inner product; d sequential multiplications require O(d) sequential inner products
A GPU with 4,000 cores computing inner products on 100-dimensional vectors can only utilize 100 cores simultaneously — 3,900 sit idle
The parallel algorithm from Zhang et al. takes O(d³) time, no faster than computing the SVD itself; the sequential algorithm crashes at d > 448

5.
FastH replaces sequential inner products with parallel matrix multiplications

The key is a 1987 result from Bischof and Van Loan: any product of m Householder matrices can be written as I − 2WYᵀ, where W and Y are d × m matrices computable in O(dm) time with m sequential Householder multiplications
FastH divides the d Householder matrices into d/m products Pᵢ, each of m matrices; all d/m products are computed in parallel
Forward pass: O(d²m) time, O(d/m + m) sequential matrix multiplications instead of O(d) sequential inner products
Backward pass: O(d²m) time, O(d/m + m) sequential matrix multiplications; gradients for each Householder vector are computed in parallel

6.
FastH is 27× faster than the sequential algorithm and 6.2× faster than the parallel one

At d = 448, FastH is roughly 25× faster than the sequential algorithm; at d = 3072, FastH is faster than the sequential algorithm at d = 448
At d > 500, relative improvement increases with d; at d = 3072, FastH is 6.2× faster than the parallel algorithm
On matrix operations at d = 768: 3.1× faster than Cayley map, 4.1× faster than matrix exponential, 2.7× faster than inverse, 3.5× faster than determinant — all measured as total time including forward pass, operation, and backward pass

7.
FastH computes the same output — no expressiveness trade-off

FastH computes H₁···H_d X, identical to the sequential algorithm; previous work often limited the number of Householder matrices to reduce computation, sacrificing expressiveness
The parameter k controls a time-parallelism trade-off: O(d²k + d²m) time, O(d/k + k) sequential operations, space O(d²m/k); k = O(√d) minimizes sequential work
Code is a single-line change from NN.LINEAR to LINEAR_SVD; implemented in CUDA because Python's parallelization was inadequate

Counter-Argument

8.
The paper's own numbers show FastH barely beats standard methods on the operations it was built to accelerate

At d = 768, FastH is 2.7× faster than TORCH.INVERSE — a factor of 2.7, not 27.1×; the 27.1× figure compares FastH to the sequential algorithm from Zhang et al., which the authors themselves call "not fast enough to speed up any matrix operation"
The Cayley map and matrix exponential approaches, which have desirable provable guarantees the Householder decomposition lacks, are only 3.1× and 4.1× slower — close enough that the theoretical gap in guarantees matters
The authors acknowledge the Householder decomposition's missing guarantees but offer no fix; the paper's central contribution is a faster way to do something that may be theoretically unsound

Steelman

9.
FastH's real value isn't speed — it's the single-line API that makes SVD-aware networks accessible to everyone

Both the paper and its critics argue about benchmarks on a single GPU; they share the assumption that FastH's contribution is a faster algorithm — but the deeper move is making SVD-preserving gradient descent a drop-in replacement for NN.LINEAR
Before FastH, using the SVD reparameterization meant implementing Householder decompositions, WY representations, and parallel CUDA kernels by hand; after FastH, a PhD student with one RTX 2080 Ti can experiment with SVD-aware architectures without writing a single line of linear algebra
The 27.1× figure is a distraction; the 2.7× figure is a floor — the ceiling is the number of researchers who will now build SVD-aware networks because the barrier to entry has collapsed

Gist

Logic

2. The SVD unlocks operations that standard methods can't

3. The SVD reparameterization avoids computing the SVD — in theory

4. Sequential inner products kill the speed-up on GPUs

5. FastH replaces sequential inner products with parallel matrix multiplications

6. FastH is 27× faster than the sequential algorithm and 6.2× faster than the parallel one

7. FastH computes the same output — no expressiveness trade-off

Counter-Argument

8. The paper's own numbers show FastH barely beats standard methods on the operations it was built to accelerate

Steelman

9. FastH's real value isn't speed — it's the single-line API that makes SVD-aware networks accessible to everyone

Original

Gist

1.

Original

Gist

Logic

2. The SVD unlocks operations that standard methods can't

3. The SVD reparameterization avoids computing the SVD — in theory

4. Sequential inner products kill the speed-up on GPUs

5. FastH replaces sequential inner products with parallel matrix multiplications

6. FastH is 27× faster than the sequential algorithm and 6.2× faster than the parallel one

7. FastH computes the same output — no expressiveness trade-off

Counter-Argument

8. The paper's own numbers show FastH barely beats standard methods on the operations it was built to accelerate

Steelman

9. FastH's real value isn't speed — it's the single-line API that makes SVD-aware networks accessible to everyone

Original

2.
The SVD unlocks operations that standard methods can't

3.
The SVD reparameterization avoids computing the SVD — in theory

4.
Sequential inner products kill the speed-up on GPUs

5.
FastH replaces sequential inner products with parallel matrix multiplications

6.
FastH is 27× faster than the sequential algorithm and 6.2× faster than the parallel one

7.
FastH computes the same output — no expressiveness trade-off

8.
The paper's own numbers show FastH barely beats standard methods on the operations it was built to accelerate

9.
FastH's real value isn't speed — it's the single-line API that makes SVD-aware networks accessible to everyone

2.
The SVD unlocks operations that standard methods can't

3.
The SVD reparameterization avoids computing the SVD — in theory

4.
Sequential inner products kill the speed-up on GPUs

5.
FastH replaces sequential inner products with parallel matrix multiplications

6.
FastH is 27× faster than the sequential algorithm and 6.2× faster than the parallel one

7.
FastH computes the same output — no expressiveness trade-off

8.
The paper's own numbers show FastH barely beats standard methods on the operations it was built to accelerate

9.
FastH's real value isn't speed — it's the single-line API that makes SVD-aware networks accessible to everyone