Graphcore
qm1b-dataset/DATASHEET.md at main · graphcore-research/qm1b-dataset

github.com

Gist

1.

Graphcore spent 5 days and 320 IPUs to compute 1.07 billion quantum chemistry examples, then declined to ship a test set because the data is too low-resolution to serve as a benchmark. QM1B is the largest DFT dataset ever built, and its creators explicitly advise against benchmarking on it.

Logic

2.

Prior datasets were too small to reveal scaling laws

  • QM9 and PCQ each contain fewer than 20 million training examples, at least 50 times fewer than QM1B's 1.07 billion and too few to study neural scaling behavior
  • QM1B contains 1.07 billion examples across 1.09 million unique molecules, each with up to 1,000 conformers generated by RDKit's ETKDG algorithm
  • The explicit goal: enable neural scaling law studies for quantum chemistry, a field where data scarcity has been the bottleneck
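
As a rough sanity check on these figures, the arithmetic below works out the average number of conformers per molecule and the minimum size gap to the earlier datasets. The constant names are illustrative, and 20 million is only the upper bound quoted above, not the actual size of QM9 or PCQ.

```python
# Figures quoted in the summary above; the names are illustrative.
QM1B_EXAMPLES = 1_070_000_000    # 1.07 billion examples
QM1B_MOLECULES = 1_090_000       # 1.09 million unique molecules
PRIOR_UPPER_BOUND = 20_000_000   # "fewer than 20 million" (upper bound only)

# Average conformers per molecule (the stated cap is 1,000).
avg_conformers = QM1B_EXAMPLES / QM1B_MOLECULES

# Lower bound on how much larger QM1B is than QM9 or PCQ.
min_scale_gap = QM1B_EXAMPLES / PRIOR_UPPER_BOUND

print(f"average conformers per molecule: {avg_conformers:.0f}")
print(f"QM1B is at least {min_scale_gap:.1f}x larger than the prior datasets")
```

The average comes out at roughly 982 conformers per molecule, so most molecules sit close to the 1,000-conformer cap.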

3.

The data is low-resolution DFT — and the creators know it

  • QM1B uses the STO-3G basis set, the smallest Gaussian basis set in common use, and PySCF IPU runs in float32, not float64
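  • Numerical errors in the energy are "similar to that of well-trained neural networks," so energy labels are unreliable as a prediction target; numerical errors in HOMO, LUMO, and HLGap are smaller than the prediction error of neural networks, so those labels are less affected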
  • The authors explicitly advise against benchmarking on QM1B: "It remains unknown whether the ranking of neural architectures on QM1B will agree with the ranking on experimental datasets or higher resolution DFT datasets"
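
To give a feel for why float32 matters at this energy scale, the sketch below computes the gap between adjacent float32 values near a given magnitude, using only the IEEE 754 format. The energy of -10,000 eV is a hypothetical magnitude chosen for illustration, not a number from the datasheet. The sketch covers only the error from storing a single value; errors accumulated inside a DFT calculation can be larger.

```python
import math
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float32_spacing(x: float) -> float:
    """Gap between adjacent float32 values near x (normal, nonzero x)."""
    if x == 0.0:
        raise ValueError("spacing is only computed for nonzero values")
    # frexp returns x = m * 2**e with 0.5 <= |m| < 1; float32 keeps 24
    # significant bits, so neighbouring values differ by 2**(e - 24).
    _, exponent = math.frexp(abs(x))
    return 2.0 ** (exponent - 24)


# Hypothetical total energy in eV, for illustration only.
energy_ev = -10000.0003

spacing = float32_spacing(energy_ev)
rounding_error = abs(to_float32(energy_ev) - energy_ev)

print(f"float32 spacing near {energy_ev} eV: {spacing:.6f} eV")
print(f"rounding error for this value:      {rounding_error:.6f} eV")
```

Near 10,000 eV the float32 grid is about 0.001 eV wide, so every intermediate value in a calculation at that magnitude is rounded at roughly the millielectronvolt level before any algorithmic error is counted.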

4.

Pretrain on QM1B, benchmark on QM9 or PCQ

  • The recommended workflow: train large models on QM1B's massive scale, then fine-tune and benchmark on higher-resolution datasets like QM9 or PCQ
  • No test set is included — a deliberate design choice to prevent misuse as a benchmark; the authors will communicate evolving best practices through their GitHub repo
  • The validation split is made on SMILES strings, ensuring the same molecule never appears in both training and validation sets
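
A split keyed on SMILES strings can be sketched as below. The datasheet summary does not say how the split was implemented, so the hash-based assignment, the function names, and the 10% validation fraction are all assumptions made for illustration. The point is only that every conformer of a molecule follows its SMILES string into the same split.

```python
import hashlib


def split_for(smiles: str, val_fraction: float = 0.1) -> str:
    """Assign a molecule to 'train' or 'val' from its SMILES string alone.

    Hypothetical scheme: hash the SMILES and map it to [0, 1), so the
    assignment is deterministic and identical for every conformer.
    """
    digest = hashlib.sha256(smiles.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "val" if bucket < val_fraction else "train"


def split_examples(examples, val_fraction: float = 0.1):
    """Split (smiles, conformer_id) pairs without leaking molecules."""
    train, val = [], []
    for smiles, conformer_id in examples:
        target = val if split_for(smiles, val_fraction) == "val" else train
        target.append((smiles, conformer_id))
    return train, val


# Toy data: a few molecules with three conformers each.
molecules = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "C1CC1", "CCC", "CO", "C#N"]
examples = [(s, i) for s in molecules for i in range(3)]

train, val = split_examples(examples)
train_smiles = {s for s, _ in train}
val_smiles = {s for s, _ in val}
print("molecules in both splits:", train_smiles & val_smiles)
```

Splitting on individual examples instead would scatter conformers of one molecule across both sets, which would inflate validation scores.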

5.

The dataset was built by two people in 5 days — with known errors

  • Mathiasen generated data while Helal trained SchNet 9M in an iterative loop, biasing QM1B toward SchNet's needs; the authors acknowledge "It is possible that this process further biased QM1B towards SchNet"
  • A software error caused 11-heavy-atom computations to run 3–4× longer than planned, generating excess data; IPU 27 alone produced 5.84 million examples, of which only 1.5 million were used
  • Fewer than 1% of examples were removed by convergence filtering (std < 0.01 eV over the last 5 Kohn-Sham iterations), but the postprocessing code is not open-sourced — the authors intend to rewrite it
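
Because the postprocessing code is not published, the filter below is only a minimal sketch of the stated criterion: a standard deviation below 0.01 eV over the last 5 Kohn-Sham iterations. The function name and the choice of population standard deviation (rather than sample standard deviation) are assumptions, and the energy traces are made-up values.

```python
from statistics import pstdev


def is_converged(energies_ev, window: int = 5, tol_ev: float = 0.01) -> bool:
    """Return True if the last `window` iteration energies have std < tol_ev.

    `energies_ev` holds one energy per Kohn-Sham iteration, in eV.
    Uses the population standard deviation; the original pipeline may
    have used a different estimator.
    """
    if len(energies_ev) < window:
        return False  # too few iterations to apply the criterion
    return pstdev(energies_ev[-window:]) < tol_ev


# Made-up traces for illustration only.
settled = [-100.5, -101.2, -101.300, -101.301, -101.302, -101.302, -101.302]
oscillating = [-101.0, -101.1, -101.0, -101.1, -101.0]

print("settled trace kept:    ", is_converged(settled))
print("oscillating trace kept:", is_converged(oscillating))
```

The settled trace passes and the oscillating one is dropped, which matches the intent of discarding calculations whose energy was still moving when the iteration budget ran out.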

6.

QM1B is open-source, self-contained, and Graphcore-funded

  • Released under Creative Commons Attribution 4.0 International, the same license as GDB11; hosted on Figshare with long-term fees paid by Graphcore
  • The dataset is entirely self-contained — no external dependencies beyond the initial GDB11 SMILES strings
  • The authors accept pull requests and will release diffs for updates, but have no plans to recompute QM1B itself; future work focuses on improving PySCF IPU to enable larger, more accurate datasets

Counter-Argument

7.

The dataset's creators are its only benchmarkers — and they're Graphcore employees

  • QM1B was built by Graphcore Research scientists, funded by Graphcore, and validated against Graphcore's own PySCF IPU implementation on Graphcore hardware — the entire pipeline is a single-vendor artifact
  • The iterative generation-and-training loop, where Mathiasen tweaked parameters based on Helal's SchNet results, explicitly biased the dataset toward one architecture; the authors acknowledge this but provide no quantification of the bias
  • No independent benchmark exists and no test set is provided. The authors' own advice to "pretrain on QM1B and fine-tune/benchmark on QM9 or PCQ" is a hypothesis about transferability, not a demonstrated result, so the dataset's value proposition rests on an unproven assumption from the people who built it

Steelman

8.

The real product isn't the dataset — it's the factory

  • Both the thesis and the counter-argument assume QM1B's value lives in its 1.07 billion rows; they share a hidden premise that a dataset's quality is fixed at release and judged by its contents
  • The authors' own forward-looking statement reveals a different strategy: "Our main focus for updates will be on extending the capabilities of PySCF IPU to allow the community to create larger and more accurate datasets" — the open-source PySCF IPU library is the strategic asset, not the snapshot
  • QM1B is the first proof-of-concept that quantum chemistry at this scale is computationally feasible; the next dataset, built by the community using the same factory, could be larger, higher-resolution, and free of single-vendor bias. The question is not whether QM1B is good enough, but whether it makes the next one inevitable

