qm1b-dataset

github.com

Gist

1. Graphcore spent 5 days and 320 IPUs to compute 1.07 billion quantum chemistry examples, then declined to ship a test set because the data is too low-resolution to trust. QM1B is the largest DFT dataset ever built, and its creators explicitly advise against benchmarking on it.

Logic

2. Prior datasets were too small to reveal scaling laws

QM9 and PCQ each contain fewer than 20 million training examples — orders of magnitude below what neural scaling research demands
QM1B contains 1.07 billion examples across 1.09 million unique molecules, each with up to 1,000 conformers generated by RDKit's ETKDG algorithm
The explicit goal: enable neural scaling law studies for quantum chemistry, a field where data scarcity has been the bottleneck
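
As a rough sanity check on these figures, the average number of examples per molecule can be computed from the two rounded totals quoted above. This is a back-of-the-envelope sketch: the variable names are illustrative, and the inputs are the rounded values from the text, not exact counts.

```python
# Rounded totals quoted in the text (not exact counts).
total_examples = 1.07e9     # DFT examples in QM1B
unique_molecules = 1.09e6   # unique molecules
max_conformers = 1000       # stated upper bound on conformers per molecule

# Average number of examples (conformers) per molecule.
avg_per_molecule = total_examples / unique_molecules

# Fraction of the theoretical maximum, i.e. every molecule at the 1,000 cap.
fill_fraction = avg_per_molecule / max_conformers

print(f"average examples per molecule: {avg_per_molecule:.0f}")
print(f"fraction of the 1,000-conformer cap: {fill_fraction:.2f}")
```

With these rounded inputs the average lands just under the cap, which suggests that most molecules come with close to the full 1,000 conformers.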

3. The data is low-resolution DFT, and the creators know it

QM1B uses the STO-3G basis set, the smallest Gaussian basis set in common use, and PySCF IPU runs in float32, not float64
Energy numerical errors are "similar to that of well-trained neural networks," making energy prediction unreliable; numerical errors for HOMO, LUMO, and HLGap are smaller than the prediction error of such networks
The authors explicitly advise against benchmarking on QM1B: "It remains unknown whether the ranking of neural architectures on QM1B will agree with the ranking on experimental datasets or higher resolution DFT datasets"
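
To give a feel for why float32 matters here, the sketch below rounds a single number to float32 and back using only the Python standard library. The energy value is hypothetical, chosen only to have a large magnitude in eV; it is not taken from QM1B. The example shows representation error alone, not the accumulated error of a full DFT calculation in PySCF IPU.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float32_spacing(x: float) -> float:
    """Gap between float32(x) and the next float32 of larger magnitude."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    next_value = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return abs(next_value - to_float32(x))


# Hypothetical total energy in eV (an assumption for illustration only).
energy_ev = -10000.0004

stored = to_float32(energy_ev)
print("float64 value :", energy_ev)
print("float32 value :", stored)
print("rounding error:", abs(stored - energy_ev), "eV")
print("float32 spacing near this value:", float32_spacing(energy_ev), "eV")
```

At this magnitude, neighbouring float32 values are roughly a thousandth of an eV apart, so any single stored number can already be off by about half that before any arithmetic is done.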

4. Pretrain on QM1B, benchmark on QM9 or PCQ

The recommended workflow: train large models on QM1B's massive scale, then fine-tune and benchmark on higher-resolution datasets like QM9 or PCQ
No test set is included — a deliberate design choice to prevent misuse as a benchmark; the authors will communicate evolving best practices through their GitHub repo
The validation split is made on SMILES strings, ensuring the same molecule never appears in both training and validation sets
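
One way to get a split with this property is to decide the split per molecule rather than per example, so that all conformers of a molecule land on the same side. The sketch below does this with a hash of the SMILES string. The hashing scheme, the 10% validation fraction, the function name, and the toy records are all assumptions for illustration; the source does not describe how the authors implemented their split. It also assumes the SMILES strings are already in one consistent (canonical) form.

```python
import hashlib


def split_for_smiles(smiles: str, valid_fraction: float = 0.1) -> str:
    """Assign a molecule to 'train' or 'valid' from a hash of its SMILES string."""
    digest = hashlib.sha256(smiles.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "valid" if bucket < valid_fraction else "train"


# Toy records: (SMILES, conformer index). Real rows would also carry
# geometries and DFT labels.
records = [
    ("CCO", 0), ("CCO", 1), ("CCO", 2),
    ("c1ccccc1", 0), ("c1ccccc1", 1),
    ("CC(=O)O", 0),
]

splits = {"train": [], "valid": []}
for smiles, conformer_id in records:
    splits[split_for_smiles(smiles)].append((smiles, conformer_id))

train_molecules = {s for s, _ in splits["train"]}
valid_molecules = {s for s, _ in splits["valid"]}
print("molecules in both splits:", train_molecules & valid_molecules)
```

Because the assignment depends only on the SMILES string, every conformer of a molecule follows it into the same split, and the overlap between the two sets of molecules is always empty.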

5. The dataset was built by two people in 5 days, with known errors

Mathiasen generated data while Helal trained SchNet 9M in an iterative loop, biasing QM1B toward SchNet's needs; the authors acknowledge "It is possible that this process further biased QM1B towards SchNet"
A software error caused 11-heavy-atom computations to run 3–4× longer than planned, generating excess data; IPU 27 alone produced 5.84 million examples, of which only 1.5 million were used
Convergence filtering kept only examples with std < 0.01 eV over the last 5 Kohn-Sham iterations and removed fewer than 1% of examples, but the postprocessing code is not open-sourced; the authors intend to rewrite it
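
Since the postprocessing code is not available, the following is only a minimal sketch of what a filter matching the stated criterion could look like. It assumes the quantity being checked is the per-iteration total energy in eV and that a population standard deviation is used; neither detail is confirmed by the source. The function name and the toy trajectories are hypothetical.

```python
import statistics
from typing import Sequence

STD_THRESHOLD_EV = 0.01   # threshold quoted in the text
LAST_N_ITERATIONS = 5     # window quoted in the text


def is_converged(energies_ev: Sequence[float]) -> bool:
    """True if the std of the last 5 per-iteration energies is below 0.01 eV."""
    if len(energies_ev) < LAST_N_ITERATIONS:
        # Assumption: a run with too few iterations counts as not converged.
        return False
    tail = energies_ev[-LAST_N_ITERATIONS:]
    return statistics.pstdev(tail) < STD_THRESHOLD_EV


# Toy Kohn-Sham energy trajectories in eV (made-up numbers).
trajectories = {
    "settles": [-100.0, -104.2, -104.9, -105.0, -105.001, -105.002, -105.002, -105.002],
    "oscillates": [-100.0, -104.0, -105.5, -104.5, -105.4, -104.6, -105.3],
}

kept = [name for name, energies in trajectories.items() if is_converged(energies)]
print("kept after filtering:", kept)
```

Only the trajectory whose last five energies agree to within a few thousandths of an eV passes the filter; the oscillating one is dropped.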

6. QM1B is open-source, self-contained, and Graphcore-funded

Released under Creative Commons Attribution 4.0 International, the same license as GDB11; hosted on Figshare with long-term fees paid by Graphcore
The dataset is entirely self-contained — no external dependencies beyond the initial GDB11 SMILES strings
The authors accept pull requests and will release diffs for updates, but have no plans to recompute QM1B itself; future work focuses on improving PySCF IPU to enable larger, more accurate datasets

Counter-Argument

7. The dataset's creators are its only benchmarkers, and they're Graphcore employees

QM1B was built by Graphcore Research scientists, funded by Graphcore, and validated against Graphcore's own PySCF IPU implementation on Graphcore hardware — the entire pipeline is a single-vendor artifact
The iterative generation-and-training loop, where Mathiasen tweaked parameters based on Helal's SchNet results, explicitly biased the dataset toward one architecture; the authors acknowledge this but provide no quantification of the bias
No independent benchmark exists, no test set is provided, and the authors' own advice to "pretrain on QM1B and fine-tune/benchmark on QM9 or PCQ" is a hypothesis about transferability, not a demonstrated result — the dataset's value proposition rests on an unproven assumption from the people who built it

Steelman

8. The real product isn't the dataset; it's the factory

Both the thesis and the counter-argument assume QM1B's value lives in its 1.07 billion rows; they share a hidden premise that a dataset's quality is fixed at release and judged by its contents
The authors' own forward-looking statement reveals a different strategy: "Our main focus for updates will be on extending the capabilities of PySCF IPU to allow the community to create larger and more accurate datasets" — the open-source PySCF IPU library is the strategic asset, not the snapshot
QM1B is the first proof-of-concept that quantum chemistry at this scale is computationally feasible; the next dataset, built by the community using the same factory, will be larger, higher-resolution, and free of single-vendor bias — the question is not whether QM1B is good enough, but whether it makes the next one inevitable

Original

qm1b-dataset/DATASHEET.md at main · graphcore-research/qm1b-dataset
