
OpenQDC: Open Quantum Data Commons

Cristian Gabellini, Nikhil Shenoy, Stephan Thaler, Semih Canturk, Daniel McNeela, Dominique Beaini, Michael Bronstein, Prudencio Tossou

TL;DR

OpenQDC tackles the fragmentation and inaccessibility of the quantum-mechanics (QM) data needed to train machine-learning interatomic potentials (MLIPs). It delivers a unified repository of 37 QM datasets covering 400 million geometries and 250+ methods, with a Python library that normalizes energies and forces and provides streamlined data loading. Benchmarking across TorchMD-Net, DimeNet, and SchNet reveals architecture-dependent performance and emphasizes the need for dataset-aware model design. By democratizing access and standardizing data workflows, OpenQDC aims to accelerate MLIP development and broaden adoption in the molecular dynamics community.

Abstract

Machine Learning Interatomic Potentials (MLIPs) are a highly promising alternative to force fields for molecular dynamics (MD) simulations, offering precise and rapid energy and force calculations. However, the Quantum-Mechanical (QM) datasets crucial for training MLIPs are fragmented across various repositories, hindering accessibility and model development. We introduce the openQDC package, which consolidates 37 QM datasets, spanning over 250 quantum methods and 400 million geometries, into a single, accessible resource. These datasets are preprocessed and standardized for MLIP training, and cover a wide range of chemical elements and interactions relevant to organic chemistry. OpenQDC includes tools for normalization and integration, accessible via Python. Experiments with well-known architectures such as SchNet, TorchMD-Net, and DimeNet reveal challenges for those architectures and constitute a leaderboard to accelerate benchmarking and guide the development of novel algorithms. Continuously adding datasets to OpenQDC will democratize QM dataset access, foster more collaboration and innovation, enhance MLIP development, and support the adoption of MLIPs in the MD field.
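To make the idea of energy normalization concrete, here is a minimal sketch of one common approach: subtracting isolated-atom reference energies from a total energy to obtain an atomization energy. This is not the OpenQDC API. The function name, the `ISOLATED_ATOM_ENERGY` table, and the carbon and oxygen values in it are hypothetical placeholders chosen for illustration, and the abstract does not state which normalization schemes the library actually implements.

```python
# Hypothetical reference table: isolated-atom energies in Hartree, keyed by
# atomic number. The C and O values are illustrative placeholders; real
# references depend on the QM method and basis set.
ISOLATED_ATOM_ENERGY = {1: -0.5, 6: -37.8, 8: -75.0}


def atomization_energy(atomic_numbers, total_energy, reference=ISOLATED_ATOM_ENERGY):
    """Subtract the sum of isolated-atom energies from a total energy.

    atomic_numbers: iterable of ints, one per atom in the geometry.
    total_energy:   total QM energy of the geometry, in the same unit
                    as the reference table.
    """
    offset = sum(reference[int(z)] for z in atomic_numbers)
    return total_energy - offset


# Usage: a water-like geometry (O, H, H) with a made-up total energy.
water = [8, 1, 1]
e_atomization = atomization_energy(water, total_energy=-76.4)
```

Removing the per-atom offset in this way leaves a target that depends on bonding rather than on composition alone, which is typically easier for a model to fit when datasets computed with different methods are combined.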

Paper Structure

This paper contains 17 sections, 1 equation, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Overview of a dataset and its structure.
  • Figure 2: Inference time over the potential datasets.