The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide Electrocatalysts

Richard Tran, Janice Lan, Muhammed Shuaibi, Brandon M. Wood, Siddharth Goyal, Abhishek Das, Javier Heras-Domingo, Adeesh Kolluru, Ammar Rizvi, Nima Shoghi, Anuroop Sriram, Felix Therrien, Jehad Abed, Oleksandr Voznyy, Edward H. Sargent, Zachary Ulissi, C. Lawrence Zitnick

TL;DR

The Open Catalyst 2022 (OC22) dataset provides a large, open, oxide-focused benchmark for machine learning in catalysis, addressing a key data gap for oxide surfaces and adsorbates. It defines three total-energy–oriented tasks, benchmarks several cutting-edge GNNs (notably GemNet-OC), and demonstrates the benefits of joint training with OC20 data and transfer learning. The work reveals strong performance of state-of-the-art models on oxide systems, while highlighting challenges in long-range interactions, magnetism, and higher-level theory, and it introduces practical tools such as linear energy referencing and Gibbs-adapted energy corrections. Together, OC22 offers a comprehensive platform and public leaderboard to accelerate ML-driven oxide catalyst discovery and to explore broader surface energetic properties beyond adsorption energies.

Abstract

The development of machine learning models for electrocatalysts requires a broad set of training data to enable their use across a wide variety of materials. One class of materials that currently lacks sufficient training data is oxides, which are critical for the development of oxygen evolution reaction (OER) catalysts. To address this, we developed the OC22 dataset, consisting of 62,331 DFT relaxations (~9,854,504 single-point calculations) across a range of oxide materials, coverages, and adsorbates. We define generalized total energy tasks that enable property prediction beyond adsorption energies; we test baseline performance of several graph neural networks; and we provide pre-defined dataset splits to establish clear benchmarks for future efforts. In the most general task, GemNet-OC sees a ~36% improvement in energy predictions when combining the chemically dissimilar OC20 and OC22 datasets via fine-tuning. Similarly, we achieved a ~19% improvement in total energy predictions on OC20 and a ~9% improvement in force predictions on OC22 when using joint training. We demonstrate the practical utility of a top-performing model by capturing literature adsorption energies and important OER scaling relationships. We expect OC22 to provide an important benchmark for models seeking to incorporate intricate long-range electrostatic and magnetic interactions in oxide surfaces. The dataset and baseline models are open-sourced, and a public leaderboard is available to encourage continued community development on the total energy tasks and data.
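To illustrate why total energy tasks are more general than adsorption energy tasks, the sketch below shows how an adsorption energy can be recovered from three total energies using the standard relation E_ads = E(adsorbate+slab) - E(slab) - E(adsorbate reference). It is a minimal sketch, not the paper's implementation: the function name, argument names, and numerical values are hypothetical, and the choice of adsorbate reference energy is left to the caller.

```python
def adsorption_energy(e_adslab: float, e_slab: float, e_adsorbate_ref: float) -> float:
    """Return E_ads = E(adsorbate+slab) - E(slab) - E(adsorbate reference).

    All energies are assumed to be total energies in eV, for example from DFT
    or from a model trained to predict total energies. The reference scheme
    for the adsorbate is an assumption left to the caller.
    """
    return e_adslab - e_slab - e_adsorbate_ref


# Illustrative numbers only; they are not taken from OC22.
e_ads = adsorption_energy(e_adslab=-310.25, e_slab=-300.00, e_adsorbate_ref=-9.50)
print(f"E_ads = {e_ads:.2f} eV")
```

Because the slab and adsorbate+slab energies enter as separate terms, the same total energy predictions could in principle also be reused for other quantities, such as comparing surface terminations.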

Paper Structure (26 sections, 17 equations, 14 figures, 20 tables)

Figures (14)

  • Figure 1: Overview of the contents and impact areas of the OC22 dataset. The water nucleophilic attack mechanism is highlighted for the OER, with H2O and O2 as reactants and products, respectively. Inset images are a random sample of the dataset.
  • Figure 2: Construction of rutile (110) slabs and adsorbate+slabs. (a) Dashed lines indicate the different possible terminations ($T_{1}$, $T_{2}$, and $T_{3}$). The slab is symmetric about $T_{3}$. (b) The $T_{2}$-terminated surface with its periodic boundary (blue dashed lines) contains 8 oxygen sites. Random removal of 3 surface oxygen atoms (dark red) creates vacancy defects (transparent).
  • Figure 3: Overview of the adsorbate-specific placement strategies. Adsorbates include H*, O*, N*, C*, OOH*, OH*, OH2*, O2*, and CO* (left). Adsorbates can bind either to undercoordinated surface metals (first row of strategies) or to surface oxygen to form new intermediates (second row).
  • Figure 4: A typical workflow, motivating the need for total energy models beyond adsorption energies. Total energy models would allow one to study all parts of this workflow, not just the final relaxation as adsorption energy models do. (a) A bulk structure is selected from material datasets and a surface is created. (b) Surface terminations are enumerated and studied to identify the most stable termination; surface Pourbaix diagrams are created and used to make this decision. (c) Only after the most stable termination is identified is an adsorbate placed. (d) The adsorbate+slab system is relaxed and the referenced adsorption energy is computed.
  • Figure 5: The various training strategies explored in OC22. A. The OC22-only strategy involves using just OC22 for the proposed tasks. B. Joint training refers to models trained on both OC20 and OC22 simultaneously. C. In fine-tuning, models pretrained on OC20 are used as starting points to train on just OC22.
  • ...and 9 more figures