CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning

Ulrik Friis-Jensen; Frederik L. Johansen; Andy S. Anker; Erik B. Dam; Kirsten M. Ø. Jensen; Raghavendra Selvan

TL;DR

This work introduces CHILI, a pair of chemically-informed large-scale graph datasets for inorganic nanomaterials, addressing a gap in graph ML where periodicity, symmetry, and very large graphs hinder modeling. It provides two open datasets, CHILI-3K and CHILI-100K, generated from unit-cell expansions and COD-derived structures, each with rich node and edge features, crystallographic metadata, and simulated scattering data to support 11 property-prediction and 6 structure-generation tasks. A broad benchmarking study across multiple GNN backbones shows that several tasks are tractable but many remain challenging, especially for diverse, large-scale nanomaterial graphs; EdgeCNN often yields the strongest performance, and the structure-generation tasks expose limitations in current generative graph ML approaches. By offering these resources and baselines, CHILI aims to catalyze methodological advances in graph ML for inorganic nanomaterials and spur progress toward scalable, structure-aware generative models with practical materials applications.

Abstract

Advances in graph machine learning (ML) have been driven by applications in chemistry, as graphs have remained the most expressive representations of molecules. While early graph ML methods focused primarily on small organic molecules, the scope of graph ML has recently expanded to include inorganic materials. Modelling the periodicity and symmetry of inorganic crystalline materials poses unique challenges, which existing graph ML methods are unable to address. Moving to inorganic nanomaterials increases complexity, as the number of nodes within each graph can span a broad range ($10$ to $10^5$). The bulk of existing graph ML focuses on characterising molecules and materials by predicting target properties with graphs as input. However, the most exciting applications of graph ML will be in its generative capabilities, which are currently not on par with those in other domains such as images or text. We invite the graph ML community to address these open challenges by presenting two new chemically-informed large-scale inorganic (CHILI) nanomaterials datasets: a medium-scale dataset (with overall >6M nodes, >49M edges) of mono-metallic oxide nanomaterials generated from 12 selected crystal types (CHILI-3K) and a large-scale dataset (with overall >183M nodes, >1.2B edges) of nanomaterials generated from experimentally determined crystal structures (CHILI-100K). We define 11 property prediction tasks and 6 structure prediction tasks, which are of special interest for nanomaterial research. We benchmark the performance of a wide array of baseline methods and use these benchmarking results to highlight areas that need future work. To the best of our knowledge, CHILI-3K and CHILI-100K are the first open-source nanomaterial datasets of this scale, both at the level of individual graphs and of the dataset as a whole, and the only nanomaterials datasets with high structural and elemental diversity.

Paper Structure (28 sections, 15 figures, 8 tables)

Introduction
CHILI Datasets
CHILI-3K
CHILI-100K
Data structure
Dataset statistics
Related Work
Data sources
Graph ML tasks
Experiments and benchmarking
Results and discussion
Conclusion
Data generation
CIF construction
Crystallography Open Database query
...and 13 more sections

Figures (15)

Figure 1: High-level schematic showing the five stages involved in the creation of the CHILI datasets: (1) Querying and cleaning CIFs. (2) Extraction of crystal unit cells. (3) Expansion of unit cells into supercells and subsequent centering. (4) Cutting of nanoparticles into different sizes, padding of edge environments following the described rules, and conversion into graphs with node and edge features. (5) Generation of graph-level properties from CIF (crystal type, crystal system, spacegroup, etc.) and simulation of scattering data.
Figure 2: The unit cells of the 12 crystal types present in the CHILI-3K dataset. For all shown structures, copper (Cu) is the metal. The unit cells are visualized using VESTA [Momma2008VESTA] with the polyhedral style. The unit cells are shown from the standard orientation of a crystal shape, which is one of the 7 view options in VESTA.
Figure 3: The periodic table with each element colored according to whether it is included in CHILI-3K (blue), CHILI-100K (orange), or neither (light grey). The shade of the color indicates whether the element is considered a metal (bright) or a non-metal (muted).
Figure 4: a) Distribution of crystal systems in the CHILI-3K dataset (blue) and the CHILI-100K dataset (orange). b) Distribution of the number of unique elements in each structure for the CHILI-3K dataset (blue) and the CHILI-100K dataset (orange). The inset plot shows 6 and 7 elements at a more appropriate y-axis scale. c) Distribution of the size of the generated nanoparticles for the CHILI-3K dataset (blue) and the CHILI-100K dataset (orange).
Figure 5: Distribution of crystal types in the CHILI-3K dataset.
...and 10 more figures
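
Stages (3) and (4) of the Figure 1 caption can be illustrated with a small, hedged sketch. It is not the authors' pipeline: the function names, the toy simple-cubic cell, the spherical cut, and the plain distance cutoff for edges are all assumptions made for illustration, and the padding rules and the actual node and edge features of CHILI are not reproduced here.

```python
import itertools
import math


def build_supercell(frac_coords, lattice, reps):
    """Tile a unit cell `reps` times along each lattice vector.

    frac_coords: fractional atom coordinates inside one unit cell.
    lattice: three lattice vectors (rows) in Cartesian coordinates.
    Returns Cartesian positions of all atoms in the supercell.
    """
    positions = []
    for i, j, k in itertools.product(range(reps), repeat=3):
        for fx, fy, fz in frac_coords:
            u, v, w = fx + i, fy + j, fz + k
            positions.append(tuple(
                u * lattice[0][d] + v * lattice[1][d] + w * lattice[2][d]
                for d in range(3)
            ))
    return positions


def center(positions):
    """Shift positions so that their centroid sits at the origin."""
    n = len(positions)
    centroid = [sum(p[d] for p in positions) / n for d in range(3)]
    return [tuple(p[d] - centroid[d] for d in range(3)) for p in positions]


def cut_sphere(positions, radius):
    """Keep atoms within `radius` of the origin (assumed spherical particle)."""
    origin = (0.0, 0.0, 0.0)
    return [p for p in positions if math.dist(p, origin) <= radius]


def distance_graph(positions, cutoff):
    """Connect atom pairs closer than `cutoff`; the distance is the edge feature."""
    edges = []
    for (a, pa), (b, pb) in itertools.combinations(enumerate(positions), 2):
        d = math.dist(pa, pb)
        if d <= cutoff:
            edges.append((a, b, d))
    return edges


# Toy example: one atom per simple-cubic cell with lattice constant 1.0.
toy_lattice = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
supercell = center(build_supercell([(0.0, 0.0, 0.0)], toy_lattice, reps=3))
particle = cut_sphere(supercell, radius=1.1)
graph_edges = distance_graph(particle, cutoff=1.1)
```

In this toy case the 3×3×3 supercell holds 27 atoms, the cut keeps the central atom and its six nearest neighbours, and the graph has six edges. The pairwise loop is quadratic in the number of atoms, so graphs at the upper end of the range quoted in the abstract ($10^5$ nodes) would need a spatial index such as a cell list or k-d tree instead.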
