How big is Big Data?

Daniel T. Speckhard, Tim Bechtel, Luca M. Ghiringhelli, Martin Kuban, Santiago Rigamonti, Claudia Draxl

TL;DR

The paper asks what makes data 'big' in materials science, looking beyond sheer volume to data quality and infrastructure. It covers five topics: model transferability between databases, similarity-based data verification, the expressivity of cluster expansion (CE), model complexity, and high-throughput infrastructure. It finds that even large databases can struggle to generalize due to limited diversity, shows how similarity measures can reveal data veracity and help select homogeneous data subsets, demonstrates that nonlinear feature spaces can substantially boost CE expressivity, and argues that model class often drives performance more than parameter count. It also highlights the substantial infrastructure required for high-throughput DFT datasets and neural-architecture searches, emphasizing the need for diverse, well-curated datasets and scalable tooling. Overall, the work motivates ongoing efforts in data diversification, cross-dataset standardization, and cost-aware infrastructure planning to enable robust, transferable ML models in the field.

Abstract

Big data has ushered in a new wave of predictive power using machine learning models. In this work, we assess what 'big' means in the context of typical materials-science machine-learning problems. This concerns not only data volume, but also data quality and veracity, as well as infrastructure issues. With selected examples, we ask (i) how models generalize to similar datasets, (ii) how high-quality datasets can be gathered from heterogeneous sources, (iii) how the feature set and complexity of a model can affect expressivity, and (iv) what infrastructure is required to create larger datasets and train models on them. In sum, we find that big data presents unique challenges along very different aspects that should serve to motivate further work.

Paper Structure (12 sections, 5 equations, 6 figures)

Figures (6)

  • Figure 1: Predicted versus calculated formation energies for AFLOW and Materials Project (MP) data. Top-left: model trained and tested on AFLOW data; top-right: model trained on AFLOW and tested on MP; bottom-left: model trained on MP and tested on AFLOW; bottom-right: model trained and tested on MP.
  • Figure 2: Top: Distribution of formation energies in the two datasets containing 50,652 (AFLOW) and 54,438 (MP) unique composition-spacegroup pairs. Bottom: Distribution of formation energies for the 8,591 composition-spacegroup pairs that are present in both datasets.
  • Figure 3: DOS similarity matrices for h-BN obtained with different basis-set sizes and $k$-meshes, and two different exchange-correlation (xc) functionals. The data are sorted by xc functional, where low indices ($\leq$ 71) correspond to the local-density approximation (LDA) and high indices ($>$ 71) to the generalized-gradient functional PBE. The color code indicates the similarity coefficient. The bottom panel shows the number of $k$-points (blue) and the number of basis functions (orange).
  • Figure 4: Left: Fitting errors for standard (blue) and nonlinear CE models (orange, red) based on a pool of 27 clusters (top) and 176 clusters (bottom). Right: Predicted versus target bandgaps for the standard CE model with 27 clusters and the degree-3 nonlinear CE model with 150 features (top), and for the standard and nonlinear CE models with 150 clusters/features (bottom).
  • Figure 5: Workflow for high-throughput geometry optimization, followed by a bandstructure calculation. Crystal structures are pulled from a NOMAD Oasis instance and stored in an ASE database (DB). The structures are read by the Python API excitingtools, and a geometry optimization is carried out, consisting of multiple single-point ground-state calculations with exciting. For the resulting relaxed geometry, a bandstructure calculation is performed. All output files are uploaded to a NOMAD Oasis instance.
  • ...and 1 more figure
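
The overlap count in the Figure 2 caption (8,591 composition-spacegroup pairs shared by AFLOW and MP) can be illustrated with a small set-intersection sketch. This is not the authors' code: the record layout (`composition`, `spacegroup` keys), the helper names `unique_pairs` and `overlap_report`, and the toy entries are all assumptions made for illustration. It also assumes that compositions are already written as consistently reduced formula strings. Only the three counts at the end are taken from the caption.

```python
from typing import Any, Dict, Iterable, Mapping, Set, Tuple

# Hypothetical key type: (reduced composition string, space-group number).
Pair = Tuple[str, int]


def unique_pairs(entries: Iterable[Mapping[str, Any]]) -> Set[Pair]:
    """Collapse database entries to their unique composition-spacegroup pairs."""
    return {(str(e["composition"]), int(e["spacegroup"])) for e in entries}


def overlap_report(
    entries_a: Iterable[Mapping[str, Any]],
    entries_b: Iterable[Mapping[str, Any]],
) -> Dict[str, float]:
    """Count unique pairs per dataset and the pairs present in both."""
    pairs_a = unique_pairs(entries_a)
    pairs_b = unique_pairs(entries_b)
    shared = pairs_a & pairs_b  # set intersection
    return {
        "unique_a": len(pairs_a),
        "unique_b": len(pairs_b),
        "shared": len(shared),
        # Guard against empty datasets to avoid division by zero.
        "shared_fraction_a": len(shared) / len(pairs_a) if pairs_a else 0.0,
        "shared_fraction_b": len(shared) / len(pairs_b) if pairs_b else 0.0,
    }


# Toy records (made up, not taken from AFLOW or MP).
toy_a = [
    {"composition": "NaCl", "spacegroup": 225},
    {"composition": "NaCl", "spacegroup": 225},  # duplicate entry, counted once
    {"composition": "SiO2", "spacegroup": 152},
    {"composition": "BN", "spacegroup": 194},
]
toy_b = [
    {"composition": "NaCl", "spacegroup": 225},
    {"composition": "SiO2", "spacegroup": 154},  # same composition, other space group
    {"composition": "BN", "spacegroup": 194},
    {"composition": "MgO", "spacegroup": 225},
]
report = overlap_report(toy_a, toy_b)

# Counts quoted in the Figure 2 caption.
N_AFLOW, N_MP, N_SHARED = 50_652, 54_438, 8_591
caption_fraction_aflow = N_SHARED / N_AFLOW
caption_fraction_mp = N_SHARED / N_MP

print(report)
print(f"shared / AFLOW = {caption_fraction_aflow:.3f}")
print(f"shared / MP    = {caption_fraction_mp:.3f}")
```

With the caption's counts, the shared pairs amount to roughly 17% of the AFLOW pairs and roughly 16% of the MP pairs, so most of each dataset has no direct counterpart in the other.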