How big is Big Data?
Daniel T. Speckhard, Tim Bechtel, Luca M. Ghiringhelli, Martin Kuban, Santiago Rigamonti, Claudia Draxl
TL;DR
The paper examines what it means for data to be 'big' in materials science, considering data volume, quality, and infrastructure through five topics: model transferability, similarity-based data verification, the expressivity of cluster expansion (CE), model complexity, and high-throughput infrastructure. It finds that models trained on even large databases can generalize poorly when the data lack diversity, shows how similarity measures can reveal data veracity and guide the selection of homogeneous data subsets, demonstrates that nonlinear feature spaces can substantially boost the expressivity of cluster expansion, and argues that model class often drives performance more than sheer parameter count. It also highlights the substantial infrastructure required for high-throughput DFT datasets and neural-architecture searches, emphasizing the need for diverse, well-curated datasets and scalable tooling to realize the potential of big data in materials science. Overall, the work motivates ongoing efforts in data diversification, cross-dataset standardization, and cost-aware infrastructure planning to enable robust, transferable ML models in the field.
Abstract
Big data has ushered in a new wave of predictive power using machine learning models. In this work, we assess what 'big' means in the context of typical materials-science machine-learning problems. This concerns not only data volume, but also data quality and veracity, as well as infrastructure issues. With selected examples, we ask (i) how models generalize to similar datasets, (ii) how high-quality datasets can be gathered from heterogeneous sources, (iii) how the feature set and complexity of a model can affect expressivity, and (iv) what infrastructure is needed to create larger datasets and train models on them. In sum, we find that big data presents unique challenges in very different respects, which should motivate further work.
