Lessons from the trenches on evaluating machine-learning systems in materials science

Nawaf Alampara, Mara Schilling-Wilhelmi, Kevin Maik Jablonka

TL;DR

This paper argues that evaluating machine learning systems in materials science cannot rely on a single benchmark or metric, because measurement is inherently shaped by design choices and data provenance. It formalizes evaluation around estimands, estimators, and estimates, highlighting the risk of phantom progress when real-world relevance is ignored. The authors advocate evaluation cards to document measurement decisions, transparency gaps, and tradeoffs, and they survey a spectrum of evaluation approaches from traditional benchmarks to red teaming and deployment studies. By outlining frontiers specific to materials science (e.g., multiobjective metrics and synthesizability) and general challenges (e.g., data generation processes and benchmark maintenance), the work provides a roadmap for more reliable, transferable progress in AI-assisted materials discovery and potentially other scientific domains.

Abstract

Measurements are fundamental to knowledge creation in science, enabling consistent sharing of findings and serving as the foundation for scientific discovery. As machine learning systems increasingly transform scientific fields, the question of how to effectively evaluate these systems becomes crucial for ensuring reliable progress. In this review, we examine the current state and future directions of evaluation frameworks for machine learning in science. We organize the review around a broadly applicable framework for evaluating machine learning systems through the lens of statistical measurement theory, using materials science as our primary context for examples and case studies. We identify key challenges common across machine learning evaluation such as construct validity, data quality issues, metric design limitations, and benchmark maintenance problems that can lead to phantom progress when evaluation frameworks fail to capture real-world performance needs. By examining both traditional benchmarks and emerging evaluation approaches, we demonstrate how evaluation choices fundamentally shape not only our measurements but also research priorities and scientific progress. These findings reveal the critical need for transparency in evaluation design and reporting, leading us to propose evaluation cards as a structured approach to documenting measurement choices and limitations. Our work highlights the importance of developing a more diverse toolbox of evaluation techniques for machine learning in materials science, while offering insights that can inform evaluation practices in other scientific domains where similar challenges exist.
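
The estimand, estimator, and estimate distinction can be made concrete with a small sketch. Everything in it is a hypothetical assumption made for illustration and is not taken from the paper: the "population" of materials, the ground-truth function `true_property`, and the stand-in model `model_prediction` are all invented. The estimand is the mean absolute error over the full target population, the estimator is the same error computed on a finite test set, and the estimate is the number that results.

```python
import random
import statistics

# Hypothetical ground truth and model; both are illustrative assumptions.
def true_property(x):
    return x ** 2

def model_prediction(x):
    return x

def mean_absolute_error(xs):
    """Estimator: mean absolute error of the model on the given inputs."""
    return statistics.fmean([abs(model_prediction(x) - true_property(x)) for x in xs])

# Toy target population: a grid of descriptor values in [0, 1].
POPULATION = [i / 1000 for i in range(1001)]

# Estimand: the quantity we actually care about (error on the whole population).
# It is only computable here because the population is synthetic.
ESTIMAND = mean_absolute_error(POPULATION)

# Estimate 1: the estimator applied to a random test set drawn from the population.
rng = random.Random(0)
ESTIMATE_RANDOM = mean_absolute_error(rng.sample(POPULATION, 200))

# Estimate 2: the same estimator applied to a test set with skewed provenance
# (only a narrow region of the descriptor space was ever measured).
ESTIMATE_BIASED = mean_absolute_error([x for x in POPULATION if x < 0.2])

if __name__ == "__main__":
    print(f"estimand (population MAE):   {ESTIMAND:.4f}")
    print(f"estimate (random test set):  {ESTIMATE_RANDOM:.4f}")
    print(f"estimate (biased test set):  {ESTIMATE_BIASED:.4f}")
```

In this toy setting the biased test set understates the population error even though the estimator itself is unchanged, which is one way data provenance can produce an estimate that does not reflect the estimand.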

Paper Structure

This paper contains 48 sections, 5 figures.

Figures (5)

  • Figure 1: ChemBench mirza2024largelanguagemodelssuperhuman ranking based on different scoring metrics. All metrics are sums, weighted sums, or maximum values over all multiple-choice questions. The weighted sums are calculated by taking the manually rated difficulty (basic, intermediate, advanced) of each question into account. For equal weighting, all categories are weighted evenly, regardless of the number of questions. The metric all correct is a binary metric indicating whether a given answer is completely correct. For normalized Hamming (max), the normalized maximum value of the Hamming loss of each model was taken. We find that the ranking of models changes if we change the metric, or even just the aggregation, which showcases the importance of proper and transparent design of evaluation suites.
  • Figure 2: Estimand, estimators, estimate: The diagram illustrates a conceptual framework for machine learning system evaluation in materials science, structured around three key components outlined in this article. Often, both estimand and estimator depend on data. This highlights how the boundaries between what we measure and how we measure it are not always clearly delineated in the evaluation of complex machine learning systems for materials science.
  • Figure 3: Common materials science benchmarks and other estimators on the continuum from representational to pragmatic: To visualize this conceptual distinction, we heuristically positioned selected benchmarks along two axes: the horizontal axis reflects how representational (i.e., measuring intrinsic, task-independent properties) versus pragmatic (i.e., task-constructed, decision-driven measures) the evaluation is intended to be; the vertical axis distinguishes benchmarks from other estimators or evaluation tools. The classification is based on qualitative assessment by the authors and should be interpreted as illustrative rather than definitive. For simplicity, we group the evaluation frameworks in quadrants. In reality, however, they live on a continuum. Representational benchmarks include EGraFFBench https://doi.org/10.48550/arxiv.2310.02428, JARVIS Leaderboard https://doi.org/10.5281/zenodo.14212326 Choudhary_2024, MassSpecGym https://doi.org/10.48550/arxiv.2410.23326, MatBench Dunn_2020, MatDeepLearn MatDeepLearn, and MD17 Chmiela2017. On the pragmatic end of the spectrum, relevant benchmarks include ChemBench mirza2024largelanguagemodelssuperhuman, GPQA https://doi.org/10.48550/arxiv.2311.12022, LAB-Bench https://doi.org/10.48550/arxiv.2407.10362, MaScQA https://doi.org/10.48550/arxiv.2308.09115, MatBench Discovery riebesell2024matbenchdiscoveryframework, and matbench-genmetrics Baird2024. Other representational estimators include the Jakob et al. Challenge Jakob2025, MaCBench alampara2024probinglimitationsmultimodallanguage, MatText https://doi.org/10.48550/arxiv.2406.17295, MLIP Arena Chiang_MLIP_Arena, OCP Challenge Chanussot_2021, OpenDAC Sriram_2024, and the TEA Challenge Poltavsky_2025. On the pragmatic side, an additional estimator is the CSP blind test Lommerse2000.
  • Figure 4: Different approaches for evaluating ML systems. This figure compares five evaluation approaches across five key dimensions. Traditional benchmarks often offer low resource intensity, good scalability, automation potential, and numerical reducibility, but often lack real-world applicability. Challenges and competitions excel mainly in real-world applicability. Red teaming and capability discovery score moderately in automation potential and strongly in real-world applicability. Real-world deployment studies provide the highest real-world applicability but perform poorly on the other dimensions. Ablation studies and systematic testing show strengths in scalability, automation potential, and numerical reducibility but are typically resource-intensive and have neutral real-world applicability.
  • Figure :
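
The effect described in the Figure 1 caption, where a model ranking changes with the aggregation alone, can be reproduced with a minimal sketch. The questions, topics, per-question results, and difficulty weights below are hypothetical toy values chosen for illustration; they are not ChemBench data, and the functions are not the ChemBench scoring code.

```python
from collections import defaultdict

# Hypothetical toy questions as (topic, difficulty); not taken from ChemBench.
QUESTIONS = [
    ("organic", "basic"),
    ("organic", "basic"),
    ("organic", "basic"),
    ("organic", "basic"),
    ("materials", "advanced"),
    ("materials", "advanced"),
]

# Assumed difficulty weights, picked only for this illustration.
DIFFICULTY_WEIGHT = {"basic": 1, "advanced": 3}

# Per-question outcome for each hypothetical model: 1 = completely correct, 0 = not.
RESULTS = {
    "model_a": [1, 1, 1, 1, 0, 0],
    "model_b": [1, 0, 0, 0, 1, 1],
}

def plain_sum(correct):
    """Number of completely correct answers."""
    return sum(correct)

def difficulty_weighted_sum(correct):
    """Sum of correct answers, each weighted by the question's difficulty."""
    return sum(DIFFICULTY_WEIGHT[d] * c for (_, d), c in zip(QUESTIONS, correct))

def equal_topic_weighting(correct):
    """Mean of per-topic accuracies, so each topic counts equally
    regardless of how many questions it contains."""
    per_topic = defaultdict(list)
    for (topic, _), c in zip(QUESTIONS, correct):
        per_topic[topic].append(c)
    topic_means = [sum(v) / len(v) for v in per_topic.values()]
    return sum(topic_means) / len(topic_means)

def ranking(metric):
    """Model names sorted from best to worst under the given metric."""
    return sorted(RESULTS, key=lambda m: metric(RESULTS[m]), reverse=True)

if __name__ == "__main__":
    for metric in (plain_sum, difficulty_weighted_sum, equal_topic_weighting):
        print(f"{metric.__name__}: {ranking(metric)}")
```

With these toy numbers, `model_a` leads under the plain sum, while `model_b` leads under both the difficulty-weighted sum and equal topic weighting, even though the underlying answers are identical in all three cases.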