Perspective: Towards sustainable exploration of chemical spaces with machine learning

Leonardo Medrano Sandonas, David Balcells, Anton Bochkarev, Jacqueline M. Cole, Volker L. Deringer, Werner Dobrautz, Adrian Ehrenhofer, Thorben Frank, Pascal Friederich, Rico Friedrich, Janine George, Luca Ghiringhelli, Alejandra Hinostroza Caldas, Veronika Juraskova, Hannes Kneiding, Yury Lysogorskiy, Johannes T. Margraf, Hanna Türk, Anatole von Lilienfeld, Milica Todorović, Alexandre Tkatchenko, Mariana Rossi, Gianaurelio Cuniberti

Abstract

Artificial intelligence is transforming molecular and materials science, but its growing computational and data demands raise critical sustainability challenges. In this Perspective, we examine resource considerations across the AI-driven discovery pipeline, from quantum-mechanical (QM) data generation and model training to automated, self-driving research workflows. The discussion builds on the "SusML workshop: Towards sustainable exploration of chemical spaces with machine learning" held in Dresden, Germany. In this context, the availability of large quantum datasets has enabled rigorous benchmarking and rapid methodological progress, while also incurring substantial energy and infrastructure costs. We highlight emerging strategies to enhance efficiency, including general-purpose machine learning (ML) models, multi-fidelity approaches, model distillation, and active learning. Moreover, incorporating physics-based constraints within hierarchical workflows, where fast ML surrogates are applied broadly and high-accuracy QM methods are used selectively, can further optimize resource use without compromising reliability. Equally important is bridging the gap between idealized computational predictions and real-world conditions by accounting for synthesizability and multi-objective design criteria, which is essential for practical impact. Finally, we argue that sustainable progress will rely on open data and models, reusable workflows, and domain-specific AI systems that maximize scientific value per unit of computation, enabling efficient and responsible discovery of technological materials and therapeutics.

Paper Structure

This paper contains 25 sections, 8 figures.

Figures (8)

  • Figure 1: Scheme illustrating the sustainability topics discussed at the SusML workshop (Dresden, Germany) and in this Perspective, spanning the AI-driven discovery pipeline from quantum-mechanical (QM) data generation and model training to automated, self-driving research workflows.
  • Figure 2: Standard workflow for developing predictive models of molecular and materials properties, combining quantum-inspired representations with machine-learning (ML) techniques (e.g., neural networks, kernel methods, and tree-based models). Certain ML models can be made interpretable through additional explainability techniques, such as symbolic regression, attention mechanisms, and SHAP (SHapley Additive exPlanations) analysis.
  • Figure 3: Data-driven discovery of traditional vs. emerging 2D materials. (a) 2D materials such as graphene are traditionally exfoliated from naturally layered vdW compounds such as graphite. (b) Large computational materials databases have already fueled the discovery of several thousand such systems. (c) For the emerging class of non-vdW 2D materials derived from non-layered crystals such as hematite ($\alpha$-Fe$_2$O$_3$), data-driven research and ML-based modelling are still at an early stage. Colors: C brown, Fe gray, and O red.
  • Figure 4: General scheme of structure generation with generative models. Generative models are trained to learn the underlying distribution of a dataset of molecules or materials encoded in a representation, together with their conditioning label on desired properties (gray arrows). From a trained model, new data with the desired properties can be sampled from noise, along with the respective conditioning (blue arrows). The included histogram is reproduced from Ref. [turkAssessingDeepGenerative].
  • Figure 5: Classical and AI-powered approaches along the PSPP relationship [Ehrenhofer2025LLM_paper]. Classical approaches (i) to connect the different parts range from quantum theory and molecular dynamics approaches up to continuum models. The AI-powered PSPP-chain reasoning process (ii) offers an alternative approach, where every connection is represented by a data-driven model. Models can also leapfrog steps, such as Processing-Property models [Wang2024prediction_hydrogel].
  • ...and 3 more figures