Large language models in materials science and the need for open-source approaches
Fengxu Yang, Weitong Chen, Jack D. Evans
TL;DR
This paper surveys how large language models are transforming materials science across data extraction, predictive modeling, and autonomous experimentation. It contrasts closed-source and open-source LLMs, showing that open-source ecosystems can match the performance of closed-source models while enhancing transparency, reproducibility, and data privacy. Concrete advances include sequence-aware data extraction and platform-level data integration (e.g., MOF-ChemUnity, Material String encoding) and the emergence of agent-based discovery workflows (ChatMOF, Coscientist, ChemAgents, MOFGen). It also highlights the difficulty of evaluating autonomous, multi-step reasoning systems and calls for standardized benchmarks and open data to foster trustworthy, scalable AI-driven discovery in materials science.
Abstract
Large language models (LLMs) are rapidly transforming materials science. This review examines recent LLM applications across the materials discovery pipeline, focusing on three key areas: mining scientific literature, predictive modelling, and multi-agent experimental systems. We highlight how LLMs extract valuable information such as synthesis conditions from text, learn structure-property relationships, and coordinate agentic systems that integrate computational tools and laboratory automation. While progress has depended largely on closed-source commercial models, our benchmark results demonstrate that open-source alternatives can match their performance while offering greater transparency, reproducibility, cost-effectiveness, and data privacy. As open-source models continue to improve, we advocate their broader adoption to build accessible, flexible, and community-driven AI platforms for scientific discovery.
