
Discovery of Interpretable Physical Laws in Materials via Language-Model-Guided Symbolic Regression

Yifeng Guan, Chuyi Liu, Dongzhan Zhou, Lei Bai, Wan-jian Yin, Jingyuan Li, Mao Su

TL;DR

A framework that uses the scientific knowledge embedded in large language models to guide symbolic regression, enabling efficient identification of physical laws in data. It identifies a set of novel formulas for bulk modulus, band gap, and oxygen evolution reaction activity.

Abstract

Discovering interpretable physical laws from high-dimensional data is a fundamental challenge in scientific research. Traditional methods, such as symbolic regression, often produce complex, unphysical formulas when searching a vast space of possible forms. We introduce a framework that guides the search process by leveraging the embedded scientific knowledge of large language models, enabling efficient identification of physical laws in the data. We validate our approach by modeling key properties of perovskite materials. Our method mitigates the combinatorial explosion commonly encountered in traditional symbolic regression, reducing the effective search space by a factor of approximately $10^5$. We identify a set of novel formulas for bulk modulus, band gap, and oxygen evolution reaction activity, which not only provide meaningful physical insights but also outperform previous formulas in accuracy and simplicity.

Paper Structure (9 sections, 7 equations, 5 figures)

Figures (5)

  • Figure 1: Overview of the LangLaw framework. The workflow is organized into four interconnected phases forming a closed loop. LLM inference (top left): the large language model (LLM) acts as a reasoning agent. It analyzes the data description (e.g., crystal structure) and previous experience to generate search constraints, including feature selection, iteration counts, and tree depth. Regression (top right): these parameters act as control signals (purple dashed lines) to guide the symbolic regression (PySR) engine. The engine performs evolutionary searches using a parallel island model, evolving formula populations via genetic operations such as crossover and mutation (details are given in Supplementary Note S1). Evaluation (bottom right): candidate formulas on the Pareto front are screened, and the optimal formula is selected by a score function that balances fitting loss and complexity. If the error is low enough or the final round has been reached, the formula is output; otherwise, it is added to the formula library. Experience (bottom left): the selected formulas and their performance metrics are stored in the formula library. This knowledge is formatted into prompts to update the LLM's experience, refining the search strategy for subsequent rounds.
  • Figure 2: Performance comparison on the Perovskite Bulk Modulus dataset. The plot shows the complexity and mean absolute error (MAE) of the bulk modulus formulas found by different methods: Verma and Kumar's formula (gray points), LangLaw (green points), HI-SISSO (blue points), and LLM-SR (yellow points). The gray line is the Pareto front.
  • Figure 3: Evaluation of out-of-distribution generalization capability. The bar chart compares the absolute prediction error of our discovered linear formula against the high-complexity HI-SISSO model on 10 screened perovskite materials not seen during training. Our method (green bars) consistently yields lower prediction errors across diverse compositions compared to HI-SISSO (blue bars), demonstrating superior transferability and robustness in data-scarce scenarios.
  • Figure 4: Performance comparison on the Band Gap dataset. The plot shows the complexity and mean absolute error (MAE) of the band gap formulas found by LangLaw (green points) and SISSO (blue points). The gray line is the Pareto front.
  • Figure 5: Performance comparison on the OER activity dataset. The plot shows the complexity and mean absolute error (MAE) of the OER activity formulas found by LangLaw (green points) and GPSR (blue points). The gray line is the Pareto front.
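
The captions above refer repeatedly to a Pareto front over formula complexity and mean absolute error, and to a score function that balances the two. The Python sketch below illustrates one way such a screening step could look. It is not the LangLaw implementation: the Candidate type, the weighted-sum score, the weight value, and all numbers are assumptions made up for illustration, and the captions do not specify the paper's actual score function.

```python
from math import inf
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A candidate formula with its complexity and fitting error (hypothetical type)."""
    expr: str
    complexity: int
    mae: float


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no simpler-or-equal candidate matches or beats in MAE."""
    front: List[Candidate] = []
    best_mae = inf
    # Walk from simplest to most complex; keep a formula only if it
    # strictly improves on the best error seen so far.
    for c in sorted(cands, key=lambda c: (c.complexity, c.mae)):
        if c.mae < best_mae:
            front.append(c)
            best_mae = c.mae
    return front


def select(front: List[Candidate], weight: float = 0.1) -> Candidate:
    """Pick the formula minimizing an ASSUMED score: MAE + weight * complexity."""
    return min(front, key=lambda c: c.mae + weight * c.complexity)


# Made-up placeholder data, not results from the paper.
candidates = [
    Candidate("f1", 3, 12.0),
    Candidate("f2", 5, 8.0),
    Candidate("f3", 5, 9.5),   # dominated by f2 (same complexity, higher MAE)
    Candidate("f4", 9, 8.5),   # dominated by f2 (more complex, higher MAE)
    Candidate("f5", 11, 7.9),
]

front = pareto_front(candidates)
best = select(front)
print([c.expr for c in front], best.expr)
```

With these placeholder values, the front keeps f1, f2, and f5, and the assumed score selects f2, the formula that buys most of the error reduction at low complexity. A larger weight would favor simpler formulas, a smaller one more accurate ones.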