Training-Free Active Learning Framework in Materials Science with Large Language Models

Hongchen Wang, Rafael Espinosa Castañeda, Jay R. Werber, Yao Fehlis, Edward Kim, Jason Hattrick-Simpers

TL;DR

The paper tackles data-efficiency in materials discovery by replacing traditional ML surrogates in active learning with a training-free LLM-based framework (LLM-AL). It evaluates two prompt styles—parameter-format and report-format—across four diverse datasets, showing that LLM-AL often reaches optimal candidates with substantially less data than conventional baselines and with robustness to non-determinism. The study provides insights into when prompts should be concise versus descriptive and reveals an exploratory acquisition behavior in LLM-AL that can outperform standard uncertainty-based methods. Overall, LLM-AL emerges as a generalizable, tuning-free approach for efficient, interpretable experiment selection and potential autonomous discovery.

Abstract

Active learning (AL) accelerates scientific discovery by prioritizing the most informative experiments, but traditional machine learning (ML) models used in AL suffer from cold-start limitations and domain-specific feature engineering, restricting their generalizability. Large language models (LLMs) offer a new paradigm by leveraging their pretrained knowledge and universal token-based representations to propose experiments directly from text-based descriptions. Here, we introduce an LLM-based active learning framework (LLM-AL) that operates in an iterative few-shot setting and benchmark it against conventional ML models across four diverse materials science datasets. We explored two prompting strategies: one using concise numerical inputs suited for datasets with more compositional and structured features, and another using expanded descriptive text suited for datasets with more experimental and procedural features to provide additional context. Across all datasets, LLM-AL could reduce the number of experiments needed to reach top-performing candidates by over 70% and consistently outperformed traditional ML models. We found that LLM-AL performs broader and more exploratory searches while still reaching the optima with fewer iterations. We further examined the stability boundaries of LLM-AL given the inherent non-determinism of LLMs and found its performance to be broadly consistent across runs, within the variability range typically observed for traditional ML approaches. These results demonstrate that LLM-AL can serve as a generalizable alternative to conventional AL pipelines for more efficient and interpretable experiment selection and potential LLM-driven autonomous discovery.
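As a rough illustration of the iterative few-shot setting described in the abstract, the sketch below runs a pool-based loop in which already-measured candidates are rendered as text and a selector picks the next experiment. Everything in it is an assumption made for illustration: the function names (`parameter_prompt`, `report_prompt`, `llm_al_loop`, `pick_first`), the prompt wording, and the pool-based setup are hypothetical, and `select` is a stand-in for an LLM call, not the authors' implementation.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Candidate = Dict[str, float]
Shot = Tuple[str, float]  # (text description, measured target)


def parameter_prompt(c: Candidate) -> str:
    """Concise rendering (hypothetical wording): name=value pairs."""
    return ", ".join(f"{k}={v:g}" for k, v in c.items())


def report_prompt(c: Candidate) -> str:
    """Descriptive rendering (hypothetical wording): one sentence per feature."""
    return " ".join(f"The {k} was set to {v:g}." for k, v in c.items())


def pick_first(shots: Sequence[Shot], options: Sequence[str]) -> int:
    """Placeholder selector. A real run would send `shots` and `options`
    to an LLM and parse the index of the proposed experiment."""
    return 0


def llm_al_loop(
    pool: Sequence[Candidate],
    targets: Sequence[float],
    select: Callable[[Sequence[Shot], Sequence[str]], int],
    render: Callable[[Candidate], str],
    n_init: int = 2,
    budget: int = 5,
    seed: int = 42,
) -> List[int]:
    """Return pool indices in the order they were 'measured'."""
    rng = random.Random(seed)
    labeled = rng.sample(range(len(pool)), n_init)  # random initial data
    for _ in range(budget):
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Few-shot context: every measured candidate with its target value.
        shots = [(render(pool[i]), targets[i]) for i in labeled]
        options = [render(pool[i]) for i in unlabeled]
        pick = select(shots, options)
        if not 0 <= pick < len(options):
            raise ValueError("selector returned an invalid option index")
        labeled.append(unlabeled[pick])  # look up the target, i.e. run it
    return labeled
```

No model is trained at any step: the only state carried between iterations is the growing list of measured examples, which is what makes the loop training-free. Swapping `render=parameter_prompt` for `render=report_prompt` changes only how candidates are described to the selector.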

Paper Structure

This paper contains 12 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: The trajectories of the running best performance for the LLM-AL approach across four datasets: (A) matbench_steels, (B) P3HT/CNT, (C) Perovskite, and (D) Membrane. Each dataset has a maximization goal for its target property, except the Perovskite dataset which has a minimization goal. Orange lines indicate runs using the parameter-format input prompt, and green lines indicate runs using the report-format prompt. For each setting, five random seeds and five repeated runs were performed, for a total of 10 runs. The thicker lines represent the average running best trajectory across the 10 runs. Vertical dashed lines mark the mean iteration at which the stopping criterion is first reached, with shaded regions showing one standard deviation. The annotated percentages indicate the fraction of the dataset used to reach the stopping point.
  • Figure 2: The number of iterations required for the LLM-AL approach to reach the optimal target across four datasets: (A) matbench_steels, (B) P3HT/CNT, (C) Perovskite, and (D) Membrane. Blue boxplots represent repeated runs using the same random seed (seed 42, five repeats), while pink boxplots represent runs with different random seeds (seeds 38–42). Individual points show the results for each run, color-coded by seed. Each dataset is divided into two groups corresponding to the parameter-format input prompts and report-format input prompts.
  • Figure 3: The trajectories of the running best performance for LLM-AL and traditional ML models across four datasets: (A) matbench_steels, (B) P3HT/CNT, (C) Perovskite, and (D) Membrane. Bold orange and green lines represent LLM-AL runs using the parameter-format and report-format prompts, respectively. Other colored dashed lines represent traditional ML models and random walk: GPR (blue), RFR (green), XGB (purple), BNN (red), and random walk (pale blue). For LLM-AL, shaded regions indicate the variability across 10 runs, consisting of five repeated runs using a fixed random seed (42) and five distinct random seeds (38–42) used to define the initial data pool. For traditional ML models, shaded regions capture the combined variability across five random seeds (38–42) and a range of UCB trade-off values ($\alpha = 0$ to $5$), which control the exploration–exploitation balance during experiment selection. Vertical dashed lines indicate the latest iteration where the stopping criteria were met, and annotated percentages indicate the fraction of the total dataset required to reach that point.
  • Figure 4: Iterations required to reach the best-performing candidate for LLM-AL compared to traditional ML models across four datasets: (A) matbench_steels, (B) P3HT/CNT, (C) Perovskite, and (D) Membrane. For LLM-AL (left panels), the box plots show the distribution of iterations required to reach the maximum target value across five random seeds (38–42) and five repeated runs at seed 42. For traditional ML models (right panels), the line plots show the mean iterations required to reach the maximum target value as a function of the UCB exploration–exploitation trade-off parameter, $\alpha$, with shaded regions representing the standard deviation across the same five seeds.
  • Figure 5: Average cumulative L2 distance traveled in standardized feature space as a function of iteration across four datasets: (A) matbench_steels, (B) P3HT/CNT, (C) Perovskite, and (D) Membrane. Each line represents the mean cumulative L2 trajectory length across multiple random seeds, with shaded regions showing the standard deviation. The figure compares LLM-AL using parameter and report prompting formats against traditional ML models (GPR, RFR, XGB, BNN) and a Random Walk baseline.
  • ...and 1 more figure
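The quantities named in the captions above (running best, the UCB trade-off $\alpha$, and cumulative L2 distance in standardized feature space) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, `ucb` uses the common form $\mu + \alpha\sigma$ (the exact expression used in the paper is not given here), and standardization is assumed to be a per-feature z-score over the whole candidate pool.

```python
import math
from itertools import accumulate
from statistics import mean, pstdev
from typing import List, Sequence


def running_best(values: Sequence[float], maximize: bool = True) -> List[float]:
    """Best target seen so far at each iteration (min for minimization goals)."""
    return list(accumulate(values, max if maximize else min))


def ucb(mu: float, sigma: float, alpha: float) -> float:
    """Upper confidence bound, assumed form: alpha = 0 is pure exploitation,
    larger alpha weights the surrogate's predictive uncertainty more."""
    return mu + alpha * sigma


def standardize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Z-score each feature column; constant columns map to 0."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # avoid division by zero
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]


def cumulative_l2(path: Sequence[Sequence[float]]) -> List[float]:
    """Cumulative Euclidean distance between consecutively selected points.
    `path` holds standardized feature vectors in selection order."""
    steps = [math.dist(a, b) for a, b in zip(path, path[1:])]
    return list(accumulate(steps))
```

Under these definitions, a longer cumulative L2 trajectory at a given iteration means the selected candidates were spread further across feature space. For a minimization target such as the Perovskite dataset, `running_best(values, maximize=False)` applies, and a UCB-style score would typically be replaced by its lower-bound counterpart.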