Large language models, physics-based modeling, experimental measurements: the trinity of data-scarce learning of polymer properties

Ning Liu, Siavash Jafarzadeh, Brian Y. Lattimer, Shuna Ni, Jim Lua, Yue Yu

TL;DR

This work tackles data-scarce learning of polymer properties by combining large language models (LLMs) with physics-based simulation and limited experimental measurements. The proposed trinity framework generates physics-grounded synthetic data with a Group Contribution model and a reduced-order Fire Dynamics Simulator, then trains in two phases: phase-1 supervised pretraining on the synthetic data imparts physics priors, and phase-2 finetuning on scarce experiments refines the model with real measurements. The approach substantially improves predictions of two cone calorimeter metrics, time to ignition ($t_{ig}$) and peak heat release rate ($p_{HRR}$): phase-1 pretraining provides a strong starting point and phase-2 finetuning delivers large error reductions, while uncertainty quantification supports the reliability of the synthetic data. The framework is model-agnostic and generalizable, enabling data-efficient learning across scientific problems by interlacing LLMs, physics-based modeling, and experimental data.

Abstract

Large language models (LLMs) hold promise as a fast and accurate material modeling paradigm for evaluation, analysis, and design. Their vast number of trainable parameters demands a wealth of data to achieve accuracy and mitigate overfitting. However, experimental measurements are often limited and too costly to obtain in the quantities needed for finetuning. To this end, we present a physics-based training pipeline that tackles the pathology of data scarcity. The core enabler is a physics-based modeling framework that generates a large volume of synthetic data to align the LLM to a physically consistent initial state before finetuning. Our framework features a two-phase training strategy: (1) supervised pretraining on synthetic data that is abundant but less accurate, and (2) finetuning the phase-1 model on limited experimental data. We empirically demonstrate that supervised pretraining is vital to obtaining accurate finetuned LLMs, through the lens of learning polymer flammability metrics, where cone calorimeter data is sparse.
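
The two-phase strategy can be illustrated with a deliberately small sketch. This is not the authors' implementation: a two-parameter linear model stands in for the LLM, a hypothetical biased `surrogate` function stands in for the physics-based simulator, and every number (slopes, noise levels, sample counts, learning rate, step counts) is an assumption chosen only for illustration. The sketch compares a model that is pretrained on abundant synthetic data and then finetuned on five "experimental" points against one trained on those five points from scratch with the same finetuning budget.

```python
import random


def mse(params, data):
    """Mean squared error of the linear model y = w*x + b on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(params, data, lr, steps):
    """Full-batch gradient descent on the mean squared error."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def truth(x):
    """Hypothetical 'real' relation that experiments would measure."""
    return 2.0 * x + 1.0


def surrogate(x):
    """Hypothetical physics-based surrogate: cheap to evaluate but biased."""
    return 1.8 * x + 0.7


rng = random.Random(0)  # fixed seed so the run is reproducible

# Abundant but less accurate synthetic data (assumed noise level 0.1).
synthetic = []
for _ in range(200):
    x = rng.random()
    synthetic.append((x, surrogate(x) + rng.gauss(0.0, 0.1)))

# Scarce experimental data: five measurements with small assumed noise.
experiments = [(x, truth(x) + rng.gauss(0.0, 0.02))
               for x in (0.1, 0.3, 0.5, 0.7, 0.9)]

# Noise-free held-out grid used only for evaluation.
held_out = [(i / 20, truth(i / 20)) for i in range(21)]

# Phase 1: supervised pretraining on synthetic data.
phase1 = train((0.0, 0.0), synthetic, lr=0.1, steps=500)
# Phase 2: finetuning the phase-1 model on the scarce experiments.
phase2 = train(phase1, experiments, lr=0.1, steps=50)
# Baseline: same finetuning budget, but no pretraining.
scratch = train((0.0, 0.0), experiments, lr=0.1, steps=50)

print("phase-1 only     :", mse(phase1, held_out))
print("phase-1 + phase-2:", mse(phase2, held_out))
print("from scratch     :", mse(scratch, held_out))
```

In this toy setting the pretrained model starts close to the true relation, so the short finetuning run only has to correct the surrogate's bias, whereas the from-scratch model has to learn everything from five points. How far that intuition carries over to an LLM with many trainable parameters is the question the paper studies empirically.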

Paper Structure

This paper contains 4 sections, 3 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: An overview of the proposed trinity framework for data-scarce learning of LLMs.
  • Figure 2: Using group contribution to generate structurally admissible hypothetical polymers with physically meaningful material properties.
  • Figure 3: Demonstration of the synthetic data generation workflow.
  • Figure 4: Quantifying uncertainties on the generated synthetic data: (a) general workflow, (b) the computed probability distributions of physics-based variables, and (c) trust region vs. experimental measurement plots.
  • Figure 5: MoLFormer training results for cone calorimeter metrics prediction.