Large language models, physics-based modeling, experimental measurements: the trinity of data-scarce learning of polymer properties
Ning Liu, Siavash Jafarzadeh, Brian Y. Lattimer, Shuna Ni, Jim Lua, Yue Yu
TL;DR
This work tackles data-scarce learning of polymer properties by combining large language models with physics-based simulations and limited experiments. The Trinity framework generates physics-grounded synthetic data with a Group Contribution model and a reduced-order Fire Dynamics Simulator, then applies a two-phase training strategy: phase-1 supervised pretraining on the synthetic data to impart physics priors, followed by phase-2 finetuning on scarce experimental measurements. The approach substantially improves predictions of two cone calorimeter metrics, time to ignition ($t_{ig}$) and peak heat release rate ($p_{HRR}$): phase-1 pretraining provides a strong starting point, phase-2 finetuning delivers large error reductions, and uncertainty quantification confirms the reliability of the synthetic data. The framework is model-agnostic and generalizable, enabling data-efficient learning across scientific problems by interlacing LLMs, physics-based modeling, and experimental data.
Abstract
Large language models (LLMs) hold promise as a fast and accurate material modeling paradigm for evaluation, analysis, and design. Their vast number of trainable parameters requires a wealth of data to achieve accuracy and mitigate overfitting. However, experimental measurements are often limited and too costly to obtain in quantities sufficient for finetuning. To this end, we present a physics-based training pipeline that tackles the pathology of data scarcity. The core enabler is a physics-based modeling framework that generates abundant synthetic data to align the LLM to a physically consistent initial state before finetuning. Our framework features a two-phase training strategy: (1) supervised pretraining on the abundant but less accurate synthetic data, and (2) finetuning of the phase-1 model on the limited experimental data. We empirically demonstrate that supervised pretraining is vital to obtaining accurate finetuned LLMs, through the lens of learning polymer flammability metrics, for which cone calorimeter data are sparse.
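The two-phase strategy described above can be illustrated with a deliberately small toy problem. The sketch below is not the authors' implementation: it swaps the LLM for a linear model trained by gradient descent, and every name, dimension, and noise level in it (`w_true`, `w_phys`, the sample counts, the learning rates) is an assumption chosen only for illustration. It shows the mechanism only: pretraining on abundant, slightly biased synthetic labels, then finetuning on a handful of accurate measurements, compared with training on those few measurements from scratch.

```python
import numpy as np


def train(w, X, y, lr, epochs):
    """Full-batch gradient descent on mean squared error for a linear model."""
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w


def mse(w, X, y):
    """Mean squared prediction error of weights w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))


rng = np.random.default_rng(0)
d = 20  # number of input features (hypothetical)

# Hypothetical ground truth and a biased "physics-based" surrogate of it.
w_true = rng.normal(size=d)
w_phys = w_true + 0.1 * rng.normal(size=d)

# Phase-1 data: abundant but less accurate synthetic labels.
X_syn = rng.normal(size=(2000, d))
y_syn = X_syn @ w_phys

# Phase-2 data: scarce "experimental" measurements (fewer samples than features).
X_exp = rng.normal(size=(8, d))
y_exp = X_exp @ w_true + 0.01 * rng.normal(size=8)

# Held-out data drawn from the ground truth, used only for evaluation.
X_test = rng.normal(size=(500, d))
y_test = X_test @ w_true

w0 = np.zeros(d)
w_phase1 = train(w0, X_syn, y_syn, lr=0.05, epochs=200)        # supervised pretraining
w_phase2 = train(w_phase1, X_exp, y_exp, lr=0.01, epochs=200)  # finetuning
w_scratch = train(w0, X_exp, y_exp, lr=0.01, epochs=200)       # no pretraining

err_phase1 = mse(w_phase1, X_test, y_test)
err_phase2 = mse(w_phase2, X_test, y_test)
err_scratch = mse(w_scratch, X_test, y_test)

print(f"phase-1 only:        {err_phase1:.4f}")
print(f"phase-1 + phase-2:   {err_phase2:.4f}")
print(f"scratch (no phase-1): {err_scratch:.4f}")
```

In this toy setting the eight measurements cannot constrain all twenty weights, so the model trained from scratch leaves most directions at their zero initialization, whereas the pretrained model starts from the surrogate's values in those directions. This is only an analogy for the role the abstract assigns to supervised pretraining; it says nothing about the magnitude of the gains reported for the actual LLM.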
