Probing Materials Knowledge in LLMs: From Latent Embeddings to Reliable Predictions

Vineeth Venugopal, Soroush Mahjoubi, Elsa Olivetti

Abstract

Large language models are increasingly applied to materials science, yet fundamental questions remain about their reliability and knowledge encoding. Evaluating 25 LLMs across four materials science tasks (over 200 base and fine-tuned configurations), we find that output modality fundamentally determines model behavior. For symbolic tasks, fine-tuning converges to consistent, verifiable answers with reduced response entropy, while for numerical tasks, fine-tuning improves prediction accuracy but models remain inconsistent across repeated inference runs, limiting their reliability as quantitative predictors. For numerical regression, we find that better performance can be obtained by extracting embeddings directly from intermediate transformer layers than from model text output, revealing an "LLM head bottleneck," though this effect is property- and dataset-dependent. Finally, we present a longitudinal study of GPT model performance in materials science, tracking four models over 18 months and observing 9--43% performance variation that poses reproducibility challenges for scientific applications.
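
The abstract uses response entropy as its measure of consistency across repeated inference runs but does not define it here. A minimal sketch, assuming it means the Shannon entropy of the empirical distribution of answers a model returns for one prompt, could look like the following. The function name `response_entropy` and the example answers are hypothetical, and numerical outputs would presumably need rounding or binning before counting.

```python
from collections import Counter
from math import log2


def response_entropy(responses):
    """Shannon entropy (in bits) of the empirical answer distribution.

    Assumption: entropy is computed over exact-match answer strings
    collected from repeated inference runs on the same prompt.
    """
    total = len(responses)
    if total == 0:
        raise ValueError("need at least one response")
    counts = Counter(responses)
    # Sum p * log2(1/p) over the distinct answers.
    return sum((c / total) * log2(total / c) for c in counts.values())


# Hypothetical answers from five runs of a crystal system query.
consistent_runs = ["cubic"] * 5
mixed_runs = ["cubic", "cubic", "hexagonal", "cubic", "tetragonal"]

entropy_consistent = response_entropy(consistent_runs)
entropy_mixed = response_entropy(mixed_runs)
```

Under this definition a model that always returns the same answer has zero entropy, and entropy grows as answers spread over more alternatives.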

Paper Structure (23 sections, 1 equation, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Performance versus response entropy for symbolic tasks. MatKG link prediction (left) and crystal system classification (right). Circles indicate base models; squares indicate fine-tuned models. Marker size is proportional to model parameter count, and colors denote model families. Dashed lines show mean performance and entropy for base models. Fine-tuning dramatically improves accuracy while reducing response entropy across all model families.
  • Figure 2: Performance versus response entropy for numerical regression tasks. Bandgap prediction (left) and dielectric constant prediction (right). Visualization follows the same conventions as Figure 1. Lower RMSE indicates better performance. Fine-tuning generally reduces prediction error, with variable effects on entropy across models.
  • Figure 3: Cross-task transfer matrix. Each cell shows the effect of fine-tuning on a source task (columns) when evaluated on a target task (rows), compared to base model performance. Color coding: dark green (+2) indicates statistically significant improvement; light green (+1) indicates improvement within noise; light red (--1) indicates degradation within noise; dark red (--2) indicates significant degradation; grey (0) indicates no data. Statistical significance is determined by comparison to base model standard deviation across inference runs.
  • Figure 4: Factors determining MatKG link prediction accuracy. (a) Accuracy binned by subject frequency (wide bars) and object answer frequency (narrow bars) for both base and fine-tuned models. The fine-tuned model shows strong monotonic dependence on object answer frequency while subject frequency has no effect; the base model remains flat at $\sim$4--5% regardless of frequency. (b) Performance breakdown by relation target category, showing that Descriptors (DSC) and Applications (APL) are easiest to predict ($\sim$70% accuracy) while Symmetry Phase Labels (SPL), Synthesis Methods (SMT) and Materials (CHM) are harder ($\sim$55%).
  • Figure 5: Layer-wise embedding probes for property prediction. Test RMSE as a function of transformer layer for bandgap (top row) and dielectric constant (bottom row) across three model families, shown for ridge regression (left) and two-layer neural network (right). All three probe architectures (ridge, single-layer NN, two-layer NN) were tested with consistent results; the single-layer NN is omitted for clarity. Horizontal dashed lines indicate fine-tuned LLM text generation performance. For bandgap, intermediate layer embeddings match or exceed text output performance, suggesting a verbalization bottleneck. For dielectric constant, embeddings consistently underperform text generation by approximately 3$\times$, indicating property-dependent knowledge encoding.
  • ...and 2 more figures
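
The five-level color coding described for the cross-task transfer matrix (Figure 3) can be illustrated with a short sketch. The caption says only that significance is judged against the base model's standard deviation across inference runs; the one-standard-deviation threshold, the handling of an exact tie, and the function name `transfer_code` are assumptions made here for illustration.

```python
def transfer_code(base_mean, base_std, tuned_mean, higher_is_better=True):
    """Map a fine-tuned vs. base comparison to a code in {-2, -1, 0, +1, +2}.

    Assumptions (not from the source): a change counts as significant when
    it exceeds one base-model standard deviation, and an exact tie is
    reported as +1.
    """
    if base_mean is None or base_std is None or tuned_mean is None:
        return 0  # grey cell: no data
    delta = tuned_mean - base_mean
    if not higher_is_better:
        delta = -delta  # for RMSE-style metrics, a decrease is an improvement
    significant = abs(delta) > base_std
    if delta >= 0:
        return 2 if significant else 1
    return -2 if significant else -1


# Hypothetical numbers: accuracy (higher is better) and RMSE (lower is better).
accuracy_cell = transfer_code(0.05, 0.01, 0.60)
rmse_cell = transfer_code(1.20, 0.10, 1.25, higher_is_better=False)
missing_cell = transfer_code(None, None, 0.50)
```

Here `accuracy_cell` is a significant improvement (+2), `rmse_cell` is a degradation within noise (-1), and `missing_cell` is a grey cell (0).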