Less can be more for predicting properties with large language models

Nawaf Alampara, Santiago Miret, Kevin Maik Jablonka

TL;DR

This work investigates the fundamental ability of large language models to learn coordinate information when predicting properties from coordinate-category data, a common setup in materials science. Using a physics-inspired synthetic framework with the controllable potential E(α)=αE_category+(1−α)E_coordinate and a comprehensive MatText benchmarking suite, the authors show that LLMs reliably learn category patterns but remain geometrically blind to spatial arrangements, a gap that persists across architectures, model scales, and data volumes. They contrast LLMs with n-gram baselines, demonstrate that scaling data or model size does not close the coordinate-learning gap, and show that graph-based geometric models (e.g., GNNs) significantly outperform LLMs on coordinate-dominated properties. The results advocate an architecture-aware approach to scientific prediction tasks, emphasizing the importance of inductive biases and geometry-aware models over universal language-model replacements for materials property prediction. The open MatText framework and accompanying analyses provide concrete guidelines for model selection and representation choice in geometry-rich scientific domains.
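The tunable potential above can be sketched in a few lines of code. This is a minimal illustration only: the specific functional forms chosen here for E_category (a sum of per-element reference energies) and E_coordinate (a Lennard-Jones-like pairwise term) are assumptions for demonstration, not the paper's exact definitions.

```python
import numpy as np

def category_energy(elements, element_energy):
    """Hypothetical category term: sum of per-element reference energies."""
    return sum(element_energy[e] for e in elements)

def coordinate_energy(positions):
    """Hypothetical coordinate term: Lennard-Jones-like pairwise potential."""
    e = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(positions[i] - positions[j])
            e += 4.0 * ((1.0 / r) ** 12 - (1.0 / r) ** 6)
    return e

def total_energy(alpha, elements, positions, element_energy):
    """E(alpha) = alpha * E_category + (1 - alpha) * E_coordinate."""
    return alpha * category_energy(elements, element_energy) + \
           (1.0 - alpha) * coordinate_energy(positions)
```

Sliding α from 0 to 1 moves the target from a purely coordinate-dominated property to a purely category-dominated one, which is what lets the study isolate where model learning breaks down.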

Abstract

Predicting properties from coordinate-category data -- sets of vectors paired with categorical information -- is fundamental to computational science. In materials science, this challenge manifests as predicting properties like formation energies or elastic moduli from crystal structures comprising atomic positions (vectors) and element types (categorical information). While large language models (LLMs) have increasingly been applied to such tasks, with researchers encoding structural data as text, optimal strategies for achieving reliable predictions remain elusive. Here, we report fundamental limitations in LLMs' ability to learn from coordinate information in coordinate-category data. Through systematic experiments using synthetic datasets with tunable coordinate and category contributions, combined with a comprehensive benchmarking framework (MatText) spanning multiple representations and model scales, we find that LLMs consistently fail to capture coordinate information while excelling at category patterns. This geometric blindness persists regardless of model size (up to 70B parameters), dataset scale (up to 2M structures), or text representation strategy. Our findings suggest immediate practical implications: for materials property prediction tasks dominated by structural effects, specialized geometric architectures consistently outperform LLMs by significant margins, as evidenced by a clear "GNN-LM wall" in performance benchmarks. Based on our analysis, we provide concrete guidelines for architecture selection in scientific machine learning, while highlighting the critical importance of understanding model inductive biases when tackling scientific prediction problems.

Paper Structure

This paper contains 45 sections, 12 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Illustration of a data point at various levels of information. (a) Coordinate information consists of continuous positional coordinates. (b) Category information consists of discrete categorical labels. (c) Coordinate-category data require joint modeling of both positional coordinates and discrete categorical labels.
  • Figure 2: Illustration of the coordinate-category cliff (A), and the coordinate contribution $L_{\mathrm{coord}}$ and category contribution $L_{\mathrm{cat}}$ for different datasets (B). The two contributions are defined as $L_{\mathrm{coord}} = \sum_{\alpha \in \alpha_g} \text{loss}(\alpha) - 3 \times \text{loss}(0.5)$ and $L_{\mathrm{cat}} = \sum_{\alpha \in \alpha_c} \text{loss}(\alpha) - 3 \times \text{loss}(0.5)$, where $\alpha_g = \{0, 0.2, 0.4\}$ and $\alpha_c = \{0.6, 0.8, 1.0\}$. Because each sum includes three $\alpha$ values, we subtract $\text{loss}(0.5)$ three times, which effectively subtracts the loss at $\alpha=0.5$ from each individual loss contributing to the sum. The plot on the right shows $L_{\mathrm{coord}}$ and $L_{\mathrm{cat}}$ for six different datasets. For all datasets, the language model shows a positive cliff, i.e., $L_{\mathrm{coord}} > L_{\mathrm{cat}}$, indicating a gap in learning coordinate computations compared to category computations. The magnitude of the cliff also varies across datasets.
  • Figure 3: Prediction error for hypothetical potential-energy prediction as a function of how the potential is binned, across different datasets. The figure shows results for models tasked with predicting hypothetical potential energy, analyzed across three distinct datasets. The error is plotted against the number of bins used to discretize the pairwise distance (logarithmic scale). Fewer bins correspond to a coarser and generally easier prediction task for the structures within a dataset, while more bins represent a finer-grained, more challenging task. Each line corresponds to a different text input (with a different level of information) describing the data points: categorical (purple), and coordinate plus categorical (red). Solid lines indicate Transformer-based models, and dotted lines their n-gram counterparts. The language models behave almost identically to the n-gram models, suggesting that they inherit n-gram properties on tasks involving coordinate-category data.
  • Figure 4: Overview of the MatText framework. MatText is a holistic platform that supports end-to-end language modeling of materials: creation of representations, model training, and streamlined analysis of results. MatText enables the creation of text representations of crystal structures, offering nine different representations, each with distinct inductive biases. These inductive biases explicitly encode diverse types of information, such as bonding, periodicity, and symmetry, as shown in the middle section. The framework also supports various tokenization methods, such as atom-level and representation-specific tokenization, as well as different ways to tokenize numbers. MatText facilitates pretraining and finetuning of both causal and masked language models, and features modules for scaling up model and data sizes. Additionally, MatText provides tools for analyzing attention mechanisms, assessing the contribution of tokens to predictions based on attention scores, and performing analysis using hypothetical potentials.
  • Figure 5: Predictive performance of language models on materials property prediction tasks (shear modulus ($\mu$), bulk modulus ($K$), and perovskite formation energy ($E_f$)) for different representations. Performance is measured by the prediction error, with lower values indicating better performance. Representations are grouped by type: composition-based (orange), local environment-based (red), and geometry-based (purple). Across all three properties, local environment-based representations generally achieve the best performance, while explicit geometric representations show limited improvement or even degraded performance. Notably, the SLICES representation, which lacks explicit coordinate information, performs comparably to geometry-aware representations like Cif P$_1$, suggesting that current language models do not effectively leverage explicit coordinate information for materials property prediction. The error bars indicate the standard deviation across five-fold cross-validation. A notable exception is the perovskites dataset, where there is a large difference between representations. This dataset has few unique chemical environments compared to the shear and bulk modulus datasets (see the figure on structural variations for all properties), which is why most of the variance cannot be explained by composition information alone.
  • ...and 7 more figures
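The coordinate and category contributions defined in the Figure 2 caption reduce to a small computation over per-α losses. The sketch below assumes the losses at each α value are already available as a dictionary; the helper simply applies the caption's definitions.

```python
def contributions(loss):
    """Compute L_coord and L_cat from a dict mapping alpha -> loss,
    following the definitions in the Figure 2 caption: each sum subtracts
    loss(0.5) once per alpha value it includes."""
    alpha_g = [0.0, 0.2, 0.4]   # coordinate-dominated regime
    alpha_c = [0.6, 0.8, 1.0]   # category-dominated regime
    baseline = loss[0.5]
    l_coord = sum(loss[a] - baseline for a in alpha_g)
    l_cat = sum(loss[a] - baseline for a in alpha_c)
    return l_coord, l_cat
```

A positive difference `l_coord - l_cat` corresponds to the coordinate-category cliff: the model pays a larger loss penalty in the coordinate-dominated regime than in the category-dominated one.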