Less can be more for predicting properties with large language models
Nawaf Alampara, Santiago Miret, Kevin Maik Jablonka
TL;DR
This work investigates the fundamental ability of large language models (LLMs) to learn coordinate information when predicting properties from coordinate-category data, a common setup in materials science. Using a physics-inspired synthetic framework with the controllable potential E(α) = α E_category + (1 − α) E_coordinate and the comprehensive MatText benchmarking suite, the authors show that LLMs reliably learn category patterns but remain geometrically blind to spatial arrangements, a gap that persists across architectures, model scales, and data volumes. They contrast LLMs with n-gram baselines, demonstrate that scaling data or model size does not close the coordinate-learning gap, and show that graph-based geometric models (e.g., GNNs) significantly outperform LLMs on coordinate-dominated properties. The results advocate an architecture-aware approach to scientific prediction tasks, emphasizing the importance of inductive biases and geometry-aware models rather than treating language models as universal replacements for materials-property prediction. The open MatText framework and accompanying analyses provide concrete guidelines for model selection and representation choice in geometry-rich scientific domains.
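A minimal sketch of the tunable synthetic target described above, in Python. The specific energy terms used here (a per-element lookup for the categorical part and a pairwise repulsion for the geometric part) are illustrative assumptions, not the paper's exact definitions; only the α-interpolation E(α) = α E_category + (1 − α) E_coordinate comes from the text.

```python
import numpy as np

def category_energy(elements, element_energies):
    """Sum of per-element reference energies (purely categorical signal)."""
    return sum(element_energies[e] for e in elements)

def coordinate_energy(positions):
    """Toy pairwise repulsion that depends only on geometry (assumed form)."""
    positions = np.asarray(positions, dtype=float)
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(positions), k=1)  # unique pairs, no self-terms
    return float(np.sum(1.0 / dists[iu]))

def synthetic_energy(elements, positions, element_energies, alpha):
    """E(alpha) = alpha * E_category + (1 - alpha) * E_coordinate."""
    return (alpha * category_energy(elements, element_energies)
            + (1 - alpha) * coordinate_energy(positions))
```

With α = 1 the target depends only on which elements are present; with α = 0 it depends only on their spatial arrangement, which is what lets the framework isolate how much of each signal a model actually learns.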
Abstract
Predicting properties from coordinate-category data -- sets of vectors paired with categorical information -- is fundamental to computational science. In materials science, this challenge manifests as predicting properties like formation energies or elastic moduli from crystal structures comprising atomic positions (vectors) and element types (categorical information). While large language models (LLMs) have increasingly been applied to such tasks, with researchers encoding structural data as text, optimal strategies for achieving reliable predictions remain elusive. Here, we report fundamental limitations in LLMs' ability to learn from coordinate information in coordinate-category data. Through systematic experiments using synthetic datasets with tunable coordinate and category contributions, combined with a comprehensive benchmarking framework (MatText) spanning multiple representations and model scales, we find that LLMs consistently fail to capture coordinate information while excelling at category patterns. This geometric blindness persists regardless of model size (up to 70B parameters), dataset scale (up to 2M structures), or text representation strategy. Our findings have immediate practical implications: for materials property prediction tasks dominated by structural effects, specialized geometric architectures consistently outperform LLMs by significant margins, as evidenced by a clear "GNN-LM wall" in performance benchmarks. Based on our analysis, we provide concrete guidelines for architecture selection in scientific machine learning, while highlighting the critical importance of understanding model inductive biases when tackling scientific prediction problems.
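To make "encoding structural data as text" concrete, here is an illustrative toy serialization of coordinate-category data (element symbols plus fractional coordinates). The format below is a hypothetical example for intuition only; MatText's actual representations differ in detail.

```python
def structure_to_text(elements, frac_coords, digits=2):
    """Serialize (element, fractional-coordinate) pairs as plain text lines."""
    lines = [
        f"{el} {x:.{digits}f} {y:.{digits}f} {z:.{digits}f}"
        for el, (x, y, z) in zip(elements, frac_coords)
    ]
    return "\n".join(lines)

# Example: rock-salt-like toy structure
text = structure_to_text(["Na", "Cl"], [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)])
# -> "Na 0.00 0.00 0.00\nCl 0.50 0.50 0.50"
```

The categorical part (element symbols) survives tokenization as discrete tokens, while the geometric part is reduced to digit strings, which is one intuition for why coordinate information is harder for LLMs to exploit.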
