LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction
Andre Niyongabo Rubungo, Kangming Li, Jason Hattrick-Simpers, Adji Bousso Dieng
TL;DR
This work addresses the lack of standardized benchmarks for LLM-based materials property prediction by introducing LLM4Mat-Bench, a large, multi-source dataset with ~2.7 million structure files and 1,978,985 composition-description samples across 45 properties, spanning three input modalities: composition, CIF, and crystal text description. It provides fixed train/validation/test splits and evaluates models ranging from small task-specific LLMs to large conversational LLMs, the latter with zero-shot and few-shot prompts. The results show that small, task-specific predictive LLMs (e.g., LLM-Prop, MatBERT) generally outperform larger general-purpose LLMs, especially when text descriptions are used as input, and reveal significant limitations of current general-purpose LLMs in predicting materials properties accurately. The study underscores the need for task-specific predictive models, task-specific instruction-tuned LLMs, and standardized benchmarks to accelerate reliable materials property prediction and discovery.
Abstract
Large language models (LLMs) are increasingly being used in materials science. However, little attention has been given to benchmarking and standardized evaluation for LLM-based materials property prediction, which hinders progress. We present LLM4Mat-Bench, the largest benchmark to date for evaluating the performance of LLMs in predicting the properties of crystalline materials. LLM4Mat-Bench contains about 1.9M crystal structures in total, collected from 10 publicly available materials data sources, and 45 distinct properties. LLM4Mat-Bench features different input modalities: crystal composition, CIF, and crystal text description, with 4.7M, 615.5M, and 3.1B tokens in total for each modality, respectively. We use LLM4Mat-Bench to fine-tune models of different sizes, including LLM-Prop and MatBERT, and provide zero-shot and few-shot prompts to evaluate the property prediction capabilities of chat-style LLMs, including Llama, Gemma, and Mistral. The results highlight the challenges that general-purpose LLMs face in materials science and the need for task-specific predictive models and task-specific instruction-tuned LLMs in materials property prediction.
