LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction

Andre Niyongabo Rubungo, Kangming Li, Jason Hattrick-Simpers, Adji Bousso Dieng

TL;DR

This work addresses the lack of standardized benchmarks for LLM-based materials property prediction by introducing LLM4Mat-Bench, a large, multi-source dataset with ~2.7 million structure files and 1,978,985 composition-description samples across 45 properties, spanning three input modalities: composition, CIF, and crystal text description. It provides fixed train/validation/test splits and evaluates models ranging from small task-specific LLMs to large conversational LLMs, the latter with zero-shot and few-shot prompts. Small, task-specific predictive LLMs (e.g., LLM-Prop, MatBERT) generally outperform larger general-purpose LLMs, especially when text descriptions are the input, and current general-purpose LLMs show significant limitations in predicting materials properties accurately. The study underscores the need for task-specific predictive models, task-specific instruction-tuned LLMs, and standardized benchmarks to make materials property prediction and discovery more reliable, and it informs future dataset design and model specialization.

Abstract

Large language models (LLMs) are increasingly being used in materials science. However, little attention has been given to benchmarking and standardized evaluation for LLM-based materials property prediction, which hinders progress. We present LLM4Mat-Bench, the largest benchmark to date for evaluating the performance of LLMs in predicting the properties of crystalline materials. LLM4Mat-Bench contains about 1.9M crystal structures in total, collected from 10 publicly available materials data sources, and 45 distinct properties. LLM4Mat-Bench features different input modalities: crystal composition, CIF, and crystal text description, with 4.7M, 615.5M, and 3.1B tokens in total for each modality, respectively. We use LLM4Mat-Bench to fine-tune models with different sizes, including LLM-Prop and MatBERT, and provide zero-shot and few-shot prompts to evaluate the property prediction capabilities of LLM-chat-like models, including Llama, Gemma, and Mistral. The results highlight the challenges of general-purpose LLMs in materials science and the need for task-specific predictive models and task-specific instruction-tuned LLMs in materials property prediction.
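As a rough illustration of the zero-shot versus few-shot distinction mentioned above, the sketch below assembles a prompt from a chemical formula and an optional list of solved examples. The function name `build_prompt`, the instruction wording, and the example values are all hypothetical; this is not the prompt template used in the paper (those appear in its Figures 4 and 5).

```python
def build_prompt(formula, property_name, examples=()):
    """Assemble a zero-shot (no examples) or few-shot (k examples) prompt.

    The wording is an assumption made for illustration only; it is not the
    benchmark's actual template.
    """
    lines = [
        f"Predict the {property_name} of the material below. "
        "Answer with a number only.",
        "",
    ]
    # Each solved example becomes one demonstration block (the "shots").
    for ex_formula, ex_value in examples:
        lines.append(f"Chemical formula: {ex_formula}")
        lines.append(f"{property_name}: {ex_value}")
        lines.append("")
    # The query is formatted like a demonstration with the answer left blank.
    lines.append(f"Chemical formula: {formula}")
    lines.append(f"{property_name}:")
    return "\n".join(lines)


# Placeholder demonstrations, not values taken from the benchmark.
demo_examples = [("A2B", 0.0), ("CD3", 1.0)]
zero_shot_prompt = build_prompt("NaCl", "band gap (eV)")
few_shot_prompt = build_prompt("NaCl", "band gap (eV)", demo_examples)
```

With an empty `examples` list the model sees only the instruction and the query; a 5-shot prompt would pass five demonstration pairs drawn from the training split.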

Paper Structure

This paper contains 23 sections, 2 equations, 6 figures, 28 tables.

Figures (6)

  • Figure 1: Performance comparison across models for each material representation. The left y-axis shows the log-normalized performance of each LLM-based model relative to the baseline (CGCNN), while the right y-axis (bar plots) displays the average number of subword tokens per sample for each dataset. Datasets on the x-axis are ordered by increasing average subword tokens. Results for some chat-like models are missing in each subplot because they produced invalid outputs on at least one property. Higher values in the line plots indicate better performance. Panels (a), (b), and (c) show the comparison when the input is a chemical composition, a CIF, and a structure description, respectively.
  • Figure 2: Performance comparison across material representations for each LLM-based model. The y-axis shows the log-normalized Weighted Average (MAD:MAE) score for each representation, while the x-axis displays randomly ordered datasets. In plots (a)-(d), some Composition and Structure results are missing due to invalid outputs. A higher y-axis value indicates better performance. Panels (a) to (f) show the results for Llama 2-7b-chat:0S, Llama 2-7b-chat:5S, Mistral 7b-Instruct-v0.1:5S, Gemma 2-9b-it:5S, MatBERT, and LLM-Prop, respectively.
  • Figure 3: The performance comparison of different chat-based LLM versions is presented with results based on 5-shot prompts, averaged over three inference runs. Panels (a)–(c) and (d)–(f) show each model's accuracy in predicting band gaps and stability in the MP dataset, respectively, while panels (g)–(i) and (j)–(l) depict the percentage of valid predictions for band gap and stability on the test set.
  • Figure 4: Prompt templates when the input is a chemical formula.
  • Figure 5: Prompt templates when the input is a CIF file.
  • ...and 1 more figures
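
The MAD:MAE score named in the Figure 2 caption can be sketched as follows, assuming the common definition: the mean absolute deviation (MAD) of the true values around their mean, divided by the mean absolute error (MAE) of the predictions. A ratio of 1 corresponds to always predicting the mean, and larger is better. The helper names and the choice of weights in `weighted_average` are assumptions; the captions do not specify how the weighted average or the log normalization is computed.

```python
from statistics import fmean


def mad_mae_ratio(y_true, y_pred):
    """Return MAD(y_true) / MAE(y_true, y_pred); higher is better."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    mean_true = fmean(y_true)
    # MAD: average distance of the targets from their own mean.
    mad = fmean(abs(y - mean_true) for y in y_true)
    # MAE: average distance of the predictions from the targets.
    mae = fmean(abs(y - p) for y, p in zip(y_true, y_pred))
    if mae == 0:
        return float("inf")  # perfect predictions
    return mad / mae


def weighted_average(scores, weights):
    """Weighted mean of per-property scores.

    Assumption: weights could be, for example, test-set sizes per property.
    """
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(s * w for s, w in zip(scores, weights)) / total
```

For instance, with targets `[0, 1, 2, 3]` the MAD is 1.0, so predictions that are each off by 0.5 give a ratio of 2.0, while predicting the mean 1.5 everywhere gives exactly 1.0.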