LLaMP: Large Language Model Made Powerful for High-fidelity Materials Knowledge Retrieval and Distillation

Yuan Chiang; Elvis Hsieh; Chia-Hong Chou; Janosh Riebesell

TL;DR

LLaMP addresses the hallucination and memory limitations of LLMs in materials science by grounding them with a multimodal retrieval-augmented framework built on hierarchical ReAct agents. It enables dynamic data access from Materials Project and other sources, supports high-fidelity material-property inferences, and orchestrates atomistic simulations via ML force fields without fine-tuning. The paper introduces a self-consistency metric (SCoR) to assess response reliability, reports better performance than baselines on bulk moduli, electronic bandgaps, and formation energies, and demonstrates practical workflows for synthesis, structure editing, and simulations. The approach offers a scalable pathway to knowledge distillation and broader adoption of reliable language-driven materials informatics.

Abstract

Reducing hallucination of Large Language Models (LLMs) is imperative for use in the sciences, where reliability and reproducibility are crucial. However, LLMs inherently lack long-term memory, making it a nontrivial, ad hoc, and inevitably biased task to fine-tune them on domain-specific literature and data. Here we introduce LLaMP, a multimodal retrieval-augmented generation (RAG) framework of hierarchical reasoning-and-acting (ReAct) agents that can dynamically and recursively interact with computational and experimental data on Materials Project (MP) and run atomistic simulations via a high-throughput workflow interface. Without fine-tuning, LLaMP demonstrates strong tool usage ability to comprehend and integrate various modalities of materials science concepts, fetch relevant data stores on the fly, process higher-order data (such as crystal structures and elastic tensors), and streamline complex tasks in computational materials and chemistry. We propose a simple metric combining uncertainty and confidence estimates to evaluate the self-consistency of responses by LLaMP and vanilla LLMs. Our benchmark shows that LLaMP effectively mitigates the intrinsic bias in LLMs, counteracting the errors on bulk moduli, electronic bandgaps, and formation energies that seem to derive from mixed data sources. We also demonstrate LLaMP's capability to edit crystal structures and run annealing molecular dynamics simulations using pre-trained machine-learning force fields. The framework offers an intuitive and nearly hallucination-free approach to exploring and scaling materials informatics, and establishes a pathway for knowledge distillation and fine-tuning other language models. Code and a live demo are available at https://github.com/chiang-yuan/llamp
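
The two-level agent hierarchy described in the abstract can be illustrated with a small toy sketch. This is not the LLaMP implementation: the class names `Assistant` and `Supervisor`, the helper `build_demo_supervisor`, the keyword lists, and the stub tools are all assumptions made for illustration, and simple keyword matching stands in for the LLM reasoning that would decide which assistant to call. Only the assistant names MPThermoExpert and MPElasticityExpert are taken from the paper's figure captions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Assistant:
    """Bottom-level agent: owns one tool and handles one kind of request."""
    name: str
    keywords: List[str]            # hypothetical routing hints (stand-in for LLM reasoning)
    tool: Callable[[str], str]     # stub tool; a real agent would call a data API

    def can_handle(self, query: str) -> bool:
        q = query.lower()
        return any(k in q for k in self.keywords)

    def run(self, query: str) -> str:
        return self.tool(query)


@dataclass
class Supervisor:
    """Top-level agent: routes a query to every assistant that can handle it."""
    assistants: List[Assistant]

    def run(self, query: str) -> Dict[str, str]:
        return {a.name: a.run(query) for a in self.assistants if a.can_handle(query)}


def build_demo_supervisor() -> Supervisor:
    # Stub tools return placeholder strings, not real materials data.
    thermo = Assistant(
        name="MPThermoExpert",
        keywords=["formation energy", "thermo"],
        tool=lambda q: f"[thermo lookup stub for: {q}]",
    )
    elasticity = Assistant(
        name="MPElasticityExpert",
        keywords=["bulk modulus", "elastic"],
        tool=lambda q: f"[elasticity lookup stub for: {q}]",
    )
    return Supervisor(assistants=[thermo, elasticity])


if __name__ == "__main__":
    supervisor = build_demo_supervisor()
    answers = supervisor.run("What are the bulk modulus and formation energy of Si?")
    for agent_name, answer in answers.items():
        print(agent_name, "->", answer)
```

The point of the split is that each assistant only needs to know its own toolkit, while the supervisor only needs to know which assistants exist; a query touching several properties fans out to several assistants and the supervisor collects their answers.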

Paper Structure (14 sections, 4 equations, 5 figures, 4 tables)

Introduction
Background
Related Work
Method
Hierarchical orchestration
Self-consistency of response (SCoR)
Experiments
Multimodal ReAct Augmentation
Performance Benchmarks
Real-world Applications
Discussion
Supplementary Information
List of Implemented Assistant Agents and Tools
Prompt Template

Figures (5)

Figure 1: Hierarchical ReAct agent planning in LLaMP. Two levels of agents are deployed using a standardized LangChain interface. A supervisor ReAct agent oversees assistant ReAct agents at the bottom level, each equipped with distinct toolkits and data/document stores to accomplish various tasks, including high-fidelity materials information retrieval, atomistic modeling and simulations, and literature search. For a detailed example, refer to Figure A.1.
Figure 2: LLaMP RAG responses, baseline methods, and LLM intrinsic knowledge on material properties. (a) Bulk moduli, $K$, of 3d transition metals. (b) Formation energies, $\Delta H_f$, of common compounds. (c) Electronic bandgaps, $E_g$, of common intrinsic semiconductors. (d) Electronic bandgaps of multi-element (ternary or quaternary) materials. Missing predictions are marked by shaded areas. Outliers (fliers) are marked with circles. All LLaMP results use GPT-4 as the backend language model.
Figure 3: Predictions of LLaMP, GPT-3.5, and GPT-4 on (a,b,d,e) magnetic orderings and (c,f) total magnetization per formula unit of randomly selected materials. Each confusion matrix shows the number of entries in each class. The colormap represents the percentage of correct classifications.
Figure 4: Generation and manipulation of crystal structures using LLMs to insert an additional lithium atom at the interstitial site in the diamond cubic silicon structure. Blue: Si. Green: Li. Question-answer pairs are listed in Table \ref{ex:Si-Li-interstitial}. Additional atoms extended through bonds are visualized.
Figure A.1: Multimodal retrieval-augmented generation for materials informatics. (a) User query. (b) Supervisor ReAct agent capable of handling multiple assistant agents and high-level reasoning. (c-d) Assistant ReAct agents executing function calling and summarization. (c) MPThermoExpert and (d) MPElasticityExpert have access to the API schemas of the thermo and elasticity endpoints on Materials Project, respectively. Selected details are highlighted in red, demonstrating the capabilities of RAG and ReAct implemented in LLaMP. The blue text shows that a LLaMP assistant ReAct agent can handle API call errors and self-correct the input query accordingly.
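
The error handling mentioned in the Figure A.1 caption, where an assistant agent recovers from a failed API call by correcting its query, can be sketched as a retry loop. This is a hedged illustration under stated assumptions rather than LLaMP's code: `ALLOWED_FIELDS`, `SchemaError`, `mock_endpoint`, and `call_with_self_correction` are hypothetical names, the endpoint is a local mock rather than the Materials Project API, and fuzzy string matching from the standard library stands in for the LLM reading the error message and rewriting the request.

```python
import difflib
from typing import Dict, List

# Hypothetical schema for a mock endpoint (not the real Materials Project schema).
ALLOWED_FIELDS: List[str] = ["material_id", "formula_pretty", "bulk_modulus"]


class SchemaError(ValueError):
    """Raised by the mock endpoint when a request uses an unknown field."""

    def __init__(self, bad_field: str, allowed: List[str]):
        super().__init__(f"unknown field {bad_field!r}; allowed: {allowed}")
        self.bad_field = bad_field
        self.allowed = allowed


def mock_endpoint(params: Dict[str, str]) -> Dict[str, str]:
    """Validate field names, then echo the accepted fields back."""
    for key in params:
        if key not in ALLOWED_FIELDS:
            raise SchemaError(key, ALLOWED_FIELDS)
    return {"status": "ok", "echo": str(sorted(params))}


def call_with_self_correction(params: Dict[str, str], max_retries: int = 2) -> Dict[str, str]:
    """Call the endpoint; on a schema error, repair the offending field and retry."""
    attempt = dict(params)  # work on a copy so the caller's dict is untouched
    for _ in range(max_retries + 1):
        try:
            return mock_endpoint(attempt)
        except SchemaError as err:
            # Stand-in for the LLM step: pick the closest valid field name.
            match = difflib.get_close_matches(err.bad_field, err.allowed, n=1)
            if not match:
                raise  # nothing plausible to try; surface the original error
            attempt[match[0]] = attempt.pop(err.bad_field)
    raise RuntimeError("gave up after retries")


if __name__ == "__main__":
    # "bulk_moduli" is not in the schema; the loop renames it and retries.
    print(call_with_self_correction({"bulk_moduli": "Si"}))
```

The design choice worth noting is that the error carries the list of valid fields, so the caller has enough information to repair the request instead of failing outright; in an agent setting the same information would be fed back to the language model as an observation.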
