Materials science in the era of large language models: a perspective

Ge Lei; Ronan Docherty; Samuel J. Cooper

Abstract

Large Language Models (LLMs) have garnered considerable interest due to their impressive natural language capabilities, which in conjunction with various emergent properties make them versatile tools in workflows ranging from complex code generation to heuristic finding for combinatorial problems. In this paper we offer a perspective on their applicability to materials science research, arguing that their ability to handle ambiguous requirements across a range of tasks and disciplines means they could be a powerful tool to aid researchers. We qualitatively examine basic LLM theory, connecting it to relevant properties and techniques in the literature, before providing two case studies that demonstrate their use in task automation and knowledge extraction at scale. At their current stage of development, we argue LLMs should be viewed less as oracles of novel insight and more as tireless workers that can accelerate and unify exploration across domains. It is our hope that this paper can familiarise materials science researchers with the concepts needed to leverage these tools in their own research.

Paper Structure (26 sections, 1 equation, 10 figures)

Introduction
LLM theory: from attention to ChatGPT
Attention and transformers
Pretraining and language modelling
Aligning outputs via RLHF
Capabilities of LLMs in research
LLM properties: intrinsic and emergent
Resulting workflows
LLM workflows in materials science: two case studies
Case study 1: automated 3D microstructure analysis
Case study 2: labelled microstructure dataset collection
Issues and challenges
Conclusion
Extended LLM theory
Attention
...and 11 more sections

Figures (10)

Figure 1: A multi-scale diagram of an LLM. (a) shows an attention map for an example sentence; note how 'Law' is strongly linked to its pronoun 'its'. (b) shows a transformer encoder layer, made up of an attention layer and a (fully-connected) feed-forward layer. Several of these encoder layers, with associated decoder layers, form an LLM in (c), which is pretrained in a self-supervised manner on a large text corpus. As shown in (d), this LLM is then fine-tuned via RLHF so that its responses better match human preferences without diverging too much from the original model. Figures (a), (b) adapted from ATTN_IS_ALL_YOU_NEED and (c), (d) adapted from H_FACE_RLHF.
Figure 2: Diagram of LLM capabilities explored in the section 'Capabilities of LLMs in research' and potential materials-science-related applications. These emergent capabilities can be combined with each other and integrated into traditional pipelines (genetic algorithms, databases, etc.) to form the different applications.
Figure 3: Diagram of the FunSearch (FUNSEARCH) evolutionary workflow, where an LLM is prompted with a problem specification and the best example heuristics from the previous iteration, and tasked with combining them to generate better candidate heuristics to solve a problem. These new heuristics are evaluated and stored in a database, and the process is repeated. This process was able to discover a new lower bound for the largest cap set in 8 dimensions. Taken from FUNSEARCH.
Figure 4: Diagram of MicroGPT's workflow, beginning with dataset collection and filtering. This is followed by tool creation and use to extract metrics from the data: the tool can be an existing one from its toolkit, such as a tortuosity calculation, or one created for the specific query.
Figure 5: t-SNE plot of the MatSciBERT (MATSCIBERT) embeddings of the 'material' label assigned by the LLM to each micrograph in the dataset based on the paper abstract and figure caption. Border colour denotes the instrument the micrograph was taken with. Similar materials are grouped together: nanoparticles in the bottom right, energy materials in the middle left and quantum dots in the bottom-left corner. Best viewed zoomed in.
...and 5 more figures
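
The attention map described in the caption of Figure 1(a) can be illustrated with a minimal sketch of scaled dot-product attention. This is not the authors' code: the token list, the two-dimensional embeddings and the function names `softmax` and `attention_map` are all hypothetical, and the embeddings are used directly as queries and keys, whereas a real transformer would first apply learned projection matrices.

```python
import math


def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Return one row of attention weights per query: softmax(q . k / sqrt(d))."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


# Hypothetical tokens and made-up 2D embeddings (assumption, for illustration only).
tokens = ["Law", "applies", "its"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

# Self-attention: the same vectors act as queries and keys.
weights = attention_map(embeddings, embeddings)

for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

Because the made-up embedding for 'its' was chosen to lie close to that of 'Law', the row for 'its' places more weight on 'Law' than on 'applies', mirroring the strong link between the two tokens described in the caption.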
