Crystal Structure Generation with Autoregressive Large Language Modeling

Luis M. Antunes, Keith T. Butler, Ricardo Grau-Crespo

TL;DR

CrystaLLM presents a CIF-based autoregressive transformer trained on millions of inorganic crystal structures to generate plausible CIFs and 3D atomic arrangements, addressing CSP's computational bottlenecks. By coupling CIF-based generation with a pre-trained energy predictor and Monte Carlo Tree Search, the approach yields higher-quality, lower-energy candidates and demonstrates generalization to unseen compositions and symmetry settings. In benchmarks against diffusion-based CSP methods, CrystaLLM achieves competitive RMSE and match rates, and uniquely supports symmetry-conditioned generation and potential fine-tuning for property prediction. The work suggests a scalable path toward accelerated materials discovery, with a publicly accessible web tool for rapid crystal structure generation and validation, while acknowledging current limitations such as disordered site occupancy and dataset heterogeneity.

Abstract

The generation of plausible crystal structures is often the first step in predicting the structure and properties of a material from its chemical composition. Quickly generating and predicting inorganic crystal structures is important for the discovery of new materials, which can target applications such as energy or electronic devices. However, most current methods for crystal structure prediction are computationally expensive, slowing the pace of innovation. Seeding structure prediction algorithms with quality generated candidates can overcome a major bottleneck. Here, we introduce CrystaLLM, a methodology for the versatile generation of crystal structures, based on the autoregressive large language modeling (LLM) of the Crystallographic Information File (CIF) format. Trained on millions of CIF files, CrystaLLM focuses on modeling crystal structures through text. CrystaLLM can produce plausible crystal structures for a wide range of inorganic compounds unseen in training, as demonstrated by ab initio simulations. The integration with predictors of formation energy permits the use of a Monte Carlo Tree Search algorithm to improve the generation of meaningful structures. Our approach challenges conventional representations of crystals, and demonstrates the potential of LLMs for learning effective 'world models' of crystal chemistry, which will lead to accelerated discovery and innovation in materials science.

Paper Structure (28 sections, 1 equation, 5 figures, 6 tables)

Figures (5)

  • Figure 1: a Core concepts in training a Large Language Model of CIF files: A CIF file (left) is converted into a sequence of symbols, through tokenization. The sequence is processed by the model, which produces a list of probability distributions over the vocabulary, for each corresponding symbol in the input. The resulting predicted probability distributions are evaluated against the target distributions (which contain the entire probability mass on the correct subsequent token), using the cross-entropy loss metric. The target tokens are the input tokens shifted one spot to the left, as the objective is to predict the next token given a sequence of preceding tokens. The tokens are categorized as CIF tags (blue), atoms (green), numeric digits (gold), and punctuation (red). Output tokens (not actually sampled during training) represent the tokens assigned the highest probability by the model. Underlined tokens represent predicted distributions assigning a relatively low probability to the correct next token. b Generation of a CIF file: First, a prompt is constructed by concatenating the symbol data_ with the desired cell composition, which is then tokenized and processed by the model. Next, a token is sampled from the predicted distribution for the upcoming token in the sequence. Finally, the sampled token is added to the accumulating contents of the CIF file. This procedure continues iteratively until a predefined terminating condition is met (e.g. two consecutive newline tokens are sampled).
  • Figure 2: a The generated cell lengths for matching structures of the test set vs. the true cell lengths, when space group is included. b The generated cell volumes for matching structures of the test set vs. either the true cell volumes, or the cell volumes implied from the generated cell parameters, when space group is included.
  • Figure 3: The generated structures of various inorganic compounds. a Ba2MnCr. Cell parameters: $a$, $b$: 3.778 Å, $c$: 27.503 Å, $\alpha$, $\beta$: 90.0°, $\gamma$: 120.0°. Color scheme: Ba: green, Mn: purple, Cr: blue. b CsCuTePt. Cell parameters: $a$, $b$, $c$: 7.153 Å, $\alpha$, $\beta$, $\gamma$: 90.0°. Color scheme: Cs: purple, Cu: blue, Te: gold, Pt: white. c YbMn6Sn6. Cell parameters: $a$, $b$: 5.488 Å, $c$: 8.832 Å, $\alpha$, $\beta$: 90.0°, $\gamma$: 120.0°. ZrMn6Sn6, in the training set, possessed the same structure, but with the following cell parameters: $a$, $b$: 5.364 Å, $c$: 8.933 Å, $\alpha$, $\beta$: 90.0°, $\gamma$: 120.0°. Color scheme: Yb: green, Mn: magenta, Sn: grey. d AuO2. Cell parameters: $a$, $b$: 4.838 Å, $c$: 3.429 Å, $\alpha$, $\beta$, $\gamma$: 90.0°. Color scheme: Au: yellow, O: red. e Sm2BS4. Cell parameters: $a$, $b$, $c$: 10.884 Å, $\alpha$, $\beta$, $\gamma$: 90.0°. Color scheme: Sm: light green, B: green, S: yellow. f KRb2TiF6. Cell parameters: $a$, $b$, $c$: 8.688 Å, $\alpha$, $\beta$, $\gamma$: 90.0°. Color scheme: K: white, Rb: purple, Ti: brown, F: green. g LiTa2NiSe5 ($a$: 3.517 Å, $b$: 13.362 Å, $c$: 15.156 Å, $Z$=4), which resembles the recently reported structure in hyde2023lithium. h Ta2NiSe5, seen in training. i NaSn2CuSe5, seen in training.
  • Figure 4: The generated vs. DFT-derived value of the cell parameter $a$ for selected pyrochlores not in the training dataset. The error bars represent the $\pm$ standard deviation of the value of the $a$ cell parameter for the three generation attempts (all of which resulted in the pyrochlore structure), while the $y$-coordinate of the points represents the mean value of the cell parameter across the three attempts. The inset represents the structure of the generated pyrochlore Pr2Mn2O7, with cell parameters $a$, $b$, $c$: 10.34 Å, $\alpha$, $\beta$, $\gamma$: 90.0°. Color scheme: Pr = yellow, Mn = purple, O = red.
  • Figure 5: Schematic depiction of the Monte Carlo Tree Search decoding procedure. CIF files are generated as a tree is iteratively constructed, with each iteration guiding the generation of subsequent structures towards more desirable parameters (e.g. lower formation energy per atom). The nodes in the tree represent the cumulative contents of a CIF file at various points. a The Selection step involves descending the tree by choosing the most promising node at each level, using a variant of the PUCT algorithm. b During Expansion, an unexplored child node is randomly selected and added to the tree. If a node has only one highly probable child (represented as empty nodes), the child node bypasses the Rollout step. c The Rollout step involves prompting the model with the contents of the selected node, and sampling from the model until a terminal condition is met, so as to obtain a complete CIF file and an estimate of the value of a node. d The generated structure is validated and scored, incorporating the prediction of the structure's formation energy per atom, as given by a pre-trained neural network. e Finally, the score is backpropagated through the selected nodes, which store the accumulated results of each iteration. The resulting generated CIF file, if valid, is returned.
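
The generation loop described in the Figure 1b caption can be illustrated with a minimal sketch. It is not the CrystaLLM implementation: `toy_model`, the pre-tokenized prompt, and the `max_new_tokens` cap are hypothetical stand-ins, and only the overall procedure (prompt with data_ plus the cell composition, sample one token at a time, stop after two consecutive newline tokens) comes from the caption.

```python
import random


def generate_cif(next_token_probs, prompt_tokens, max_new_tokens=200, rng=None):
    """Sample tokens one at a time until two consecutive newlines appear.

    next_token_probs is a hypothetical callable that maps the token list
    so far to a dict {token: probability}; it stands in for the trained model.
    """
    rng = rng or random.Random(0)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        choices, weights = zip(*probs.items())
        # Sample the next token from the predicted distribution.
        token = rng.choices(choices, weights=weights, k=1)[0]
        # Append it to the accumulating contents of the CIF file.
        tokens.append(token)
        # Terminating condition from the caption: two consecutive newline tokens.
        if tokens[-2:] == ["\n", "\n"]:
            break
    return tokens


def toy_model(tokens):
    # Placeholder distribution for illustration only, NOT a trained model.
    if tokens[-1] == "\n":
        return {"\n": 0.5, "_cell_length_a": 0.5}
    return {"\n": 0.7, "5.4": 0.3}


# Assumed tokenization of the prompt "data_" + cell composition (here Na4Cl4).
prompt = ["data_", "Na", "4", "Cl", "4"]
cif_tokens = generate_cif(toy_model, prompt)
print("".join(cif_tokens))
```

Swapping `toy_model` for a real next-token predictor would leave the loop unchanged; the cap on new tokens is an added safeguard so that sampling always terminates.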
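
The Selection and backpropagation steps in the Figure 5 caption can be sketched as follows. The caption only says that "a variant of the PUCT algorithm" is used, so this sketch assumes the common AlphaZero-style formula (mean value plus a prior-weighted exploration bonus); the `Node` class, the `c_puct` constant, and the example scores are hypothetical. Expansion, Rollout, and the energy-based scoring are omitted.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node holding the CIF contents generated up to some token."""
    prior: float                 # model probability of the token leading here
    visits: int = 0
    value_sum: float = 0.0       # accumulated scores from past iterations
    children: dict = field(default_factory=dict)  # token -> Node

    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(parent, child, c_puct=1.0):
    # Assumed standard PUCT form; the paper's variant may differ.
    exploration = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.mean_value() + exploration


def select_path(root, c_puct=1.0):
    """Selection: descend the tree, taking the highest-scoring child at each level."""
    path, node = [root], root
    while node.children:
        parent = node
        node = max(parent.children.values(),
                   key=lambda ch: puct_score(parent, ch, c_puct))
        path.append(node)
    return path


def backpropagate(path, score):
    """Store the iteration's score in every node along the selected path."""
    for node in path:
        node.visits += 1
        node.value_sum += score


# Tiny example: one well-explored child and one unexplored, high-prior child.
root = Node(prior=1.0, visits=10)
root.children["a"] = Node(prior=0.2, visits=5, value_sum=4.0)
root.children["b"] = Node(prior=0.8)
path = select_path(root)
backpropagate(path, score=0.5)  # placeholder score, e.g. higher for lower energy
```

In this example the unexplored child "b" is selected because its exploration bonus outweighs the higher mean value of "a"; as visits accumulate, the bonus shrinks and the mean value dominates.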