Discovery and recovery of crystalline materials with property-conditioned transformers

Cyprien Bone, Matthew Walker, Kuangdai Leng, Luis M. Antunes, Ricardo Grau-Crespo, Amil Aligayev, Javier Dominguez, Keith T. Butler

TL;DR

CrystaLLM-$\pi$ tackles the challenge of conditioning crystal-structure generation on continuous properties by introducing Property-Key-Value (PKV) Prefix and PKV Residual attention. The approach injects continuous property representations directly into transformer attention, preserving pre-trained crystallographic knowledge while enabling inverse design for structure recovery from XRD data and discovery of photovoltaic materials, validated by DFT. Key findings show that attention-level conditioning yields robust performance across data regimes, that XRD-conditioned generation accelerates structure recovery, and that PV-focused generation can implicitly target optimal band-gap regions with compositionally novel candidates. The framework offers a lightweight, flexible pathway for inverse materials design, capable of leveraging large unlabeled datasets and targeted labeled data to map complex structure-property relationships.

Abstract

Generative models have recently shown great promise for accelerating the design and discovery of new functional materials. Conditional generation enhances this capacity by allowing inverse design, where specific desired properties can be requested during the generation process. However, conditioning of transformer-based approaches, in particular, is constrained by discrete tokenisation schemes and the risk of catastrophic forgetting during fine-tuning. This work introduces CrystaLLM-π (property injection), a conditional autoregressive framework that integrates continuous property representations directly into the transformer's attention mechanism. Two architectures, Property-Key-Value (PKV) Prefix attention and PKV Residual attention, are presented. These methods bypass inefficient sequence-level tokenisation and preserve foundational knowledge from unsupervised pre-training on Crystallographic Information Files (CIFs) as textual input. We establish the efficacy of these mechanisms through systematic robustness studies and evaluate the framework's versatility across two distinct tasks. First, for structure recovery, the model processes high-dimensional, heterogeneous X-ray diffraction patterns, achieving structural accuracy competitive with specialised models and demonstrating applications to experimental structure recovery and polymorph differentiation. Second, for materials discovery, the model is fine-tuned on a specialised photovoltaic dataset to generate novel, stable candidates validated by Density Functional Theory (DFT). It implicitly learns to target optimal band gap regions for high photovoltaic efficiency, demonstrating a capability to map complex structure-property relationships. CrystaLLM-π provides a unified, flexible, and computationally efficient framework for inverse materials design.
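As an illustration of attention-level property injection, the following is a minimal single-head NumPy sketch, not the authors' implementation. It assumes that PKV Prefix attention prepends property-derived key/value slots that every token may attend to, and that PKV Residual attention adds an $\alpha$-scaled attention over those slots to the ordinary causal attention. The linear property encoder (`W_k`, `W_v`), the number of property slots `P`, and all shapes are hypothetical choices made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(T):
    # True where query position i may attend to key position j (j <= i).
    return np.tril(np.ones((T, T), dtype=bool))

def attention(q, k, v, mask=None):
    # q: (T, d); k, v: (S, d); mask: (T, S) boolean or None.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def pkv_prefix_attention(q, k, v, k_p, v_p):
    # Assumed form: property keys/values are prepended to the sequence
    # keys/values and are visible to every query position.
    T, P = q.shape[0], k_p.shape[0]
    k_all = np.concatenate([k_p, k], axis=0)
    v_all = np.concatenate([v_p, v], axis=0)
    mask = np.concatenate([np.ones((T, P), dtype=bool), causal_mask(T)], axis=1)
    return attention(q, k_all, v_all, mask)

def pkv_residual_attention(q, k, v, k_p, v_p, alpha):
    # Assumed form: ordinary causal attention plus an alpha-scaled
    # attention over the property slots only.
    T = q.shape[0]
    base = attention(q, k, v, causal_mask(T))
    cond = attention(q, k_p, v_p)
    return base + alpha * cond

# Toy inputs (all sizes are hypothetical).
rng = np.random.default_rng(0)
T, d, P, m = 5, 8, 2, 3            # tokens, head dim, property slots, property dim
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
props = np.array([0.2, -1.0, 0.5])  # continuous property vector

# Hypothetical linear property encoder producing P key/value slots.
W_k = rng.normal(size=(m, P * d))
W_v = rng.normal(size=(m, P * d))
k_p = (props @ W_k).reshape(P, d)
v_p = (props @ W_v).reshape(P, d)

out_prefix = pkv_prefix_attention(q, k, v, k_p, v_p)
out_residual = pkv_residual_attention(q, k, v, k_p, v_p, alpha=0.5)
out_residual_off = pkv_residual_attention(q, k, v, k_p, v_p, alpha=0.0)
```

In this sketch, setting `alpha=0.0` reduces the residual variant to plain causal attention, which shows how a scaled residual path could leave pre-trained behaviour untouched until the scale is learned. Whether the paper initialises $\alpha$ this way is not stated in the text above.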

Paper Structure

This paper contains 44 sections, 10 equations, 14 figures, 12 tables, 1 algorithm.

Figures (14)

  • Figure 1: An overview of the property-conditioned crystal structure generation framework. a. The training process optimises the model's parameters to predict the next token in a given CIF sequence. This prediction is conditioned on the previous tokens in the CIF sequence and on a learned embedding of the target functional properties $P$, which is injected into each multi-head attention layer. b. The inference process uses a prompt and a target property vector to autoregressively generate a complete CIF by iteratively sampling and appending the next predicted token.
  • Figure 2: Prefix and Residual Conditioning method architectures. The Prefix and Residual Encoders (see Appendix \ref{app:encoders} for details) transform the property condition vector into an embedding of the same size as the attention's Key-Value tensors. These embeddings are injected directly into the attention layers of the transformer, where the modified mechanism attends to both the conditions and the sequence, enabling property-aware generation of CIFs. $N$ represents the number of decoder blocks, and $h$ is the number of attention heads. $\alpha$ is the learned Residual attention scaling parameter from Equation \ref{eq:Residual_attention}.
  • Figure 3: Effect of starting from a pre-trained model, versus not, on various conditioning methods, with respect to the following metrics: Hit-Rate against the desired band gap (a), number of valid structures (b), and generation quality (c). Metrics are compared for each method with 1K generation attempts at different conditioning targets (see Appendix \ref{app:experimental_protocols} for exact target values), labelled with markers. The generation quality ($Q_{\mathrm{VSUN}}$) combines the Validity, Stability, Uniqueness and Novelty (VSUN) metrics and is true only if all four are satisfied (Structural Novelty is measured here; see Methods \ref{sec:eval_metrics} for additional details). The icons to the right of each plot represent the mean fraction for each conditioning method over all targets, indicating overall performance. For each of the sub-panels, the frequency of band gaps seen during training is presented in grey.
  • Figure 4: Effect of dataset size on density-conditioned generation capabilities across three different methods and four different dataset sizes. Each sub-panel represents a different conditioning target (dotted line parallel to the $y$-axis), and each curved line shows the distribution of structure density for the generated materials given a target, method and dataset size. Icons above the lines represent mean density ($x$-axis) and total valid count ($y$-axis) for each condition. For each of the sub-panels, the frequency of density values seen during training is presented in grey.
  • Figure 5: Evaluation of CrystaLLM-$\pi$ for crystal structure recovery from experimental XRD data. (Top) Scatter plots display predicted (from generated structures) versus true (experimental structure) lattice parameters ($a$, $b$, $c$) and cell volume. The analysis compares model performance when conditioned on XRD data during generation (A) versus without conditioning (B). Marker colour distinguishes between matched structures (green) and non-matched structures (orange). Marker style indicates the system complexity of the experimental structures, ranging from binary to senary. The plotted points represent the best-matching structure from '20-consistent' generation attempts, with results aggregated from three identical and independent experiment repeats. (Bottom) Quantitative performance metrics are compared for models with and without XRD conditioning. Performance is assessed for both the '1-perplexity' metric and '20-consistent' generation attempts. Additionally, the computational time required to generate '20-consistent' structures for the 198 materials in the test set is reported, with an inference batch size of 20. All results are averaged over three identical and independent experiments to ensure robustness, and the associated standard errors are reported.
  • ...and 9 more figures
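
The inference process described for Figure 1b (a prompt plus a target property vector, followed by repeated sampling and appending of the next token) can be sketched as a short loop. This is an illustrative sketch only: the vocabulary, the `toy_next_token_logits` stand-in for the conditioned transformer, and the temperature, seed and length parameters are all hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical toy vocabulary; a real CIF tokeniser would be far larger.
VOCAB = ["<eos>", "data_", "Na", "Cl", "_cell_length_a", "5.6", "\n"]
EOS = 0

# Fixed random weights standing in for a trained, property-conditioned model.
_rng = np.random.default_rng(42)
W_tok = _rng.normal(size=(len(VOCAB), len(VOCAB)))
W_prop = _rng.normal(size=(2, len(VOCAB)))

def toy_next_token_logits(tokens, prop_vec):
    # Hypothetical stand-in: logits depend on the last token and on the
    # property vector. A real model would attend over the full sequence.
    return W_tok[tokens[-1]] + np.asarray(prop_vec) @ W_prop

def generate(prompt, prop_vec, logits_fn, max_new_tokens=32,
             temperature=1.0, seed=0):
    # Autoregressive sampling: score, sample, append, stop at <eos>.
    rng = np.random.default_rng(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = logits_fn(tokens, prop_vec) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        nxt = int(rng.choice(len(probs), p=probs))
        tokens.append(nxt)
        if nxt == EOS:
            break
    return tokens

# Prompt with the "data_" token and a two-component target property vector.
sample = generate([1], [1.5, 0.0], toy_next_token_logits, max_new_tokens=10)
text = "".join(VOCAB[t] for t in sample)
```

The property vector enters only through `logits_fn`, so the same loop applies regardless of which conditioning mechanism the model uses internally.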