CrystalFormer-RL: Reinforcement Fine-Tuning for Materials Design

Zhendong Cao, Lei Wang

TL;DR

CrystalFormer-RL addresses the challenge of designing crystalline materials with multiple, potentially conflicting properties by combining a crystal-generative model with discriminative reward signals. The authors implement reinforcement fine-tuning, inspired by RLHF, using machine learning interatomic potentials (MLIPs) and property predictors as surrogate rewards to steer CrystalFormer toward stable structures and targeted figures of merit (FoM). They demonstrate substantial gains in stability (a higher fraction of materials with $E_{\mathrm{hull}}<0.1$ eV/atom) and in property-guided discovery (high-FoM materials with large $E_g$ and $\varepsilon_{\mathrm{elec}}$), while preserving computational efficiency. The work illustrates a flexible, plug-and-play framework for materials design that leverages existing discriminative models to guide generative search and material retrieval.

Abstract

Reinforcement fine-tuning has played an instrumental role in enhancing the instruction-following and reasoning abilities of large language models. In this work, we employ reinforcement fine-tuning for materials design, in which discriminative machine learning models provide rewards to the autoregressive transformer-based materials generative model CrystalFormer. By optimizing reward signals, such as the energy above the convex hull and material property figures of merit, reinforcement fine-tuning infuses knowledge from discriminative models into generative models. The resulting model, CrystalFormer-RL, shows enhanced stability in generated crystals and successfully discovers crystals with desirable yet conflicting material properties, such as a substantial dielectric constant and band gap simultaneously. Notably, we observe that reinforcement fine-tuning not only enables property-guided material design but also unlocks property-based material retrieval behavior of the pretrained generative model. The present framework opens an exciting gateway to the synergies of the machine learning ecosystem for materials design.
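
The idea of optimizing a reward while staying close to a pretrained model can be illustrated on a toy discrete distribution. The sketch below is not the authors' implementation: it assumes the standard RLHF-style objective (expected reward minus $\tau$ times the KL divergence to the base model), assumes the reward is the negative energy above the convex hull, and uses made-up probabilities and energies for three hypothetical candidates. Under those assumptions the maximizer has a closed form, $p^*(x) \propto p_{\mathrm{base}}(x)\exp(r(x)/\tau)$.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def regularized_objective(policy, base, rewards, tau):
    """Expected reward under `policy` minus tau * KL(policy || base)."""
    expected_reward = sum(pi * ri for pi, ri in zip(policy, rewards))
    return expected_reward - tau * kl_divergence(policy, base)


def tilted_policy(base, rewards, tau):
    """Closed-form maximizer: p*(x) proportional to base(x) * exp(r(x) / tau)."""
    shift = max(rewards)  # subtract the max reward for numerical stability
    weights = [bi * math.exp((ri - shift) / tau) for bi, ri in zip(base, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


# Toy setup (all numbers are illustrative assumptions, not data from the paper).
base = [0.5, 0.3, 0.2]           # base-model probabilities of three candidates
e_hull = [0.30, 0.08, 0.02]      # hypothetical energies above hull, eV/atom
rewards = [-e for e in e_hull]   # assumed reward: lower E_hull is better
tau = 0.1                        # regularization coefficient

policy = tilted_policy(base, rewards, tau)
print("fine-tuned policy:", [round(p, 3) for p in policy])
print("objective (base):  ", regularized_objective(base, base, rewards, tau))
print("objective (policy):", regularized_objective(policy, base, rewards, tau))
```

A smaller `tau` lets the policy concentrate on the highest-reward candidates, while a larger `tau` keeps it closer to the base distribution. In practice the sample space is far too large to enumerate, so the objective is optimized with sampled structures rather than in closed form.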

Paper Structure

This paper contains 12 sections, 5 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: (a) The reinforcement fine-tuning workflow. Machine learning interatomic potential or property prediction models provide rewards to the material generated by CrystalFormer. The RL training loop updates the parameters of a pre-trained CrystalFormer to maximize the objective function in Eq. (\ref{eq:rlhf}). (b) The reinforcement fine-tuned model deviates from the base model to maximize the expected reward with entropy regularization.
  • Figure 2: (a) The average energy above the convex hull of generated materials and (b) the KL divergence between the policy and the base model versus training steps. We set the regularization coefficient $\tau=0.1$.
  • Figure 3: The histogram of energy above the convex hull for crystal samples from the pre-trained base model and for the same samples relaxed by the Orb model \cite{neumann2024orb}. In comparison, reinforcement fine-tuning of the model significantly reduces the energy above the convex hull of generated materials. The red dashed line indicates the threshold of 0.1 eV/atom for stable materials. Relaxation raises the ratio of stable materials from 44.7% in the base model to 57.8%, while RL fine-tuning improves the ratio to 73.4% even without relaxation.
  • Figure 4: Fraction of generated structures that are stable, unique, and novel (S.U.N.) for 14 space groups spanning the seven crystal systems. For each space group, two side-by-side bars correspond to samples from the pre-trained ("Base") and RL fine-tuned ("RL") models. The S.U.N. structures form a subset of the stable materials in each space group. The shaded regions separate different crystal systems.
  • Figure 5: The weighted S.U.N. ratio, averaged over all space groups, for CrystalFormer trained on the MP-20 dataset and on the Alex-20 dataset, compared with models fine-tuned with the SFT and RL approaches. The energy above the convex hull is calculated with respect to the Alexandria convex hull, and novelty is evaluated against the Alex-20 dataset.
  • ...and 5 more figures
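
The stability and S.U.N. metrics quoted in the captions above can be made concrete with a small sketch. It assumes the 0.1 eV/atom threshold from the Figure 3 caption and defines the S.U.N. ratio as the number of distinct stable structures absent from a reference set, divided by the number of generated samples. The string fingerprints stand in for a real structure-equivalence check, and every sample value is hypothetical; this is not the authors' evaluation code.

```python
def stable_fraction(e_hull_values, threshold=0.1):
    """Fraction of samples whose energy above hull (eV/atom) is below `threshold`."""
    if not e_hull_values:
        raise ValueError("need at least one sample")
    return sum(1 for e in e_hull_values if e < threshold) / len(e_hull_values)


def sun_fraction(samples, reference_fingerprints, threshold=0.1):
    """Fraction of samples that are stable, unique, and novel (S.U.N.).

    `samples` is a list of (fingerprint, e_hull) pairs. The fingerprint is a
    hypothetical hashable stand-in for a structure-matching procedure.
    """
    if not samples:
        raise ValueError("need at least one sample")
    seen = set()
    count = 0
    for fingerprint, e_hull in samples:
        if fingerprint in seen:
            continue  # not unique: duplicate of an earlier sample
        seen.add(fingerprint)
        if e_hull >= threshold:
            continue  # not stable
        if fingerprint in reference_fingerprints:
            continue  # not novel: already in the reference dataset
        count += 1
    return count / len(samples)


# Hypothetical generated samples: (fingerprint, energy above hull in eV/atom).
samples = [("A", 0.02), ("A", 0.02), ("B", 0.05), ("C", 0.25), ("D", 0.09)]
reference = {"B"}  # hypothetical fingerprints of known training structures

stable = stable_fraction([e for _, e in samples])
sun = sun_fraction(samples, reference)
print(f"stable fraction: {stable:.2f}, S.U.N. fraction: {sun:.2f}")
```

Because every S.U.N. structure must also be stable, the S.U.N. fraction can never exceed the stable fraction, which matches the subset relation noted in the Figure 4 caption.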