Invariant Tokenization of Crystalline Materials for Language Model Enabled Generation

Keqiang Yan, Xiner Li, Hongyi Ling, Kenna Ashen, Carl Edwards, Raymundo Arróyave, Marinka Zitnik, Heng Ji, Xiaofeng Qian, Xiaoning Qian, Shuiwang Ji

TL;DR

This work tackles crystalline material generation with language models by addressing a core obstacle: sequence representations derived from 3D structures are generally neither unique nor invariant. It introduces Mat2Seq, a pipeline that first obtains an $SO(3)$-equivariant, periodic-invariant unit cell via Niggli reduction, then maps it to an $SE(3)$-invariant and complete 1D sequence encoding lattice parameters, space-group information, and atomic coordinates. By enforcing uniqueness, Mat2Seq reduces the need for data augmentation during LM training and supports conditional generation for target compositions and properties. Experiments show performance competitive with state-of-the-art methods, and the model can generate recently discovered crystals from the literature and carry out band-gap-targeted design. The approach points toward scalable, LM-assisted discovery of novel crystalline materials with desired properties; the paper also outlines limitations and possible extensions to broader atomic systems and disordered materials.

Abstract

We consider the problem of crystal materials generation using language models (LMs). A key step is to convert 3D crystal structures into 1D sequences to be processed by LMs. Prior studies used the crystallographic information framework (CIF) file stream, which fails to ensure SE(3) and periodic invariance and may not lead to unique sequence representations for a given crystal structure. Here, we propose a novel method, known as Mat2Seq, to tackle this challenge. Mat2Seq converts 3D crystal structures into 1D sequences and ensures that different mathematical descriptions of the same crystal are represented in a single unique sequence, thereby provably achieving SE(3) and periodic invariance. Experimental results show that, with language models, Mat2Seq achieves promising performance in crystal structure generation as compared with prior methods.

Paper Structure

This paper contains 20 sections, 4 theorems, 6 equations, 4 figures, 11 tables.

Key Result

Lemma 1

A sequence mapping function $f:(\mathbf{A},\mathbf{P}, \mathbf{L}) \to \mathcal{X}$ is unique if $f$ is periodic invariant and unit cell $SE(3)$ invariant.
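
To make the invariance requirement more concrete, here is a minimal sketch (not the authors' implementation) of one ingredient a sequence mapping can rely on: the lattice lengths and angles stay the same when the unit cell is rotated. The helper names `lattice_parameters` and `rotate_z`, the example cell, and the rotation angle are illustrative assumptions.

```python
import math

def lattice_parameters(lattice):
    """Return (a, b, c, alpha, beta, gamma) for three lattice row vectors.

    Angles are in degrees: alpha is between l2 and l3, beta between l1 and l3,
    gamma between l1 and l2.
    """
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    def angle(u, v):
        cos = sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
        # Clamp to [-1, 1] to guard against floating-point round-off.
        return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

    l1, l2, l3 = lattice
    return (norm(l1), norm(l2), norm(l3),
            angle(l2, l3), angle(l1, l3), angle(l1, l2))

def rotate_z(v, theta):
    """Rotate a 3D vector about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)

# Hypothetical triclinic cell, chosen only for illustration.
cell = [(3.0, 0.0, 0.0), (1.0, 4.0, 0.0), (0.5, 0.5, 5.0)]
rotated = [rotate_z(v, 0.7) for v in cell]

params = lattice_parameters(cell)
params_rot = lattice_parameters(rotated)

# The Cartesian vectors change, but lengths and angles agree up to round-off.
same = all(math.isclose(p, q, abs_tol=1e-9) for p, q in zip(params, params_rot))
print(same)
```

The sketch covers rotation only. Periodic invariance, the second condition in the lemma, concerns the choice of unit cell and is a separate requirement.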

Figures (4)

  • Figure 1: Limitations of directly using CIF files in achieving unique crystal sequence representations. The figure shows how CIF files for the same crystal structure vary under periodic transformations, both with and without the symmetry-control setting denoted "symprec". Changes in the CIF contents are highlighted in red. Periodic transformations can significantly alter the unit cell, producing distinct CIF files that differ in fractional coordinates, atom ordering, and lattice parameters for the same underlying crystal.
  • Figure 2: The pipeline of Mat2Seq that converts 3D crystal structures into unique crystal sequences. Mat2Seq first determines $SO(3)$ equivariant and periodic invariant lattice vectors using Niggli cell reduction [shi2022niggli], then determines the primitive unit cell. After that, Mat2Seq converts the determined $SO(3)$ equivariant and periodic invariant primitive cells into $SE(3)$ and periodic invariant sequences.
  • Figure 3: Converting determined unit cells into invariant crystal sequences.
  • Figure 4: Mat2Seq can generate recently discovered novel crystals from literature. Eu$_2$FeGe$_2$OS$_6$ on the left, and Ce$_6$Cd$_{23}$Te on the right. The structure generated by Mat2Seq for Eu$_2$FeGe$_2$OS$_6$ is the reflected version of the ground truth.
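
The non-uniqueness described in the Figure 1 caption can be illustrated with a small sketch. It is neither a CIF writer nor a Niggli reduction; it only shows that an integer change of basis with determinant 1 yields a different set of lattice vectors for the same lattice, with the same cell volume but different edge lengths. The function names `change_basis`, `det3`, and `lengths`, and the example matrices, are assumptions made for illustration.

```python
import math

def change_basis(m, lattice):
    """Build new lattice row vectors as integer combinations: new_i = sum_j m[i][j] * l_j."""
    return [
        tuple(sum(m[i][j] * lattice[j][k] for j in range(3)) for k in range(3))
        for i in range(3)
    ]

def det3(m):
    """Determinant of a 3x3 matrix given as rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def lengths(lattice):
    """Edge lengths of the unit cell."""
    return tuple(math.sqrt(sum(x * x for x in v)) for v in lattice)

# Hypothetical cubic cell with edge length 3, chosen only for illustration.
cell = [(3.0, 0.0, 0.0), (0.0, 3.0, 0.0), (0.0, 0.0, 3.0)]

# Integer matrix with determinant 1: replaces l2 by l1 + l2.
change = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
alt_cell = change_basis(change, cell)

print(det3(change))                       # determinant of the basis change
print(abs(det3(cell)), abs(det3(alt_cell)))  # cell volumes
print(lengths(cell), lengths(alt_cell))   # edge lengths differ
```

Both `cell` and `alt_cell` generate the same set of lattice points, so a representation built directly from either one would encode the same crystal in two different ways. This is the ambiguity that a canonical cell choice, such as the Niggli reduction used by Mat2Seq, is meant to remove.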

Theorems & Definitions (10)

  • Definition 1: Unit Cell $SE(3)$ Invariance
  • Definition 2: Periodic Invariance
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof