Invariant Tokenization of Crystalline Materials for Language Model Enabled Generation
Keqiang Yan, Xiner Li, Hongyi Ling, Kenna Ashen, Carl Edwards, Raymundo Arróyave, Marinka Zitnik, Heng Ji, Xiaofeng Qian, Xiaoning Qian, Shuiwang Ji
TL;DR
This work addresses a central obstacle to generating crystalline materials with language models: sequence representations derived from 3D structures are generally neither unique nor invariant. It introduces Mat2Seq, a pipeline that first obtains an $SO(3)$-equivariant, periodic-invariant unit cell via Niggli reduction, then maps that cell to an $SE(3)$-invariant, complete 1D sequence encoding lattice parameters, space-group information, and atomic coordinates. Because each crystal maps to a single sequence, Mat2Seq reduces the need for data augmentation during LM training and supports conditional generation for target compositions and properties. Experiments show performance competitive with state-of-the-art methods, along with the ability to generalize to discoveries reported in the literature and to design materials with targeted band gaps. The approach is positioned as a step toward scalable, LM-assisted discovery of novel crystalline materials with desired properties, and the work also outlines limitations and avenues for future extension to broader atomic systems and disordered materials.
Abstract
We consider the problem of crystal materials generation using language models (LMs). A key step is to convert 3D crystal structures into 1D sequences to be processed by LMs. Prior studies used the crystallographic information framework (CIF) file stream, which fails to ensure SE(3) and periodic invariance and may not lead to unique sequence representations for a given crystal structure. Here, we propose a novel method, known as Mat2Seq, to tackle this challenge. Mat2Seq converts 3D crystal structures into 1D sequences and ensures that different mathematical descriptions of the same crystal are represented in a single unique sequence, thereby provably achieving SE(3) and periodic invariance. Experimental results show that, with language models, Mat2Seq achieves promising performance in crystal structure generation as compared with prior methods.
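To make the invariance issue concrete, the following is a minimal sketch, not the Mat2Seq implementation. It assumes a toy orthorhombic cell with two atoms and uses hypothetical helper names (`lattice_parameters`, `rotation_about_z`, `toy_sequence`). It shows that a rigid rotation and translation of the same crystal changes the raw Cartesian numbers a naive serialization would emit, while the lattice lengths and angles stay the same. It covers only the SE(3) part of the problem; the choice among equivalent unit cells (periodic invariance) is a separate issue that this sketch does not address.

```python
import numpy as np

def lattice_parameters(lattice):
    """Return (a, b, c, alpha, beta, gamma) for a 3x3 matrix whose rows are lattice vectors."""
    a_vec, b_vec, c_vec = lattice

    def angle(u, v):
        # Angle between two vectors, in degrees.
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    lengths = [np.linalg.norm(v) for v in (a_vec, b_vec, c_vec)]
    angles = [angle(b_vec, c_vec), angle(a_vec, c_vec), angle(a_vec, b_vec)]
    return np.array(lengths + angles)

def rotation_about_z(theta):
    """Proper rotation matrix (an element of SO(3)) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def toy_sequence(lattice):
    """Hypothetical serialization of the invariant lattice parameters only."""
    return " ".join(f"{x:.3f}" for x in lattice_parameters(lattice))

# Toy crystal (assumed for illustration): orthorhombic cell, two atoms.
lattice = np.array([[4.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 6.0]])
frac = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]])
cart = frac @ lattice  # Cartesian coordinates, one atom per row

# Apply an arbitrary rigid motion: rotate, then translate.
R = rotation_about_z(0.7)
t = np.array([1.0, -2.0, 0.5])
rot_lattice = lattice @ R.T
rot_cart = cart @ R.T + t

# Raw Cartesian coordinates differ, so a serialization built on them is not unique.
raw_changed = not np.allclose(cart, rot_cart)
# Lattice lengths and angles are unchanged by the rigid motion.
params_same = np.allclose(lattice_parameters(lattice), lattice_parameters(rot_lattice))

print("raw coordinates changed:", raw_changed)
print("lattice parameters unchanged:", params_same)
print("toy sequence:", toy_sequence(rot_lattice))
```

The design point is that a sequence built from quantities such as lengths and angles does not depend on how the crystal is oriented or positioned in space, whereas one built from raw Cartesian numbers does.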
