Crystalite: A Lightweight Transformer for Efficient Crystal Modeling

Tin Hadži Veljković, Joshua Rosenthal, Ivor Lončarić, Jan-Willem van de Meent

Abstract

Generative models for crystalline materials often rely on equivariant graph neural networks, which capture geometric structure well but are costly to train and slow to sample. We present Crystalite, a lightweight diffusion Transformer for crystal modeling built around two simple inductive biases. The first is Subatomic Tokenization, a compact, chemically structured atom representation that replaces high-dimensional one-hot encodings and is better suited to continuous diffusion. The second is the Geometry Enhancement Module (GEM), which injects periodic minimum-image pair geometry directly into attention through additive geometric biases. Together, these components preserve the simplicity and efficiency of a standard Transformer while making it better matched to the structure of crystalline materials. Crystalite achieves state-of-the-art results on crystal structure prediction benchmarks and strong de novo generation performance, attaining the best S.U.N. discovery score among the evaluated baselines while sampling substantially faster than geometry-heavy alternatives.

Paper Structure

This paper contains 61 sections, 126 equations, 13 figures, and 6 tables.

Figures (13)

  • Figure 1: Overview of the proposed architecture. Left: The Geometry Enhancement Module (GEM) computes pairwise minimum-image geometry under periodic boundary conditions (PBC) from fractional coordinates $\mathbf{f}_t$ and lattice $\mathbf{L}_t$. Two bias terms are constructed: an edge-aware bias $B_{\text{edge}}$ via Fourier features and a multi-layer perceptron (MLP), and a distance-based bias $B_{\text{dist}}$ via scaled minimum distances. These are combined into an additive attention mask (attn_mask). Right: Standard multi-head attention (MHA), where the geometric mask is injected additively into the attention logits before the softmax, thereby modulating attention scores while preserving the canonical $QK^{T}$ formulation. (A minimal code sketch of this computation follows the figure list.)
  • Figure 2: Subatomic tokenization of atomic species. Instead of representing each chemical element by a one-hot identity vector, we assign each element a fixed 34-dimensional chemically structured descriptor built from its period, group, block, and valence-shell occupancies. These descriptors are compressed to a 16-dimensional token space using principal component analysis (PCA), yielding a continuous atom-type representation for diffusion. Diffusion noise is applied in this 16-dimensional space, after which a learned embedding maps the noisy token to the Transformer hidden dimension. Representative examples are shown for oxygen (left) and titanium (right). (A tokenization sketch follows the figure list.)
  • Figure 3: Overview of the Crystalite architecture. The model operates on the continuous crystal state $(\mathbf H,\mathbf F,\mathbf y)$. Atom-type and coordinate embeddings are added to form one token per atom, while the lattice embedding produces a single global lattice token. The resulting sequence is processed by an AdaLN-conditioned Transformer trunk, and output heads predict $\hat{\mathbf H}$, $\hat{\mathbf F}$, and $\hat{\mathbf y}$. (A block-level sketch of this trunk follows the figure list.)
  • Figure 4: Training-time trade-off in de novo generation. Unique-and-novel (UN) rate (left), stability (middle), and stable-unique-and-novel (SUN) rate (right) as a function of training steps for two Crystalite runs with different atom-loss settings. The setting that achieves higher stability also loses UN rate more quickly, whereas the more diversity-preserving setting yields a flatter and more sustained SUN trajectory. Overall, the figure illustrates the central de novo generation (DNG) trade-off: improved distributional fit tends to increase stability, but often at the cost of novelty and uniqueness, making checkpoint selection and loss balancing important in practice.
  • Figure 5: Large-scale generation. Uniqueness and UN rate are shown as a function of the number of generated crystals for Crystalite and ADiT. Crystalite consistently preserves more diversity at scale, reaching both higher uniqueness and a higher UN rate at $10^6$ samples.
  • ...and 8 more figures
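
Since the Figure 1 caption pins down GEM's computation step by step, a short sketch can make it concrete. The PyTorch module below is a minimal, hypothetical rendering of that description: the Fourier-feature count, MLP width, and head count are illustrative placeholders rather than the paper's settings, and `GeometryEnhancementModule` is a name chosen here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeometryEnhancementModule(nn.Module):
    """Minimal GEM sketch: periodic pair geometry -> additive attention bias."""

    def __init__(self, num_heads: int = 8, num_fourier: int = 16, hidden: int = 64):
        super().__init__()
        # Fixed random Fourier frequencies for encoding minimum-image distances.
        self.register_buffer("freqs", torch.randn(num_fourier))
        # Edge-aware bias B_edge: Fourier features -> per-head bias via an MLP.
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * num_fourier, hidden),
            nn.SiLU(),
            nn.Linear(hidden, num_heads),
        )
        # Distance-based bias B_dist: a learned per-head scaling of the raw distance.
        self.dist_scale = nn.Parameter(torch.zeros(num_heads))

    def forward(self, frac: torch.Tensor, lattice: torch.Tensor) -> torch.Tensor:
        # frac: (N, 3) fractional coordinates; lattice: (3, 3), rows are cell vectors.
        diff = frac[:, None, :] - frac[None, :, :]   # (N, N, 3) fractional offsets
        diff = diff - torch.round(diff)              # minimum-image convention, in [-0.5, 0.5)
        dist = (diff @ lattice).norm(dim=-1)         # (N, N) Cartesian minimum-image distances
        phase = dist[..., None] * self.freqs         # (N, N, F)
        fourier = torch.cat([phase.sin(), phase.cos()], dim=-1)
        b_edge = self.edge_mlp(fourier)              # (N, N, H)
        b_dist = dist[..., None] * self.dist_scale   # (N, N, H)
        # Additive mask of shape (H, N, N); logits become QK^T / sqrt(d) + mask.
        return (b_edge + b_dist).permute(2, 0, 1)


# Usage: the mask is added to the attention logits before the softmax.
gem = GeometryEnhancementModule()
frac = torch.rand(5, 3)                # 5 atoms, fractional coordinates
lattice = 4.0 * torch.eye(3)           # cubic cell with 4 Å edges
mask = gem(frac, lattice)              # (8, 5, 5)
q = k = v = torch.randn(8, 5, 64)      # (heads, tokens, head_dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

Because the bias enters additively through `attn_mask`, the attention itself remains a stock $QK^{T}$ softmax, which is what keeps the trunk a standard Transformer.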
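
The Figure 2 pipeline (fixed descriptor, PCA compression, diffusion in token space) can be sketched in the same spirit. The descriptor layout below is an assumption: the caption names the ingredients (period, group, block, valence-shell occupancies) but not the exact 34-dimensional layout, so the field widths here are illustrative, and the snippet fits PCA over only a handful of elements rather than the full vocabulary.

```python
import numpy as np
from sklearn.decomposition import PCA

BLOCKS = "spdf"

def element_descriptor(period: int, group: int, block: str, occ) -> np.ndarray:
    """Hypothetical chemically structured descriptor built from period, group,
    block, and valence-shell occupancies (the paper's 34-dim layout may differ)."""
    v = np.zeros(7 + 18 + 4 + 4)
    v[period - 1] = 1.0                        # one-hot period (1-7)
    v[7 + group - 1] = 1.0                     # one-hot group (1-18)
    v[25 + BLOCKS.index(block)] = 1.0          # one-hot block (s/p/d/f)
    v[29:] = np.asarray(occ) / np.array([2.0, 6.0, 10.0, 14.0])  # normalized s/p/d/f occupancy
    return v

# Examples from the caption: oxygen (2s^2 2p^4) and titanium (4s^2 3d^2),
# plus a few more elements so that PCA has something to fit.
descriptors = np.stack([
    element_descriptor(2, 16, "p", (2, 4, 0, 0)),   # O
    element_descriptor(4, 4, "d", (2, 0, 2, 0)),    # Ti
    element_descriptor(3, 14, "p", (2, 2, 0, 0)),   # Si
    element_descriptor(3, 1, "s", (1, 0, 0, 0)),    # Na
    element_descriptor(4, 8, "d", (2, 0, 6, 0)),    # Fe
])

# The paper compresses descriptors of the full element vocabulary to 16
# dimensions; with only five elements here we keep 3 components.
pca = PCA(n_components=3)
tokens = pca.fit_transform(descriptors)

# Continuous diffusion then perturbs tokens directly in this compact space.
sigma = 0.1
noisy_tokens = tokens + sigma * np.random.randn(*tokens.shape)
```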
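
Finally, the Figure 3 trunk is a plain AdaLN-conditioned Transformer over per-atom tokens plus one global lattice token. The block below follows the common DiT-style AdaLN recipe as a sketch; widths, activations, and the shape of the conditioning vector are assumptions, and the GEM mask from the first sketch would be supplied as `attn_mask`.

```python
import torch
import torch.nn as nn


class AdaLNBlock(nn.Module):
    """One AdaLN-conditioned Transformer block (dimensions are illustrative)."""

    def __init__(self, d: int = 256, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        # Conditioning (e.g. diffusion-time embedding) -> shift/scale/gate per sublayer.
        self.ada = nn.Linear(d, 6 * d)

    def forward(self, x, cond, attn_mask=None):
        # x: (B, N, d) tokens; cond: (B, 1, d) conditioning vector.
        s1, b1, g1, s2, b2, g2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)


# Token assembly as in the caption: one token per atom from summed atom-type
# and coordinate embeddings, plus a single global lattice token.
B, N, d = 2, 5, 256
atom_embed = nn.Linear(16, d)      # noisy subatomic token -> hidden
coord_embed = nn.Linear(3, d)      # fractional coordinates -> hidden
lattice_embed = nn.Linear(9, d)    # flattened 3x3 lattice -> one global token

H = torch.randn(B, N, 16)          # noisy atom-type tokens
F = torch.rand(B, N, 3)            # fractional coordinates
y = torch.randn(B, 9)              # flattened lattice

atoms = atom_embed(H) + coord_embed(F)                          # (B, N, d)
tokens = torch.cat([atoms, lattice_embed(y)[:, None]], dim=1)   # (B, N + 1, d)
cond = torch.randn(B, 1, d)                                     # e.g. diffusion-time embedding
out = AdaLNBlock(d)(tokens, cond)                               # (B, N + 1, d)
```

Output heads for $\hat{\mathbf H}$, $\hat{\mathbf F}$, and $\hat{\mathbf y}$ would then read off the atom tokens and the lattice token, respectively.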