OXtal: An All-Atom Diffusion Model for Organic Crystal Structure Prediction

Emily Jin, Andrei Cristian Nica, Mikhail Galkin, Jarrid Rector-Brooks, Kin Long Kelvin Lee, Santiago Miret, Frances H. Arnold, Michael Bronstein, Avishek Joey Bose, Alexander Tong, Cheng-Hao Liu

TL;DR

Crystal structure prediction for organic molecules is a long-standing challenge due to the coupling of intramolecular conformers and periodic packing. OXtal introduces a large-scale all-atom diffusion transformer conditioned on 2D graphs and a lattice-free training scheme ($S^4$) with SE(3) data augmentation to learn the joint distribution of conformations and packings. Trained on ~600K experimental crystals, it achieves sub-angstrom conformer accuracy ($\text{RMSD}_1<0.5$ Å) and high packing similarity, outperforming prior ML CSP methods and offering orders-of-magnitude cheaper inference than DFT. The approach demonstrates robust handling of polymorphs, co-crystals, and diverse interactions, enabling scalable, high-fidelity CSP at industrially relevant scales.

Abstract

Accurately predicting experimentally realizable 3D molecular crystal structures from their 2D chemical graphs is a long-standing open challenge in computational chemistry called crystal structure prediction (CSP). Efficiently solving this problem has implications ranging from pharmaceuticals to organic semiconductors, as crystal packing directly governs the physical and chemical properties of organic solids. In this paper, we introduce OXtal, a large-scale 100M-parameter all-atom diffusion model that directly learns the conditional joint distribution over intramolecular conformations and periodic packing. To efficiently scale OXtal, we abandon explicit equivariant architectures, which impose an inductive bias arising from crystal symmetries, in favor of data augmentation strategies. We further propose a novel crystallization-inspired lattice-free training scheme, Stoichiometric Stochastic Shell Sampling ($S^4$), that efficiently captures long-range interactions while sidestepping explicit lattice parametrization, thus enabling more scalable architectural choices at all-atom resolution. By leveraging a large dataset of 600K experimentally validated crystal structures (including rigid and flexible molecules, co-crystals, and solvates), OXtal achieves orders-of-magnitude improvements over prior ab initio machine learning CSP methods, while remaining orders of magnitude cheaper than traditional quantum-chemical approaches. Specifically, OXtal recovers experimental structures with conformer $\text{RMSD}_1<0.5$ Å and attains a packing similarity rate of over 80\%, demonstrating its ability to model both thermodynamic and kinetic regularities of molecular crystallization.
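To make the idea of replacing an equivariant architecture with data augmentation concrete, the following is a minimal sketch of SE(3) augmentation applied to atomic coordinates. It is not the authors' implementation: the function names `random_rotation` and `se3_augment`, the `max_shift` parameter, and the use of NumPy are all assumptions made for illustration. Each call applies one random rigid motion (a proper rotation plus a translation), so a model trained on such samples sees the same structure in many poses.

```python
import numpy as np


def random_rotation(rng):
    """Draw a random proper rotation matrix (det = +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    # Fix column signs so the factorization is unique.
    q = q * np.sign(np.diag(r))
    # Flip one axis if we landed on a reflection instead of a rotation.
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q


def se3_augment(coords, rng, max_shift=1.0):
    """Apply one random rigid motion to an (N, 3) coordinate array.

    `max_shift` (hypothetical parameter) bounds the random translation
    along each axis, in the same units as `coords`.
    """
    rot = random_rotation(rng)
    shift = rng.uniform(-max_shift, max_shift, size=3)
    return coords @ rot.T + shift


rng = np.random.default_rng(0)
coords = rng.normal(size=(6, 3))   # stand-in for atomic positions
augmented = se3_augment(coords, rng)
```

Because the transformation is rigid, all interatomic distances are unchanged; only the global pose differs between the original and augmented coordinates.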

Paper Structure

This paper contains 44 sections, 5 theorems, 21 equations, 22 figures, 13 tables, 2 algorithms.

Key Result

Proposition 1

Let $\partial \mathbf{A}_{\text{crop}} = \{\{u,v\} \in E : u \in \mathbf{A}_{\text{crop}} , v \notin \mathbf{A}_{\text{crop}} \}$ represent the boundary of $\mathbf{A}_{\text{crop}}$. Denote the number of atoms in a volume $C$ as $T(C)$. Let $L_\partial(\mathbf{A}_{\text{crop}}) = \sum_{\{u,v\} \in
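The boundary set $\partial \mathbf{A}_{\text{crop}}$ defined in the proposition can be illustrated with a small sketch: it collects every undirected edge with exactly one endpoint inside the crop. The function name `edge_boundary` and the toy graph are assumptions made for illustration only, and the sketch does not attempt to compute $L_\partial$, whose definition is cut off above.

```python
def edge_boundary(edges, crop):
    """Return the undirected edges {u, v} with exactly one endpoint in `crop`."""
    crop = set(crop)
    return {
        frozenset((u, v))
        for u, v in edges
        if (u in crop) != (v in crop)  # one endpoint inside, one outside
    }


# Toy example: a 4-cycle 0-1-2-3-0 with the crop containing atoms 0 and 1.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
boundary = edge_boundary(edges, crop={0, 1})
```

Here the edges {1, 2} and {0, 3} cross the crop boundary, while {0, 1} lies entirely inside the crop and {2, 3} entirely outside.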

Figures (22)

  • Figure 1: Molecular crystal structures generated by OXtal (color) compared to ground truth (grey).
  • Figure 1: Performance of ab initio ML models on rigid and flexible molecular CSP in 30 samples. OXtal achieves an order of magnitude improvement and is the only model able to approximately solve any crystals in the flexible dataset.
  • Figure 2: Molecular crystals consist of distinct molecules held together via long-range, weak interactions. They typically contain many atoms per unit cell and unknown molecule copies $Z$.
  • Figure 2: Results for the 5th, 6th, and 7th CCDC CSP blind tests. Classical chemistry methods are aggregated as DFTavg. The best model is bolded, and the second best is underlined.
  • Figure 3: (a) Schematic of a rugged crystallization Gibbs free energy landscape with many local minima. Kinetic conditions often dictate which experimental minimum is formed. (b) Molecular crystallization, showing nucleation and growth in successive layers, which is the inspiration for $S^4$. (c) Common packing motifs exemplified in co-crystal polymorphs with 1:1 and 2:1 stoichiometric ratio. (d) Overview of OXtal architecture.
  • ...and 17 more figures

Theorems & Definitions (8)

  • Proposition 1
  • Proposition 1
  • Lemma 1
  • proof
  • Lemma 2
  • Lemma 3
  • proof
  • proof