
Geometry Informed Tokenization of Molecules for Language Model Generation

Xiner Li, Limei Wang, Youzhi Luo, Carl Edwards, Shurui Gui, Yuchao Lin, Heng Ji, Shuiwang Ji

TL;DR

Geo2Seq converts molecular geometries into $SE(3)$-invariant 1D discrete sequences. It consists of canonical labeling and invariant spherical representation steps, which together preserve geometric and atomic fidelity in a format conducive to LMs.

Abstract

We consider molecule generation in 3D space using language models (LMs), which requires discrete tokenization of 3D molecular geometries. Although tokenization of molecular graphs exists, tokenization of 3D geometries is largely unexplored. Here, we attempt to bridge this gap by proposing Geo2Seq, which converts molecular geometries into $SE(3)$-invariant 1D discrete sequences. Geo2Seq consists of canonical labeling and invariant spherical representation steps, which together maintain geometric and atomic fidelity in a format conducive to LMs. Our experiments show that, when coupled with Geo2Seq, various LMs excel in molecular geometry generation, especially in controlled generation tasks.

Paper Structure

This paper contains 32 sections, 9 theorems, 51 equations, 10 figures, and 12 tables.

Key Result

Lemma 3.2

[Canonical Labeling for Colored Graph Isomorphism] Let $G_1 = (V_1, E_1, A_1)$ and $G_2 = (V_2, E_2, A_2)$ be two finite, undirected graphs where $V_i$ denotes the set of vertices, $E_i$ denotes the set of edges, and $A_i$ denotes the node attributes of the graph $G_i$ for $i = 1, 2$. Let ${\bm{L}}$ be a function that maps a graph to its canonical labeling. Then ${\bm{L}}(G_1) = {\bm{L}}(G_2) \iff G_1 \cong G_2$, where $G_1 \cong G_2$ denotes that $G_1$ and $G_2$ are isomorphic.
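The idea behind a canonical labeling of a colored graph can be illustrated with a small brute-force sketch. This is not the algorithm used in the paper: the function `canonical_form` and the toy graphs below are hypothetical, and the search over all relabelings is exponential, so it only works for tiny graphs. It shows the property the lemma relies on: isomorphic colored graphs receive the same canonical form, while a non-isomorphic graph with the same atoms does not.

```python
from itertools import permutations


def canonical_form(num_nodes, edges, colors):
    """Brute-force canonical form of a small colored undirected graph.

    Tries every relabeling of the nodes and keeps the lexicographically
    smallest (colors, edges) encoding. Illustration only: cost is O(n!).
    """
    best = None
    for perm in permutations(range(num_nodes)):
        # perm[old_index] gives the new index of that node.
        new_colors = [None] * num_nodes
        for old, new in enumerate(perm):
            new_colors[new] = colors[old]
        new_edges = sorted(
            tuple(sorted((perm[u], perm[v]))) for u, v in edges
        )
        candidate = (tuple(new_colors), tuple(new_edges))
        if best is None or candidate < best:
            best = candidate
    return best


# Two labelings of the same H-O-H graph (isomorphic, colors preserved).
g1 = canonical_form(3, [(0, 1), (0, 2)], ["O", "H", "H"])
g2 = canonical_form(3, [(0, 2), (1, 2)], ["H", "H", "O"])

# Same atoms, different connectivity (H-H-O chain): not isomorphic.
g3 = canonical_form(3, [(0, 1), (1, 2)], ["H", "H", "O"])

print(g1 == g2)
print(g1 == g3)
```

Practical canonical labeling tools avoid the factorial search, but the contract is the same: the output depends only on the isomorphism class of the colored graph, not on the input node order.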

Figures (10)

  • Figure 1: Overview of Geo2Seq. We use the canonical labeling order to arrange nodes in a row, fill in the place of each node with vector $[z_i,d_i,\theta_i,\phi_i]$, and concatenate all elements into a sequence. Each node vector contains atom type and spherical coordinates. Notably, the spherical coordinates are $SE(3)$-invariant.
  • Figure 2: Illustrations of the equivariant frame and invariant spherical representations. If the molecule is rotated and translated by a rotation matrix ${\bm{Q}}$ and a translation vector ${\bm{b}}$, the atom coordinates change accordingly. But our spherical representations remain invariant since the frame is equivariant to the $SE(3)$-transformation.
  • Figure 3: Visualization of generated molecules conditioned on the polarizability property $\alpha$.
  • Figure 4: Visualization of molecules generated from Geo2Seq with Mamba trained on QM9.
  • Figure 5: Visualization of molecules generated from Geo2Seq with Mamba trained on GEOM-DRUGS.
  • ...and 5 more figures
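
The captions of Figures 1 and 2 describe a pipeline that can be sketched in a few lines: build a frame from the molecule itself, express every atom as $[z_i, d_i, \theta_i, \phi_i]$ in that frame, and concatenate the rows into one sequence. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the atoms are already given in canonical order, that the first three atoms are non-collinear, and that the frame is built from those three atoms in the specific way shown (origin at the first atom, x-axis toward the second). The helper names, the two-decimal rounding, and the made-up coordinates are all hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)


def spherical_rows(atoms):
    """atoms: list of (element, (x, y, z)), assumed to be in canonical order
    with the first three atoms non-collinear (assumption of this sketch)."""
    origin = atoms[0][1]
    ex = unit(sub(atoms[1][1], origin))               # x-axis: atom 1 -> atom 2
    ez = unit(cross(ex, sub(atoms[2][1], origin)))    # normal to the first-three plane
    ey = cross(ez, ex)                                # completes a right-handed frame
    rows = []
    for element, pos in atoms:
        v = sub(pos, origin)
        d = math.sqrt(dot(v, v))
        if d == 0.0:
            theta = phi = 0.0                         # convention for the origin atom
        else:
            theta = math.acos(max(-1.0, min(1.0, dot(v, ez) / d)))
            phi = math.atan2(dot(v, ey), dot(v, ex))
        rows.append((element, d, theta, phi))
    return rows


def to_sequence(rows, decimals=2):
    """Concatenate rows into one space-separated sequence (rounding is an assumption)."""
    tokens = []
    for element, d, theta, phi in rows:
        tokens.append(element)
        # Adding 0.0 turns a rounded -0.0 into 0.0 so it never prints as "-0.00".
        tokens.extend(f"{round(val, decimals) + 0.0:.{decimals}f}"
                      for val in (d, theta, phi))
    return " ".join(tokens)


def transform(atoms, a, b, shift):
    """Rotate about z by a, then about x by b, then translate by shift."""
    ca, sa, cb, sb = math.cos(a), math.sin(a), math.cos(b), math.sin(b)
    out = []
    for element, (x, y, z) in atoms:
        x1, y1, z1 = ca * x - sa * y, sa * x + ca * y, z
        x2, y2, z2 = x1, cb * y1 - sb * z1, sb * y1 + cb * z1
        out.append((element, (x2 + shift[0], y2 + shift[1], z2 + shift[2])))
    return out


# Made-up coordinates for a four-atom molecule, already in "canonical" order.
mol = [("C", (0.0, 0.0, 0.0)), ("O", (1.2, 0.0, 0.0)),
       ("H", (-0.55, 0.94, 0.0)), ("H", (-0.5, -0.6, 0.7))]

rows_a = spherical_rows(mol)
rows_b = spherical_rows(transform(mol, 0.7, 1.1, (3.0, -2.0, 5.0)))
print(to_sequence(rows_a))
print(to_sequence(rows_b))
```

Because the frame rotates and translates together with the molecule, the distances and angles computed in it agree for the original and the transformed copy up to floating-point error. The frame is taken from the first three atoms in this sketch so that the result depends only on the (assumed) canonical order and not on the global pose.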

Theorems & Definitions (15)

  • Definition 3.1
  • Lemma 3.2
  • Lemma 3.3
  • Definition 3.4
  • Theorem 3.5
  • Corollary 3.6
  • Lemma: Colored Canonical Labeling for Graph Isomorphism
  • Lemma
  • proof
  • Lemma B.1
  • ...and 5 more