SIGMA: Structure-Invariant Generative Molecular Alignment for Chemical Language Models via Autoregressive Contrastive Learning

Xinyu Wang, Fei Dou, Jinbo Bi, Minghu Song

Abstract

Linearized string representations serve as the foundation of scalable autoregressive molecular generation; however, they introduce a fundamental modality mismatch: a single molecular graph maps to multiple distinct sequences. This ambiguity leads to trajectory divergence, where the latent representations of structurally equivalent partial graphs drift apart due to differences in linearization history. To resolve this without abandoning the efficient string formulation, we propose Structure-Invariant Generative Molecular Alignment (SIGMA). Rather than altering the linear representation, SIGMA trains the model to recognize geometric symmetries via a token-level contrastive objective that explicitly aligns the latent states of prefixes sharing identical suffixes. Furthermore, we introduce Isomorphic Beam Search (IsoBeam) to eliminate isomorphic redundancy during inference by dynamically pruning equivalent paths. Empirical evaluations on standard benchmarks demonstrate that SIGMA bridges the gap between sequence scalability and graph fidelity, yielding superior sample efficiency and structural diversity in multi-parameter optimization compared to strong baselines.
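
To make the alignment objective concrete, the sketch below gives a minimal PyTorch rendering of a token-level contrastive loss over matched suffix positions. It is an illustration under our own assumptions rather than the paper's verbatim implementation: the name suffix_align_loss, the cosine-similarity scoring, and the temperature tau are choices introduced here, and the logit- and attention-level alignments described in Figure 2 are omitted.

```python
# Minimal sketch of a token-level suffix-alignment contrastive loss.
# Assumptions (illustrative, not the paper's exact formulation):
# cosine similarity with temperature tau; h_a and h_b hold decoder
# hidden states for two linearizations of the same molecule, already
# restricted to the matched suffix positions (same length and order).
import torch
import torch.nn.functional as F

def suffix_align_loss(h_a: torch.Tensor,
                      h_b: torch.Tensor,
                      tau: float = 0.1) -> torch.Tensor:
    """h_a, h_b: (T, d) hidden states at matched suffix positions."""
    za = F.normalize(h_a, dim=-1)          # (T, d) unit-norm embeddings
    zb = F.normalize(h_b, dim=-1)          # (T, d)
    logits = za @ zb.t() / tau             # (T, T) pairwise similarities
    targets = torch.arange(za.size(0), device=za.device)
    # Symmetric InfoNCE: position i in view A should match position i in
    # view B; every other suffix position acts as a negative. A
    # prefix-repel term (cf. Figure 2) would append the mismatched
    # prefix states as additional negatives.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

In a training loop, a natural (assumed) usage is to add this term to the standard autoregressive MLE loss with a small weighting coefficient.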

Figures (8)

  • Figure 1: Conceptual comparison between Global Contrastive Learning and our proposed Structure-Invariant Autoregressive Contrastive Learning (SIGMA). Top: Traditional methods treat different SMILES permutations (Views A, B, C) of the same molecule as distinct sequences and contrast only the complete SMILES strings, causing Manifold Fragmentation, where functionally equivalent states (substrings) map to disparate points in the latent space. Bottom: SIGMA enforces Trajectory Alignment. By identifying that Views A, B, and C share a topologically identical suffix, SIGMA explicitly aligns their generation paths into a unified trajectory while pushing away structurally distinct negative samples (View D).
  • Figure 2: Token-level contrastive supervision across equivalent SMILES sequences. Two structurally equivalent SMILES (top: constrained random, bottom: original) are decoded independently using an autoregressive model. Positive alignment losses (suffix-align) are applied over matched suffix tokens (shown in blue and green), while negative repulsion (prefix-repel) is enforced between mismatched prefix tokens (in red). (Left) Token distribution alignment is computed using output logits; (Right) attention alignment compares cross-attention weights across matched positions.
  • Figure 3: Resolving Isomorphic Redundancy with Isomorphic Beam (IsoBeam) Search. (Top) Standard Beam Search: Beam capacity is wasted on multiple linearizations of the same molecule (e.g., variants of acetophenone), causing isomorphic redundancy where topologically identical duplicates crowd the search space. (Bottom) IsoBeam Search: A Partial Graph Check verifies whether prefixes share identical substructures and consistent attachment points. Lower-probability isomorphic paths are dynamically pruned (red cross), reallocating the budget to explore distinct trajectories (e.g., a pyridine-based scaffold), thereby maximizing structural diversity. (A minimal pruning sketch follows this figure list.)
  • Figure 4: Latent Space Visualization of Geometric Invariance. We utilize t-SNE to project the terminal hidden states of 50 randomized SMILES views for 10 distinct molecules (color-coded by molecular identity). SIGMA (Ours) achieves superior geometric invariance, where isomorphic views collapse into tight, well-separated clusters. In contrast, the Baseline (MLE) exhibits severe manifold fragmentation, with isomorphic trajectories scattered and entangled. Random SMILES Augmentation improves separation but suffers from disconnected islands, where a single molecule maps to multiple disjoint clusters. Last Token Contrastive Learning fails to cleanly disentangle structural identities, demonstrating that dense token-level alignment is essential for learning isotropic representations.
  • Figure 5: Token-wise Latent Alignment Heatmaps. We visualize the cosine similarity matrices of hidden states between two syntactically distinct SMILES linearizations of Acetophenone (y-axis: CC(=O)..., x-axis: O=C(C)...). SIGMA (Left) exhibits a distinct semantic alignment block in the top-left region. Despite the disjoint syntax of the prefix (acetyl group), the model aligns the latent representations of CC(=O) and O=C(C), confirming it captures the underlying structural equivalence. In contrast, the Baseline and GCL models show near-zero similarity (dark blue) in this off-diagonal region, indicating they rely primarily on surface-level token matching and fail to recognize the structural identity of the permuted prefixes. All models successfully align the identical suffix (c1ccccc1, bottom-right diagonal).
  • ...and 3 more figures
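
Figure 3's Partial Graph Check can be approximated with standard cheminformatics tooling. The sketch below is a minimal Python/RDKit rendering under our own assumptions: a prefix is heuristically "closed" by balancing parentheses and dangling ring bonds so RDKit can parse it, and beams whose closed prefixes canonicalize to the same SMILES are treated as isomorphic, keeping only the highest-scoring survivor. The helpers close_prefix, canonical_key, and prune_isomorphic are hypothetical names, and a faithful implementation would also compare attachment points, which this deduplication key ignores.

```python
# Minimal sketch of IsoBeam-style isomorphism pruning for beam search.
# Assumptions (ours, not the paper's exact procedure): heuristic prefix
# closing + RDKit canonical SMILES as the deduplication key.
import re
from rdkit import Chem

def close_prefix(prefix: str) -> str:
    """Heuristically complete a SMILES prefix so RDKit can parse it."""
    s = prefix + ")" * (prefix.count("(") - prefix.count(")"))
    # Close ring bonds opened an odd number of times. This naive digit
    # scan ignores %nn ring labels and can be confused by bracket atoms
    # such as [CH3]; it is only meant to illustrate the idea.
    for digit in set(re.findall(r"(?<!%)\d", s)):
        if s.count(digit) % 2 == 1:
            s += digit
    return s

def canonical_key(prefix: str) -> str | None:
    """Canonical SMILES of the closed prefix, or None if unparseable."""
    mol = Chem.MolFromSmiles(close_prefix(prefix))
    return Chem.MolToSmiles(mol) if mol is not None else None

def prune_isomorphic(beams: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """beams: (prefix, log_prob) pairs sorted by score, best first."""
    seen: set[str] = set()
    survivors: list[tuple[str, float]] = []
    for prefix, score in beams:
        key = canonical_key(prefix)
        if key is None:            # unparseable even after closing: keep
            survivors.append((prefix, score))
        elif key not in seen:      # first (highest-scoring) isomorph wins
            seen.add(key)
            survivors.append((prefix, score))
    return survivors
```

The freed beam slots can then be refilled from the next-best expansions, which is how the pruning reallocates budget toward structurally distinct trajectories.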