Structure-Aware Fusion with Progressive Injection for Multimodal Molecular Representation Learning

Zihao Jing, Yan Sun, Yan Yi Li, Sugitha Janarthanan, Alana Deng, Pingzhao Hu

TL;DR

MuMo tackles conformer-related instability and modality collapse in multimodal molecular representation learning with two components: a Structured Fusion Pipeline (SFP) that unifies 2D topology and 3D geometry into a stable structural prior, and a Progressive Injection (PI) mechanism that asymmetrically injects this prior into the SMILES sequence stream. The fusion relies on a Unified Graph with geometric priors and BRICS-based multiscale partitioning, while the sequence stream is powered by a state-space backbone (Mamba) that supports long-range dependencies and evolving priors. Across 29 datasets spanning TDC, MoleculeNet, and Reaxtica, MuMo achieves an average improvement of 2.7% over the best-performing baseline on each task, ranks first on 22 tasks, and shows gains of up to 27% on LD50, demonstrating robustness to conformer noise and the effectiveness of structure-guided multimodal fusion. The work presents a practical, efficient framework that improves molecular property prediction by leveraging structured priors and asymmetric cross-modal sharing, with strong potential for deployment in drug discovery workflows.

Abstract

Multimodal molecular models often suffer from 3D conformer unreliability and modality collapse, limiting their robustness and generalization. We propose MuMo, a structured multimodal fusion framework that addresses these challenges in molecular representation through two key strategies. To reduce the instability of conformer-dependent fusion, we design a Structured Fusion Pipeline (SFP) that combines 2D topology and 3D geometry into a unified and stable structural prior. To mitigate modality collapse caused by naive fusion, we introduce a Progressive Injection (PI) mechanism that asymmetrically integrates this prior into the sequence stream, preserving modality-specific modeling while enabling cross-modal enrichment. Built on a state space backbone, MuMo supports long-range dependency modeling and robust information propagation. Across 29 benchmark tasks from Therapeutics Data Commons (TDC) and MoleculeNet, MuMo achieves an average improvement of 2.7% over the best-performing baseline on each task, ranking first on 22 of them, including a 27% improvement on the LD50 task. These results validate its robustness to 3D conformer noise and the effectiveness of multimodal fusion in molecular representation. The code is available at: github.com/selmiss/MuMo.
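The headline numbers in the abstract (an average improvement over the best-performing baseline on each task, and a count of first-place finishes) can be illustrated with a small sketch. It assumes the improvement is a relative gain measured against the strongest baseline per task, and that the direction is flipped for lower-is-better metrics; the abstract does not spell out either detail. The function names, task names, and scores below are hypothetical and are not taken from the paper.

```python
from statistics import mean


def relative_gain(model, best_baseline, higher_is_better):
    """Relative improvement of `model` over `best_baseline`, as a fraction.

    Assumption: gains are relative to the best baseline; this is one
    plausible reading of the abstract, not a confirmed definition.
    """
    if higher_is_better:
        return (model - best_baseline) / abs(best_baseline)
    return (best_baseline - model) / abs(best_baseline)


def summarize(results):
    """Return (average relative gain, number of tasks where the model is best)."""
    gains = []
    wins = 0
    for record in results.values():
        # The strongest baseline is the max for AUROC-style metrics
        # and the min for error-style metrics such as MAE.
        pick = max if record["higher_is_better"] else min
        best = pick(record["baselines"])
        gain = relative_gain(record["model"], best, record["higher_is_better"])
        gains.append(gain)
        if gain > 0:
            wins += 1
    return mean(gains), wins


# Toy numbers, invented for illustration only (not results from the paper).
toy = {
    "task_a_auroc": {"model": 0.90, "baselines": [0.80, 0.75], "higher_is_better": True},
    "task_b_mae": {"model": 0.40, "baselines": [0.50, 0.60], "higher_is_better": False},
    "task_c_auroc": {"model": 0.76, "baselines": [0.80, 0.70], "higher_is_better": True},
}

avg_gain, first_places = summarize(toy)
print(f"average relative gain: {avg_gain:.1%}, first place on {first_places} tasks")
```

Under this reading, a model can report a positive average gain while still trailing the best baseline on some tasks, which is consistent with the abstract reporting first place on 22 of 29 tasks rather than all of them.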

Paper Structure

This paper contains 67 sections, 27 equations, 13 figures, 28 tables, 4 algorithms.

Figures (13)

  • Figure 1: Illustration of limitations in molecular representation learning.
  • Figure 2: Overview of the MuMo architecture. (a) Structural Unified Representation for encoding the 2D and 3D modalities, (b) Substructure Partitioning for multiscale molecular features, (c) Fusion Pipeline (right) of 2D topology and 3D geometric priors, and Progressive Injection to integrate cross-modal structural information into the main sequence (left).
  • Figure 3: Multimodal Fusion. It illustrates how structural modalities (2D and 3D) are fused via the Structured Fusion Pipeline (SFP), then injected into the SMILES sequence stream through the Injection Enhanced Attention (IEA, within PI) module.
  • Figure 4: Pretraining loss curves under different modality configurations. Each part shows training (left two figures) and validation (right two figures) loss for a pairwise modality comparison.
  • Figure 5: Layer-wise representation of the pretrained model. UMAP of embeddings across 10 selected scaffolds (5,000 molecules), showing scaffold-level separation at different layers.
  • ...and 8 more figures