Structure-Aware Fusion with Progressive Injection for Multimodal Molecular Representation Learning
Zihao Jing, Yan Sun, Yan Yi Li, Sugitha Janarthanan, Alana Deng, Pingzhao Hu
TL;DR
MuMo tackles conformer-related instability and modality collapse in multimodal molecular representation learning with two components: a Structured Fusion Pipeline (SFP) that unifies 2D topology and 3D geometry into a stable structural prior, and a Progressive Injection (PI) mechanism that asymmetrically injects this prior into the SMILES sequence stream. The fusion relies on a Unified Graph with geometric priors and BRICS-based multiscale partitioning, while the sequence stream is built on a state-space backbone (Mamba) that supports long-range dependencies and evolving priors. Across 29 datasets spanning TDC, MoleculeNet, and Reaxtica, MuMo achieves an average improvement of $2.7\%$ over the best-performing baseline on each task, ranks first on $22$ tasks, and reaches a $27\%$ improvement on LD50, demonstrating robustness to conformer noise and the effectiveness of structure-guided multimodal fusion. The result is a practical, efficient framework that improves molecular property prediction by combining structured priors with asymmetric cross-modal sharing, with strong potential for deployment in drug discovery workflows.
Abstract
Multimodal molecular models often suffer from 3D conformer unreliability and modality collapse, limiting their robustness and generalization. We propose MuMo, a structured multimodal fusion framework that addresses these challenges in molecular representation through two key strategies. To reduce the instability of conformer-dependent fusion, we design a Structured Fusion Pipeline (SFP) that combines 2D topology and 3D geometry into a unified and stable structural prior. To mitigate modality collapse caused by naive fusion, we introduce a Progressive Injection (PI) mechanism that asymmetrically integrates this prior into the sequence stream, preserving modality-specific modeling while enabling cross-modal enrichment. Built on a state space backbone, MuMo supports long-range dependency modeling and robust information propagation. Across 29 benchmark tasks from Therapeutics Data Commons (TDC) and MoleculeNet, MuMo achieves an average improvement of 2.7% over the best-performing baseline on each task, ranking first on 22 of them, including a 27% improvement on the LD50 task. These results validate its robustness to 3D conformer noise and the effectiveness of multimodal fusion in molecular representation. The code is available at: github.com/selmiss/MuMo.
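The asymmetric, layer-by-layer character of Progressive Injection can be illustrated with a toy sketch. This is not the MuMo implementation. The function names (`mix`, `sequence_layer`, `progressive_injection`), the scalar `gates`, the running-mean stand-in for the state space layer, and the fixed (non-evolving) prior are all assumptions made for illustration. The only properties it is meant to show are that the structural prior flows into the sequence stream at every layer, and that the sequence stream never writes back into the prior.

```python
from typing import List

Vector = List[float]


def mix(token: Vector, prior: Vector, gate: float) -> Vector:
    """Blend a structural prior into one token with a scalar gate in [0, 1]."""
    return [(1.0 - gate) * t + gate * p for t, p in zip(token, prior)]


def sequence_layer(tokens: List[Vector]) -> List[Vector]:
    """Toy stand-in for a sequence layer: a causal running mean over tokens.

    This is NOT Mamba; it only mimics left-to-right information flow.
    """
    out: List[Vector] = []
    running = [0.0] * len(tokens[0])
    for count, tok in enumerate(tokens, start=1):
        running = [r + x for r, x in zip(running, tok)]
        out.append([r / count for r in running])
    return out


def progressive_injection(
    tokens: List[Vector], prior: Vector, gates: List[float]
) -> List[Vector]:
    """Apply one toy layer per gate, injecting the prior after each layer.

    The injection is one-directional: `prior` is only read, never updated,
    so the sequence stream is enriched without overwriting the structure side.
    """
    hidden = [list(t) for t in tokens]  # copy so the inputs stay untouched
    for gate in gates:
        hidden = sequence_layer(hidden)
        hidden = [mix(h, prior, gate) for h in hidden]
    return hidden


if __name__ == "__main__":
    smiles_tokens = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical token embeddings
    structural_prior = [0.5, 0.5]             # hypothetical fused 2D/3D prior
    print(progressive_injection(smiles_tokens, structural_prior, [0.1, 0.3]))
```

A gate of 0 leaves the sequence stream untouched by structure, while a gate of 1 replaces every token with the prior. Keeping the gates small and applying them repeatedly is one simple way to picture enrichment without the sequence modality being overwritten.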
