Unified Molecule Pre-training with Flexible 2D and 3D Modalities: Single and Paired Modality Integration

Tengwei Song, Min Wu, Yuan Fang

TL;DR

FlexMol learns unified molecular representations from 2D graphs and 3D conformations when some of the data is incomplete. It uses separate 2D and 3D encoders that share parameters, plus decoders that generate features for a missing modality, which enables Stage 1 training on paired data and Stage 2 continual learning on single-modality data. The optimization combines a contrastive alignment loss with encoder-decoder consistency losses to promote robust cross-modal fusion, so the two-stage pipeline preserves modality-specific information while achieving cross-modal alignment. Empirical results on molecular property prediction and conformation generation show performance competitive with large-scale baselines while requiring fewer paired samples, highlighting data efficiency and practical applicability for multimodal molecular representation learning.
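To make the contrastive alignment idea concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired 2D and 3D embeddings. The summary does not give the exact form of FlexMol's alignment loss, so the InfoNCE formulation, the `temperature` value, and the function names `l2_normalize` and `info_nce` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(z2d, z3d, temperature=0.1):
    """Symmetric InfoNCE loss (hypothetical stand-in for the alignment loss).

    z2d[i] and z3d[i] are embeddings of the same molecule from the 2D and
    3D encoders; every other pairing in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in z2d]
    b = [l2_normalize(v) for v in z3d]
    n = len(a)

    # sim[i][j]: temperature-scaled cosine similarity of 2D item i and 3D item j.
    sim = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(logits):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(logits):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(logits)

    sim_t = [list(col) for col in zip(*sim)]  # 3D-to-2D direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))


# Toy batch: matched pairs should score a lower loss than swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when each 2D embedding matches its own 3D embedding and large when the pairs are swapped, which is the behavior an alignment objective of this kind is meant to reward.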

Abstract

Molecular representation learning plays a crucial role in advancing applications such as drug discovery and material design. Existing work leverages 2D and 3D modalities of molecular information for pre-training, aiming to capture comprehensive structural and geometric insights. However, these methods require paired 2D and 3D molecular data to train the model effectively and prevent it from collapsing into a single modality, posing limitations in scenarios where a certain modality is unavailable or computationally expensive to generate. To overcome this limitation, we propose FlexMol, a flexible molecule pre-training framework that learns unified molecular representations while supporting single-modality input. Specifically, inspired by the unified structure in vision-language models, our approach employs separate models for 2D and 3D molecular data, leverages parameter sharing to improve computational efficiency, and utilizes a decoder to generate features for the missing modality. This enables a multistage continuous learning process where both modalities contribute collaboratively during training, while ensuring robustness when only one modality is available during inference. Extensive experiments demonstrate that FlexMol achieves superior performance across a wide range of molecular property prediction tasks, and we also empirically demonstrate its effectiveness with incomplete data. Our code and data are available at https://github.com/tewiSong/FlexMol.

Paper Structure

This paper contains 22 sections, 9 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: (a) & (b) Two categories of models that integrate both 2D and 3D molecule modalities, and their respective advantages and drawbacks; (c) Our proposed FlexMol framework.
  • Figure 2: FlexMol framework pipeline. Stage 1: Pre-training a unified molecular representation using paired 2D & 3D modalities. Stage 2: Continuous training with single-modality molecule data, where the left side represents the 2D-only scenario and the right side represents the 3D-only scenario. Self-attention blocks with the same color share parameters, and the snowflake icon marks frozen parameters.
  • Figure 3: FlexMol performance on various sizes of 2D/3D-only data in Stage 2. BBBP is evaluated in ROC-AUC ($\uparrow$) and QM7 is evaluated in MAE ($\downarrow$).