MolFM-Lite: Multi-Modal Molecular Property Prediction with Conformer Ensemble Attention and Cross-Modal Fusion

Syed Omer Shah, Mohammed Maqsood Ahmed, Danish Mohiuddin Mohammed, Shahnawaz Alam, Mohd Vahaj ur Rahman

TL;DR

MolFM-Lite is a multi-modal model that jointly encodes SELFIES sequences (1D), molecular graphs (2D), and conformer ensembles (3D) through cross-attention fusion, while conditioning predictions on experimental context via Feature-wise Linear Modulation (FiLM).

Abstract

Most machine learning models for molecular property prediction rely on a single molecular representation (either a sequence, a graph, or a 3D structure) and treat molecular geometry as static. We present MolFM-Lite, a multi-modal model that jointly encodes SELFIES sequences (1D), molecular graphs (2D), and conformer ensembles (3D) through cross-attention fusion, while conditioning predictions on experimental context via Feature-wise Linear Modulation (FiLM). Our main methodological contributions are: (1) a conformer ensemble attention mechanism that combines learnable attention with Boltzmann-weighted priors over multiple RDKit-generated conformers, capturing the thermodynamic distribution of molecular shapes; and (2) a cross-modal fusion layer where each modality can attend to others, enabling complementary information sharing. We evaluate on four MoleculeNet scaffold-split benchmarks using our model's own splits, and report all baselines re-evaluated under the same protocol. Comprehensive ablation studies across all four datasets confirm that each architectural component contributes independently, with tri-modal fusion providing 7-11% AUC improvement over single-modality baselines and conformer ensembles adding approximately 2% over single-conformer variants. Pre-training on ZINC250K (~250K molecules) using cross-modal contrastive and masked-atom objectives enables effective weight initialization at modest compute cost. We release all code, trained models, and data splits to support reproducibility.
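
To make the conformer ensemble attention idea concrete, the sketch below shows one plausible way to combine a learnable attention score with a Boltzmann prior over conformer energies. It is an illustration under stated assumptions, not the authors' implementation: the function names, the single `query` vector, the choice to add the log-prior to the attention logits, and the use of kT ≈ 0.593 kcal/mol (room temperature) are all hypothetical choices made here.

```python
import math

# Assumption: energies are in kcal/mol and kT corresponds to roughly 298 K.
KT_KCAL_PER_MOL = 0.593


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def boltzmann_weights(energies, kT=KT_KCAL_PER_MOL):
    """Normalized Boltzmann factors exp(-E_k / kT) / Z for each conformer."""
    return softmax([-e / kT for e in energies])


def ensemble_attention(embeddings, energies, query, kT=KT_KCAL_PER_MOL):
    """Pool K conformer embeddings into one vector.

    Hypothetical combination rule: the learned score (scaled dot product
    with `query`) is added to the log Boltzmann prior, so the model can
    deviate from the thermodynamic weights when the data supports it.
    """
    d = len(query)
    scores = [
        sum(h_j * q_j for h_j, q_j in zip(h, query)) / math.sqrt(d)
        for h in embeddings
    ]
    # -E / kT is the log-prior up to a constant, which softmax ignores.
    logits = [s - e / kT for s, e in zip(scores, energies)]
    alpha = softmax(logits)
    pooled = [sum(a * h[j] for a, h in zip(alpha, embeddings)) for j in range(d)]
    return pooled, alpha
```

With a zero `query`, the attention weights reduce exactly to the Boltzmann weights, so the prior acts as the default and the learned term only shifts weight between conformers.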

Paper Structure (52 sections, 9 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: MolFM-Lite architecture. Three modality encoders process SELFIES (1D Transformer), molecular graph (2D GIN), and conformer ensemble (3D SchNet-Lite). Conformer ensemble attention aggregates multiple 3D conformations. Cross-modal fusion lets each modality attend to others. Context conditioning (FiLM) incorporates experimental metadata. A two-layer MLP with MC Dropout produces task predictions.
  • Figure 2: Benchmark comparison across all MoleculeNet datasets. MolFM-Lite (rightmost bars) consistently outperforms single-modality baselines across all four tasks.
  • Figure 3: Ablation heatmap summarizing the impact of each architectural component across all four datasets. Each cell shows the absolute performance change ($\Delta$) when a component is removed. Darker shading indicates larger degradation. Tri-modal fusion and pre-training show the largest and most consistent effects across tasks.
  • Figure 4: Conformer ensemble attention analysis. Left: correlation between learned attention weights and Boltzmann factors across the BBBP test set. Right: deviation patterns, where the model more frequently up-weights higher-energy conformers for uncertain predictions, suggesting the bioactive shape matters most in these cases.
  • Figure 5: Cross-modal attention weight analysis. Visualization of pairwise attention scores between modalities (1D$\rightarrow$2D, 1D$\rightarrow$3D, 2D$\rightarrow$3D) across molecules in the BBBP test set. The 1D encoder attends most strongly to 2D graph features, while the 2D encoder selectively attends to 3D spatial features for molecules with flexible scaffolds.
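
The context conditioning step named in Figure 1 uses FiLM, which scales and shifts each feature with values computed from a conditioning input: FiLM(h) = γ(c) ⊙ h + β(c). The sketch below is a minimal pure-Python illustration of that formula. The linear maps producing γ and β, the parameter names, and the toy context vector are assumptions for illustration; the paper's actual context features and layer sizes are not given here.

```python
def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def film(features, context, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise Linear Modulation: gamma(c) * h + beta(c), elementwise.

    `context` stands in for encoded experimental metadata (hypothetical);
    gamma and beta are each produced by a linear map of the context.
    """
    gamma = linear(context, w_gamma, b_gamma)
    beta = linear(context, w_beta, b_beta)
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


# Toy example: 2 fused features conditioned on a 2-dimensional context.
fused = [2.0, 3.0]
context = [1.0, 0.0]
out = film(
    fused,
    context,
    w_gamma=[[0.5, 0.0], [0.0, 0.0]],
    b_gamma=[1.0, 1.0],
    w_beta=[[0.0, 0.0], [2.0, 0.0]],
    b_beta=[0.0, 0.0],
)
```

Initializing the γ bias to one and all other parameters to zero makes the layer an identity map, which is a common way to let conditioning be learned gradually.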