Multimodal Fusion with Relational Learning for Molecular Property Prediction

Zhengyang Zhou, Yunrui Li, Pengyu Hong, Hao Xu

TL;DR

This paper tackles the limitations of graph-based molecular representations by introducing MMFRL, which combines a modified relational learning objective with multimodal pretraining and flexible fusion strategies. The approach leverages multiple modalities (e.g., SMILES, images, fingerprints, NMR data) to initialize and fine-tune molecular encoders, and systematically analyzes early, intermediate, and late fusion to understand their impact on predictive performance. MMFRL achieves state-of-the-art results on MoleculeNet benchmarks, provides explainability through case studies like ESOL and BACE, and demonstrates that continuous relational metrics can better capture inter-molecule relationships than binary contrastive signals. The work advances practical drug discovery workflows by enabling task-specific multimodal pretraining and by offering interpretable representations that reveal structure–activity relationships.

Abstract

Graph-based molecular representation learning is essential for accurately predicting molecular properties in drug discovery and materials science; however, it faces significant challenges due to the intricate relationships among molecules and the limited chemical knowledge utilized during training. While contrastive learning is often employed to handle molecular relationships, its reliance on binary metrics is insufficient for capturing the complexity of these interactions. Multimodal fusion has gained attention for property reasoning, but previous work has explored only a limited range of modalities, and the optimal stages for fusing different modalities in molecular property tasks remain underexplored. In this paper, we introduce MMFRL (Multimodal Fusion with Relational Learning for Molecular Property Prediction), a novel framework designed to overcome these limitations. Our method enhances embedding initialization through multimodal pretraining using relational learning. We also systematically investigate the impact of modality fusion at different stages (early, intermediate, and late), highlighting the advantages and shortcomings of each. Extensive experiments on MoleculeNet benchmarks demonstrate that MMFRL significantly outperforms existing methods. Furthermore, MMFRL enables task-specific optimizations. Additionally, the explainability of MMFRL provides valuable chemical insights, emphasizing its potential to enhance real-world drug discovery applications.
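The abstract's claim that binary metrics miss graded relationships can be illustrated with a small, generic sketch. This is not the MMFRL objective, which is not reproduced on this page. It assumes a plain soft-target cross-entropy between normalized target similarities and a softmax over latent similarities, and all numbers and names (`soft_relational_loss`, `latent_a`, `latent_b`) are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def normalize(ws):
    # Turn non-negative similarity scores into a probability vector.
    total = sum(ws)
    return [w / total for w in ws]

def soft_relational_loss(latent_sims, target_sims):
    # Cross-entropy between normalized targets and softmax(latent similarities).
    # Illustrative stand-in only; NOT the loss defined in the paper.
    p = normalize(target_sims)
    q = softmax(latent_sims)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical anchor molecule compared with three other molecules.
target_continuous = [0.9, 0.5, 0.1]  # graded similarity in some modality
target_binary = [1.0, 0.0, 0.0]      # contrastive-style: one positive, two negatives

latent_a = [2.0, 1.4, -0.2]  # latent similarities that preserve the graded order
latent_b = [2.0, -0.2, 1.4]  # same values, but 2nd and 3rd molecules swapped

loss_cont_a = soft_relational_loss(latent_a, target_continuous)
loss_cont_b = soft_relational_loss(latent_b, target_continuous)
loss_bin_a = soft_relational_loss(latent_a, target_binary)
loss_bin_b = soft_relational_loss(latent_b, target_binary)

print(loss_cont_a < loss_cont_b)                 # continuous target penalizes the swap
print(math.isclose(loss_bin_a, loss_bin_b))      # binary target cannot tell them apart
```

Under these assumptions, the continuous target prefers the latent space that keeps the graded ordering, while the binary target assigns both latent spaces the same loss because it only looks at the single positive pair.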

Paper Structure

This paper contains 37 sections, 2 theorems, 28 equations, 7 figures, 10 tables.

Key Result

Theorem 5.1

Let $\mathcal{S}$ be a set of instances of size $|\mathcal{S}|$, and let $\mathcal{P}$ represent the learnable latent representations of the instances in $\mathcal{S}$, such that $|\mathcal{P}| = |\mathcal{S}|$. For any two instances $i, j \in \mathcal{S}$, their respective latent representations are ... then, when it reaches the ideal optimum, the relationship between $t_{i,j}$ and $d_{i,j}$ satisfies:

Figures (7)

  • Figure 1: Multimodal Fusion with Relational Learning for Molecular Property Prediction (MMFRL). The figure shows how knowledge is transferred from other modalities and how fusion is used to improve performance further. Unlike the general contrastive learning framework (shown in an appendix figure), MMFRL does not need to define positive or negative pairs and is capable of learning a continuous ordering from target similarity. In Early Fusion, a single Init GNN is created by combining all modality information during pretraining. In Intermediate and Late Fusion, each modality has its own initialized GNN.
  • Figure 2: t-SNE visualization of the ESOL molecule embeddings for intermediate fusion (see the intermediate fusion section), alongside the molecules within the highlighted region. Each point in the heatmap corresponds to the embedding of one molecule in ESOL, with color indicating solubility: red denotes higher solubility, while blue indicates lower solubility. The embeddings derived from the individual modalities prior to fusion do not display a clear pattern, whereas the embeddings produced by intermediate fusion form a gradient that extends from the bottom left (lower solubility) to the upper center (higher solubility).
  • Figure 3: This figure shows the distribution of similarities between each modality and the intermediate fusion embedding for ESOL. In both Cosine Similarity and Dot Product, the embeddings from each modality exhibit low similarity with the intermediate-fused representation.
  • Figure 4: Lipo late fusion contribution analysis reveals that the three primary contributors are SMILES, image, and $\text{NMR}_\text{peak}$. In contrast, $\text{NMR}_\text{spectrum}$ and fingerprint exhibit negligible contributions.
  • Figure 5: The left sub-figure is a boxplot of the binding difference for the groups of molecules defined by the 8 most frequent Minimum Positive Subgraphs (MPS). The right sub-figure shows the detailed structure of the 5th MPS.
  • ...and 2 more figures
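
The captions above distinguish intermediate fusion (combining per-modality embeddings before prediction) from late fusion (combining per-modality contributions at the output). The sketch below is a minimal, generic illustration of that difference, not the paper's architecture. The embeddings, linear heads, and mixing weights are all hypothetical, and early fusion is omitted because, per Figure 1, it happens during pretraining.

```python
def linear(weights, bias, x):
    # A single linear unit: dot(weights, x) + bias.
    return sum(w * xi for w, xi in zip(weights, x)) + bias

# Hypothetical per-modality embeddings for one molecule
# (in practice these would come from modality-initialized encoders).
embeddings = {
    "smiles": [0.2, -0.1],
    "image": [0.5, 0.3],
    "nmr_peak": [-0.4, 0.8],
}

def intermediate_fusion(embeddings, head_w, head_b):
    # Concatenate embeddings in a fixed (sorted) modality order,
    # then apply ONE shared prediction head to the fused vector.
    fused = [v for name in sorted(embeddings) for v in embeddings[name]]
    return linear(head_w, head_b, fused)

def late_fusion(embeddings, heads, mix):
    # Each modality gets its OWN head; predictions are mixed afterwards.
    preds = {name: linear(heads[name][0], heads[name][1], emb)
             for name, emb in embeddings.items()}
    return sum(mix[name] * preds[name] for name in embeddings)

# Hypothetical parameters chosen only to make the arithmetic easy to follow.
head_w, head_b = [1.0] * 6, 0.0
heads = {name: ([1.0, 1.0], 0.0) for name in embeddings}
mix = {"image": 0.5, "nmr_peak": 0.25, "smiles": 0.25}

pred_intermediate = intermediate_fusion(embeddings, head_w, head_b)
pred_late = late_fusion(embeddings, heads, mix)
print(pred_intermediate, pred_late)
```

In this toy setup the mixing weights in `mix` play the role of per-modality contributions, which is the kind of quantity a late fusion contribution analysis such as the one in Figure 4 would inspect.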

Theorems & Definitions (3)

  • Theorem 5.1: Convergence of Modified Relational Learning Metric
  • Theorem B.1: Theorem of Convergent Similarity learning
  • proof