
MolFusion: Multimodal Fusion Learning for Molecular Representations via Multi-granularity Views

Muzhen Cai, Sendong Zhao, Haochun Wang, Yanrui Du, Zewen Qiang, Bing Qin, Ting Liu

TL;DR

Experimental results show that MolFusion effectively utilizes complementary multimodal information, leading to significant improvements in performance across various classification and regression tasks.

Abstract

Artificial Intelligence predicts drug properties by encoding drug molecules, aiding in the rapid screening of candidates. Different molecular representations, such as SMILES and molecule graphs, contain complementary information for molecular encoding. Thus, exploiting complementary information from different molecular representations is one of the research priorities in molecular encoding. Most existing methods for combining molecular multi-modalities use only molecular-level information, making it hard to encode intra-molecular alignment information between different modalities. To address this issue, we propose a multi-granularity fusion method, MolFusion. The proposed MolFusion consists of two key components: (1) MolSim, a molecular-level encoding component that achieves molecular-level alignment between different molecular representations, and (2) AtomAlign, an atomic-level encoding component that achieves atomic-level alignment between different molecular representations. Experimental results show that MolFusion effectively utilizes complementary multimodal information, leading to significant improvements in performance across various classification and regression tasks.


Paper Structure

This paper contains 23 sections, 7 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Diagram of multimodality and multi-granularity. It illustrates how both SMILES and molecule graphs can be represented at molecular and atomic levels, providing a comprehensive view of different granularities and modalities.
  • Figure 2: Diagram of molecular similarity. Comparison of the molecular properties of Aspirin and Paracetamol, with similar properties highlighted in bold.
  • Figure 3: Overview of MolFusion. (a) MolSim component illustration: Molecule graph and SMILES representations are processed through existing GNN and Transformer encoders, respectively. The computed similarity matrix is compared with the molecular similarity matrix using MSE loss. (b) AtomAlign component illustration: Randomly masked SMILES are encoded and subtracted from the molecular graph encodings. The resulting difference vector is used to predict the masked information by introducing only a linear layer. MolSim and AtomAlign are trained synchronously. (c) Downstream task process: SMILES and molecular graphs are encoded by their respective encoders trained through fusion learning. The resulting encodings are aggregated and then passed through a linear layer to predict the outcome. In all downstream tasks, the parameters of both encoders are frozen. Because SMILES and molecular graph data are interconvertible, either input can be used to generate the other with the RDKit tool (Landrum, 2013), thereby leveraging both fusion-learned encoders. The aggregation operation includes encoder-only, element-wise addition, and concatenation operations.
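The two training objectives described in the Figure 3 caption can be sketched in a few lines. The following is a minimal NumPy sketch, not the authors' implementation: function names, embedding shapes, the choice of cosine similarity, and the linear-layer parameters `W` and `b` are all assumptions for illustration. The paper's actual encoders are a GNN (graphs) and a Transformer (SMILES); here the embeddings are taken as given arrays.

```python
import numpy as np

def cosine_sim_matrix(A, B):
    # Pairwise cosine similarities between rows of A and rows of B.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def molsim_loss(smiles_emb, graph_emb, property_sim):
    # MolSim (assumed form): the cross-modal embedding similarity
    # matrix is matched to a precomputed molecular (property)
    # similarity matrix with an MSE loss, as in Figure 3(a).
    pred_sim = cosine_sim_matrix(smiles_emb, graph_emb)
    return np.mean((pred_sim - property_sim) ** 2)

def atomalign_logits(masked_smiles_emb, graph_emb, W, b):
    # AtomAlign (assumed form): the masked-SMILES encoding is
    # subtracted from the molecular graph encoding, and the
    # difference vector passes through a single linear layer to
    # predict the masked information, as in Figure 3(b).
    diff = graph_emb - masked_smiles_emb
    return diff @ W + b
```

In a real training loop both losses would be back-propagated through the two encoders jointly, since the caption states that MolSim and AtomAlign are trained synchronously.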
  • Figure 4: Visualization of molecular vectors from the ZINC dataset under three conditions: (1) No train: The vector spaces of different molecular representations do not overlap without training. (2) Contrastive Learning: The vector spaces between different representations are highly overlapping, leading to the loss of complementary information between the two modalities. (3) MolFusion: Our method results in partially overlapping vector spaces, consistent with the assumption that the information of different representations of the same molecule is partially duplicated, and the unique information of each modality is complementary. The SMILES vectors are represented in yellow, and the molecule graph vectors are represented in blue.
  • Figure 5: The performance of various ablation methods across different datasets.