Towards Cross-Modal Text-Molecule Retrieval with Better Modality Alignment

Jia Song, Wanru Zhuang, Yujie Lin, Liang Zhang, Chunyan Li, Jinsong Su, Song He, Xiaochen Bo

TL;DR

This work proposes a novel cross-modal text-molecule retrieval model with two-fold improvements, which achieves SOTA performance, outperforming the previously-reported best result by 6.4%.

Abstract

Cross-modal text-molecule retrieval aims to learn a shared feature space for the text and molecule modalities so that similarity can be calculated accurately, which facilitates the rapid screening of molecules with specific properties and activities in drug design. However, previous works have two main defects. First, they capture modality-shared features inadequately, given the significant gap between text sequences and molecule graphs. Second, they mainly rely on contrastive learning and adversarial training for cross-modality alignment, both of which focus on first-order similarity and ignore second-order similarity, which can capture more structural information in the embedding space. To address these issues, we propose a novel cross-modal text-molecule retrieval model with two-fold improvements. Specifically, on top of two modality-specific encoders, we stack a memory bank based feature projector that contains learnable memory vectors to better extract modality-shared features. More importantly, during model training, we calculate four kinds of similarity distributions (text-to-text, text-to-molecule, molecule-to-molecule, and molecule-to-text) for each instance, and then minimize the distance between these similarity distributions (namely, second-order similarity losses) to enhance cross-modal alignment. Experimental results and analysis demonstrate the effectiveness of our model. In particular, our model achieves SOTA performance, outperforming the previously reported best result by 6.4%.
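To make the second-order similarity idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes cosine similarity, a softmax with a hypothetical `temperature` parameter, and KL divergence as the distance between distributions; the abstract specifies none of these. The pairing of distributions into a `u2u` term (text-to-text vs. molecule-to-molecule) and a `u2c` term (unimodal vs. cross-modal) is likewise an assumption inferred from the loss names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=0.1):
    """Turn raw similarity scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_distribution(anchor, candidates, temperature=0.1):
    """Distribution of the anchor's similarity over all candidates in the batch."""
    return softmax([cosine(anchor, c) for c in candidates], temperature)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); an assumed choice of distance between distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def second_order_losses(text_emb, mol_emb, temperature=0.1):
    """Return (u2u, u2c) averaged over the batch.

    text_emb[i] and mol_emb[i] are assumed to be a matched text-molecule pair.
    """
    n = len(text_emb)
    u2u = 0.0
    u2c = 0.0
    for i in range(n):
        # The four similarity distributions named in the abstract.
        t2t = similarity_distribution(text_emb[i], text_emb, temperature)
        t2m = similarity_distribution(text_emb[i], mol_emb, temperature)
        m2m = similarity_distribution(mol_emb[i], mol_emb, temperature)
        m2t = similarity_distribution(mol_emb[i], text_emb, temperature)
        # Assumed pairing: unimodal vs. unimodal, and unimodal vs. cross-modal.
        u2u += kl_divergence(t2t, m2m)
        u2c += kl_divergence(t2t, t2m) + kl_divergence(m2m, m2t)
    return u2u / n, u2c / n

# Toy batch: perfectly aligned embeddings vs. embeddings with swapped pairs.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = second_order_losses(text, [[1.0, 0.0], [0.0, 1.0]])
swapped = second_order_losses(text, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, swapping the molecule embeddings leaves the unimodal neighbourhood structure unchanged, so only the `u2c` term grows. This illustrates why comparing whole similarity distributions can carry a signal beyond pairwise (first-order) similarity.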

Paper Structure

This paper contains 23 sections, 8 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The architecture of our model. It mainly consists of four modules: a text encoder, a molecule encoder and a discriminator distinguishing between two modalities, and a memory bank based feature projector that introduces learnable memory vectors to learn the multi-modal feature space. In addition to the conventional contrastive learning loss $\mathcal{L}_{cl}$ and adversarial training loss $\mathcal{L}_{adv}$, we incorporate second-order similarity losses $\mathcal{L}_{\mathrm{u2u}}$ and $\mathcal{L}_{\mathrm{u2c}}$ to enhance cross-modality alignment.
  • Figure 2: The diagram illustrates how the outputs of the two encoders are processed by the memory bank based feature projector to obtain the final modality-shared feature representations.
  • Figure 3: Results of our model w.r.t. different weights of $\mathcal{L}_{\mathrm{u2u}}$ and $\mathcal{L}_{\mathrm{u2c}}$ in the text-to-molecule (T2M) and molecule-to-text (M2T) retrieval tasks. Ours(u2c) denotes the variant of our model in which the weight of $\mathcal{L}_{\mathrm{u2c}}$ is adjusted, while Ours(u2u) denotes the variant in which the weight of $\mathcal{L}_{\mathrm{u2u}}$ is adjusted.
  • Figure 4: Comparison of MolT5 and our model-enhanced MolT5 on the molecule captioning task on the ChEBI-20 test set. Our model + MolT5 represents the performance after concatenating molecule graph features to the input of the MolT5 encoder.
  • Figure 5: The distributions of the modality gap, estimated with kernel density estimation (KDE), on the ChEBI-20 test set.
  • ...and 1 more figures
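
The memory bank based feature projector described in Figures 1 and 2 can be pictured with a small sketch. The text above does not give the projector's actual formulation, so this is only one plausible reading, assumed for illustration: each encoder output attends over a shared set of memory vectors (scaled dot-product attention) and is re-expressed as their weighted combination. The function name `project_with_memory` is hypothetical, and the memory vectors are fixed constants here, whereas in the paper they are learnable.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project_with_memory(feature, memory):
    """Re-express `feature` as an attention-weighted mix of memory vectors.

    Assumption: scaled dot-product attention with the feature as the query
    and the memory vectors serving as both keys and values.
    """
    scale = math.sqrt(len(feature))
    scores = [sum(f * m for f, m in zip(feature, slot)) / scale for slot in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    return [sum(w * slot[d] for w, slot in zip(weights, memory)) for d in range(dim)]

# Both modalities share the same memory, so their projected features are
# combinations of the same vectors (hypothetical toy values).
memory = [[1.0, 0.0], [0.0, 1.0]]
text_feature = [10.0, 0.0]
molecule_feature = [8.0, 1.0]
text_shared = project_with_memory(text_feature, memory)
molecule_shared = project_with_memory(molecule_feature, memory)
```

Because both modalities are projected through the same memory vectors in this sketch, their outputs lie in a common span, which is one way a projector of this kind could encourage modality-shared features.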