GRAM-DTI: adaptive multimodal representation learning for drug target interaction prediction

Feng Jiang, Amina Mollaysa, Hehuan Ma, Tommaso Mansi, Junzhou Huang, Mangal Prakash, Rui Liao

TL;DR

GRAM-DTI tackles DTI prediction by integrating four modalities (SMILES, text, HTA, and protein sequences) through Gramian volume-based multimodal alignment, enabling higher-order cross-modal interactions. It introduces gradient-informed adaptive modality dropout to avoid modality dominance and leverages IC50 annotations as weak supervision to ground representations in biologically relevant activity. The framework yields state-of-the-art performance across multiple DTI and MoA benchmarks, with particularly strong gains in cold-start and zero-shot retrieval scenarios, demonstrating robust generalization. These results highlight the value of richer multimodal pretraining for drug discovery, offering improved target identification and repurposing potential while reducing reliance on labeled data.

Abstract

Drug-target interaction (DTI) prediction is a cornerstone of computational drug discovery, enabling rational design, repurposing, and mechanistic insights. While deep learning has advanced DTI modeling, existing approaches primarily rely on SMILES-protein pairs and fail to exploit the rich multimodal information available for small molecules and proteins. We introduce GRAM-DTI, a pretraining framework that integrates multimodal molecular and protein inputs into unified representations. GRAM-DTI extends volume-based contrastive learning to four modalities, capturing higher-order semantic alignment beyond conventional pairwise approaches. To handle differences in modality informativeness, we propose adaptive modality dropout, which dynamically regulates each modality's contribution during pretraining. Additionally, IC50 activity measurements, when available, are incorporated as weak supervision to ground representations in biologically meaningful interaction strengths. Experiments on four publicly available datasets demonstrate that GRAM-DTI consistently outperforms state-of-the-art baselines. Our results highlight the benefits of higher-order multimodal alignment, adaptive modality utilization, and auxiliary supervision for robust and generalizable DTI prediction.
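To make the idea of volume-based alignment concrete, the following is a minimal sketch, not the authors' implementation. It assumes each modality yields one embedding vector per sample and measures alignment as the volume of the parallelotope spanned by the L2-normalized vectors, computed as the square root of the determinant of their Gram matrix. The function name `gram_volume` and the toy inputs are illustrative assumptions.

```python
import numpy as np

def gram_volume(vectors):
    """Volume of the parallelotope spanned by the rows of `vectors` (k x d).

    Hypothetical helper for illustration; a small volume means the k
    modality embeddings point in nearly the same direction.
    """
    a = np.asarray(vectors, dtype=float)
    # L2-normalize each embedding so that only directions matter.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    gram = a @ a.T  # k x k matrix of pairwise cosine similarities
    # Rounding can push the determinant slightly below zero; clip first.
    return float(np.sqrt(max(np.linalg.det(gram), 0.0)))

# Toy example with four "modalities" embedded in 4 dimensions.
aligned = np.array([[1.0, 0.0, 0.0, 0.0]] * 4) + 0.1 * np.eye(4)
orthogonal = np.eye(4)

vol_aligned = gram_volume(aligned)        # close to 0: nearly collinear
vol_orthogonal = gram_volume(orthogonal)  # 1: mutually orthogonal
```

Under this assumption, a contrastive objective would push the volume toward zero for matched four-modality tuples and keep it large for mismatched ones, which is what distinguishes the approach from aligning modalities pair by pair.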

Paper Structure

This paper contains 54 sections, 14 equations, 8 figures, 6 tables, 3 algorithms.

Figures (8)

  • Figure 1: Overview of GRAM-DTI architecture. Left: pretraining phase with volume-based multimodal alignment across four modalities (SMILES, text, HTA, protein sequences). The framework uses gradient-informed adaptive modality selection to dynamically regulate modality contributions during training. Right: downstream task prediction.
  • Figure 2: Ablation study results on the Activation dataset across five experimental configurations and three data splitting scenarios. The full GRAM-DTI model (Exp 1) outperforms variants with removed components in most cases, demonstrating the synergistic contribution of each training objective component.
  • Figure 3: Evolution of multimodal embeddings during GRAM-DTI pre-training visualized using t-SNE on 3,000 samples. Four modalities (SMILES, Text, HTA, Protein) progressively align from separate clusters to semantically integrated representations, demonstrating effective volume-based multimodal alignment.
  • Figure 4: Ablation study results on the Yamanishi 08 dataset across five experimental configurations and three data splitting scenarios. The full GRAM-DTI model (Exp 1) consistently outperforms variants with removed components across most metrics and scenarios, demonstrating the robust contribution of each training objective component. Results complement those shown in Figure 2 (Activation dataset) and confirm the generalizability of our design choices across different DTI prediction benchmarks.
  • Figure 5: Illustration of zero-shot retrieval evaluation. A query protein $p_j$ is compared against all candidate drugs $\{d_{i-1}, d_i, d_{i+1}, ...\}$ using cosine similarity of learned embeddings. Recall@k metrics evaluate whether any known positive interactions appear in the top-k retrieved candidates.
  • ...and 3 more figures
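
The zero-shot retrieval evaluation described in the Figure 5 caption can be sketched as follows. This is an illustrative reading of the caption, not the paper's evaluation code: the function name `recall_at_k`, the toy embeddings, and the choice to score a query as a hit when at least one known positive drug appears in the top k are assumptions.

```python
import numpy as np

def recall_at_k(protein_emb, drug_embs, positive_idx, k):
    """Return 1.0 if any known positive drug ranks in the top k, else 0.0.

    Candidates are ranked by cosine similarity to the query protein.
    """
    q = protein_emb / np.linalg.norm(protein_emb)
    d = drug_embs / np.linalg.norm(drug_embs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per candidate drug
    top_k = np.argsort(-scores)[:k]     # indices of the k most similar drugs
    return float(len(set(top_k.tolist()) & set(positive_idx)) > 0)

# Toy query protein and four candidate drugs in a 2-D embedding space.
query = np.array([1.0, 0.0])
drugs = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 0.1], [-1.0, 0.0]])
positives = {1}  # hypothetical ground truth: drug 1 binds the query protein

hit_at_1 = recall_at_k(query, drugs, positives, k=1)
hit_at_2 = recall_at_k(query, drugs, positives, k=2)
```

In this toy case drug 2 is the nearest candidate and drug 1 the second nearest, so the query counts as a miss at k=1 and a hit at k=2. Averaging the per-query value over all query proteins would give a dataset-level Recall@k.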