GRAM-DTI: adaptive multimodal representation learning for drug target interaction prediction
Feng Jiang, Amina Mollaysa, Hehuan Ma, Tommaso Mansi, Junzhou Huang, Mangal Prakash, Rui Liao
TL;DR
GRAM-DTI tackles DTI prediction by integrating four modalities (SMILES, text/HTA, protein sequences, and IC50 annotations) through Gramian volume-based multimodal alignment, enabling higher-order cross-modal interactions. It introduces gradient-informed adaptive modality dropout to avoid modality dominance and leverages IC50 as weak supervision to ground representations in biologically relevant activity. The framework yields state-of-the-art performance across multiple DTI and MoA benchmarks, with particularly strong gains in cold-start and zero-shot retrieval scenarios, demonstrating robust generalization. These results highlight the value of richer multimodal pretraining for drug discovery, offering improved target identification and repurposing potential while reducing reliance on labeled data.
Abstract
Drug-target interaction (DTI) prediction is a cornerstone of computational drug discovery, enabling rational design, repurposing, and mechanistic insights. While deep learning has advanced DTI modeling, existing approaches primarily rely on SMILES-protein pairs and fail to exploit the rich multimodal information available for small molecules and proteins. We introduce GRAM-DTI, a pretraining framework that integrates multimodal molecular and protein inputs into unified representations. GRAM-DTI extends volume-based contrastive learning to four modalities, capturing higher-order semantic alignment beyond conventional pairwise approaches. To handle differences in modality informativeness, we propose adaptive modality dropout, which dynamically regulates each modality's contribution during pretraining. Additionally, IC50 activity measurements, when available, are incorporated as weak supervision to ground representations in biologically meaningful interaction strengths. Experiments on four publicly available datasets demonstrate that GRAM-DTI consistently outperforms state-of-the-art baselines. Our results highlight the benefits of higher-order multimodal alignment, adaptive modality utilization, and auxiliary supervision for robust and generalizable DTI prediction.
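
To give a rough intuition for the volume-based alignment mentioned in the abstract, the sketch below computes the volume of the parallelotope spanned by four unit-normalized embedding vectors as the square root of the determinant of their Gram matrix. This is only an illustration of the geometric quantity, not the GRAM-DTI training objective: the function name `gram_volume`, the embedding dimension, and the random vectors standing in for modality embeddings are all assumptions made for this example.

```python
import numpy as np


def gram_volume(embeddings: np.ndarray) -> float:
    """Volume of the parallelotope spanned by the rows of `embeddings`.

    `embeddings` has shape (k, d): one d-dimensional vector per modality.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    # L2-normalize each row so the volume depends only on the angles
    # between the vectors, not on their lengths.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    gram = unit @ unit.T  # (k, k) matrix of pairwise cosine similarities
    # Clamp at zero to guard against tiny negative determinants from round-off.
    return float(np.sqrt(max(np.linalg.det(gram), 0.0)))


rng = np.random.default_rng(0)

# Four nearly identical vectors: a stand-in for well-aligned modalities.
anchor = rng.normal(size=16)
aligned = np.stack([anchor + 0.01 * rng.normal(size=16) for _ in range(4)])

# Four independent random vectors: a stand-in for unaligned modalities.
unrelated = rng.normal(size=(4, 16))

vol_aligned = gram_volume(aligned)
vol_unrelated = gram_volume(unrelated)
vol_orthogonal = gram_volume(np.eye(4, 16))  # mutually orthogonal unit vectors

print(f"aligned: {vol_aligned:.2e}  unrelated: {vol_unrelated:.3f}")
```

The volume shrinks toward zero as the four vectors point in the same direction and reaches 1 when they are mutually orthogonal, so a contrastive objective built on it can score all modalities jointly rather than one pair at a time.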
