Addressing Model Overcomplexity in Drug-Drug Interaction Prediction With Molecular Fingerprints

Manel Gil-Sorribes, Alexis Molina

TL;DR

This work tackles the problem of accurately predicting drug–drug interactions and affinities with a low-complexity baseline. It compares Morgan fingerprints (MFPS), pretrained and non-pretrained GCN embeddings, and MoLFormer embeddings within a small neural network architecture, across leak-prone and leak-proof splits. Findings show MFPS and, to a lesser extent, MoLFormer and GCN embeddings offer competitive performance against state-of-the-art models, with strong explainability via gradient-based attribution and motif analysis. The study highlights dataset limitations as a key factor in generalization and emphasizes the value of interpretable baselines and better data curation for progressive complexity scaling, with a formal emphasis on $\log_2(\text{AUC FC})$ as a target metric in DDA tasks.

Abstract

Accurately predicting drug-drug interactions (DDIs) is crucial for pharmaceutical research and clinical safety. Recent deep learning models often suffer from high computational costs and limited generalization across datasets. In this study, we investigate a simpler yet effective approach that feeds molecular representations, namely Morgan fingerprints (MFPS), graph-based embeddings from graph convolutional networks (GCNs), and transformer-derived embeddings from MoLFormer, into a straightforward neural network. We benchmark our implementation on DrugBank DDI splits and a drug-drug affinity (DDA) dataset from the Food and Drug Administration. MFPS, along with MoLFormer and GCN representations, achieve competitive performance across tasks, even in the more challenging leak-proof split, highlighting the sufficiency of simple molecular representations. Moreover, gradient-based analyses of the representations under study let us identify key molecular motifs and structural patterns relevant to drug interactions. Despite these results, dataset limitations such as insufficient chemical diversity, limited dataset size, and inconsistent labeling hinder robust evaluation and call into question the need for more complex approaches. Our work provides a meaningful baseline and emphasizes the need for better dataset curation and progressive complexity scaling.
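To make the idea of a small network over fingerprint pairs with gradient-based attribution more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the 8-bit vectors are hand-written stand-ins for real Morgan fingerprints, the weights are random rather than trained, and combining the pair by concatenation, the single hidden layer, and the gradient-times-input attribution rule are choices made here, not details confirmed by the paper.

```python
import math
import random


def forward(x, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a sigmoid output (toy model)."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w1, b1)]
    hidden = [max(0.0, p) for p in pre]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    prob = 1.0 / (1.0 + math.exp(-logit))
    return hidden, prob


def input_gradient(x, w1, b1, w2, b2):
    """Analytic gradient of the output probability w.r.t. each input bit."""
    hidden, prob = forward(x, w1, b1, w2, b2)
    dprob_dlogit = prob * (1.0 - prob)  # derivative of the sigmoid
    return [
        # Only hidden units with a positive activation pass gradient (ReLU).
        dprob_dlogit
        * sum(w2[k] * w1[k][j] for k in range(len(hidden)) if hidden[k] > 0.0)
        for j in range(len(x))
    ]


# Hypothetical sizes and untrained random weights, for illustration only.
rng = random.Random(0)
N_BITS, N_HIDDEN = 8, 4
w1 = [[rng.uniform(-1, 1) for _ in range(2 * N_BITS)] for _ in range(N_HIDDEN)]
b1 = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
w2 = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
b2 = 0.0

# Hand-written bit vectors standing in for the fingerprints of two drugs.
drug_a = [1, 0, 0, 1, 0, 1, 0, 0]
drug_b = [0, 1, 1, 0, 0, 0, 0, 1]
pair = [float(bit) for bit in drug_a + drug_b]  # pair combined by concatenation

_, prob = forward(pair, w1, b1, w2, b2)
grads = input_gradient(pair, w1, b1, w2, b2)

# Gradient-times-input: bits that are not set receive zero attribution.
attributions = [g * x for g, x in zip(grads, pair)]

print(f"predicted interaction probability: {prob:.3f}")
for index, value in enumerate(attributions):
    if pair[index] == 1.0:
        print(f"bit {index}: attribution {value:+.4f}")
```

With real fingerprints, each set bit corresponds to one or more atom environments, so bits with large attributions could in principle be mapped back to substructures. That mapping needs a cheminformatics toolkit and is not shown here.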

Paper Structure

This paper contains 18 sections, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Neural network architecture for DDI prediction, consisting of an encoder for feature extraction and a classifier for interaction classification.
  • Figure 2: Label distribution across datasets. Left: Distribution of the first proposed dataset splits, with labels defined according to motifsddi. Right: Distribution in the split proposed by knowddi, using its respective label definitions.
  • Figure 3: Highlighted sulfur-containing motifs in Ritonavir (left) and Cobicistat (right) found using the MFPS encodings.
  • Figure 4: Performance visualization of the DDA regression model using MFPS embeddings. The results highlight the relationship between predicted and ground truth values of $\log_2(\text{AUC FC})$.
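The regression target $\log_2(\text{AUC FC})$ named in Figure 4 can be illustrated with a short worked example. The text above does not define AUC FC, so this sketch assumes it is the fold change in a drug's exposure (area under the concentration-time curve) when given with a second drug, relative to when given alone. The function name and the numbers are hypothetical and are not taken from the dataset.

```python
import math


def log2_auc_fold_change(auc_with_partner, auc_alone):
    """log2 of the assumed fold change AUC(with partner) / AUC(alone)."""
    if auc_with_partner <= 0 or auc_alone <= 0:
        raise ValueError("AUC values must be positive")
    return math.log2(auc_with_partner / auc_alone)


# Illustrative values only.
print(log2_auc_fold_change(400.0, 100.0))  # 2.0: exposure quadruples
print(log2_auc_fold_change(50.0, 100.0))   # -1.0: exposure halves
print(log2_auc_fold_change(100.0, 100.0))  # 0.0: no change
```

Under this assumption, the log scale makes increases and decreases symmetric around zero: a doubling maps to +1 and a halving to -1.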