Pharmacology Knowledge Graphs: Do We Need Chemical Structure for Drug Repurposing?

Youssef Abo-Dahab, Ruby Hernandez, Ismael Caleb Arechiga Duran

TL;DR

Results show that drug pharmacological behavior can be accurately predicted using target-centric information and drug network topology alone, without requiring explicit chemical structure representations.

Abstract

The contributions of model complexity, data volume, and feature modalities to knowledge graph-based drug repurposing remain poorly quantified under rigorous temporal validation. We constructed a pharmacology knowledge graph from ChEMBL 36 comprising 5,348 entities: 3,127 drugs, 1,156 proteins, and 1,065 indications. A strict temporal split was enforced, with training data up to 2022 and test data from 2023 to 2025, together with biologically verified hard negatives mined from failed assays and clinical trials. We benchmarked five knowledge graph embedding models and a standard graph neural network with 3.44 million parameters that encodes drug chemical structure with a graph attention encoder and represents proteins with ESM-2 embeddings. Scaling experiments ranging from 0.78 to 9.75 million parameters and from 25% to 100% of the data, together with feature ablation studies, were used to isolate the contributions of model capacity, graph density, and node feature modalities. Removing the graph-attention-based drug structure encoder and retaining only topological embeddings combined with ESM-2 protein features improved drug-protein PR-AUC from 0.5631 to 0.5785 while reducing VRAM usage from 5.30 GB to 353 MB. Replacing the drug encoder with Morgan fingerprints further degraded performance, indicating that explicit chemical structure representations can be detrimental for predicting pharmacological network interactions. Increasing model size beyond 2.44 million parameters yielded diminishing returns, whereas increasing training data consistently improved performance. External validation confirmed 6 of the top 14 novel predictions as established therapeutic indications. These results show that drug pharmacological behavior can be accurately predicted using target-centric information and drug network topology alone, without requiring explicit chemical structure representations.
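The temporal split described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Edge` record, its `year` field, the `temporal_split` helper, and the toy identifiers are all hypothetical, and the step that drops test edges already present in training is an assumption about how leakage might be avoided. Hard-negative mining is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One knowledge-graph triple with a timestamp (hypothetical schema)."""
    head: str      # e.g. a drug identifier
    relation: str  # e.g. "targets" or "treats"
    tail: str      # e.g. a protein or indication identifier
    year: int      # assumed: year the association was first recorded


def temporal_split(edges, train_until=2022, test_from=2023, test_until=2025):
    """Split edges by year: train <= train_until, test in [test_from, test_until]."""
    train = [e for e in edges if e.year <= train_until]
    test = [e for e in edges if test_from <= e.year <= test_until]
    # Assumption: a triple already seen in training is not a novel test case,
    # so it is removed from the test set to avoid leakage.
    seen = {(e.head, e.relation, e.tail) for e in train}
    test = [e for e in test if (e.head, e.relation, e.tail) not in seen]
    return train, test


# Toy data with made-up identifiers.
edges = [
    Edge("drug_A", "targets", "protein_X", 2019),
    Edge("drug_A", "treats", "indication_P", 2021),
    Edge("drug_B", "targets", "protein_Y", 2023),
    Edge("drug_A", "targets", "protein_X", 2024),  # repeat of a training triple
    Edge("drug_C", "treats", "indication_Q", 2025),
]
train_edges, test_edges = temporal_split(edges)
print(len(train_edges), len(test_edges))  # prints "2 2"
```

Splitting on time rather than at random means the test set contains only associations recorded after the training cutoff, which is closer to how a repurposing model would be used in practice.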

Paper Structure (34 sections, 5 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: The scatter plot illustrates the trade-off between predictive performance (Drug-Protein PR-AUC) and computational cost (Peak VRAM, log scale). While highly expressive architectures like the Blackwell GNN (purple diamond) establish the predictive ceiling (0.5910 PR-AUC) at a massive memory cost, scaling structural model complexity exhibits severe diminishing returns. Crucially, the Efficient Topological Model (green star), which ablates explicit 2D/3D drug structure encoders in favor of pure topological embeddings and ESM-2 protein features, achieves about 98% of that ceiling (0.5785 vs. 0.5910 PR-AUC) while using less than 1% of the memory footprint. This demonstrates that for macro-scale repurposing, combining relational topology with high-fidelity target representations offers a far better efficiency trade-off than computationally heavy small-molecule graph convolutions.
  • Figure 2: Benchmark comparison of Knowledge Graph Embeddings (KGEs) versus the Standard GNN. Performance is measured by Area Under the Precision-Recall Curve (PR-AUC) for both Drug-Protein (blue) and Drug-Indication (orange) link prediction tasks. While shallow KGEs like ComplEx and DistMult offer highly memory-efficient baselines, the Standard GNN establishes a significantly higher predictive ceiling due to the integration of ESM-2 protein representations.
  • Figure 3: Feature ablation study isolating the impact of structural versus topological features on Drug-Protein PR-AUC. Removing the explicit 2D/3D drug graphs (Ablation 2) yields the highest predictive performance, outperforming both the full GAT-based model and the Morgan fingerprint (MLP) baseline. This demonstrates that for macro-scale repurposing of approved drugs, pure relational topology provides a superior signal compared to static chemical structures. Conversely, removing the ESM-2 protein sequence representations (Ablations 1 and 3) severely degrades performance.
  • Figure 4: Scaling laws contrasting data volume versus model complexity. The plot tracks Drug-Protein PR-AUC as resources are scaled relative to the 3.44M parameter baseline on the full graph. The model is highly sensitive to graph density (blue line), showing a rapid decline as training data is reduced. In contrast, reducing the network's parameter capacity by over 50% (red line) results in negligible performance loss. Notably, a lightweight model (1.66M parameters) trained on the complete knowledge graph outperforms the full-size baseline model (3.44M parameters) constrained to 50% of the data.
  • Figure 5: Computational footprint scaling. (Left) Total training time scales linearly with graph density (data scaling) due to the computational overhead of message-passing over edges. Altering the parameter count (hidden dimension) has virtually no impact on training speed. (Right) Peak GPU VRAM remains static at $\sim$5.2 GB across all scales, indicating that memory bottlenecks are driven by static node features (e.g., ESM-2 embeddings) and graph topology, rather than the model's learnable weight matrices.
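The captions above report PR-AUC for link prediction, but the text here does not say which estimator was used. As an illustration only, the sketch below computes average precision, one common way to summarize a precision-recall curve. The function name `average_precision` and the toy labels and scores are hypothetical, and tied scores are not given any special treatment.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision@k over the ranks k of the positives.

    labels: iterable of 0/1 ground-truth values (1 = true link).
    scores: iterable of model scores, higher meaning more likely a link.
    Note: tied scores are ordered arbitrarily by the sort, which is a
    simplification compared with library implementations.
    """
    labels = list(labels)
    scores = list(scores)
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank candidate links from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / n_pos


# Toy example: two true links among five candidates (made-up scores).
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
ap = average_precision(toy_labels, toy_scores)
print(round(ap, 4))  # (1/1 + 2/3) / 2 = 0.8333
```

Because the metric depends only on how positives are ranked against negatives, using harder negatives, such as the failed assays and trials mentioned in the abstract, would generally lower the score compared with random negatives.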