A Morse Transform for Drug Discovery

Alexander M. Tanaka, Aras T. Asaad, Richard Cooper, Vidit Nanda

TL;DR

This work addresses ligand-based virtual screening under data scarcity by introducing a topology-driven descriptor based on piecewise-linear Morse theory. Ligands are modeled as pruned Delaunay complexes and analyzed across many directions to produce a 72-dimensional Morse feature vector that captures boundary topology via the Morse data of critical points; a lightweight classifier (LightGBM) is then used for binary active/decoy ranking. Chemistry-aware extensions further boost performance, achieving state-of-the-art AUROC on DUD-E (up to $0.97\pm0.03$) and strong results on MUV (up to $0.74\pm0.12$), while maintaining interpretability and scalability. The approach shows robustness to sampling depth and directional resolution, and demonstrates that explicit geometric-topological descriptors can rival or surpass deep learning methods in LBVS with far fewer training examples.

Abstract

We introduce a new ligand-based virtual screening (LBVS) framework that uses piecewise linear (PL) Morse theory to predict ligand binding potential. We model ligands as simplicial complexes via a pruned Delaunay triangulation, and catalogue the critical points across multiple directional height functions. This produces a rich feature vector, consisting of crucial topological features -- peaks, troughs, and saddles -- that characterise ligand surfaces relevant to binding interactions. Unlike contemporary LBVS methods that rely on computationally-intensive deep neural networks, we require only a lightweight classifier. The Morse theoretic approach achieves state-of-the-art performance on standard datasets while offering an interpretable feature vector and scalable method for ligand prioritization in early-stage drug discovery.
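
To make the idea of cataloguing peaks and troughs of directional height functions concrete, here is a minimal two-dimensional sketch. It is not the authors' implementation: the paper works with pruned Delaunay complexes, whereas this toy uses a hypothetical $W$-shaped polyline, and the coordinates, the helper names `height` and `critical_vertices`, and the eight evenly spaced planar directions are all assumptions made for illustration. The height of a point in direction $\xi$ is taken to be the inner product of the point with $\xi$.

```python
import math

def height(point, direction):
    """Height of `point` along `direction`: the standard inner product."""
    return sum(p * d for p, d in zip(point, direction))

def critical_vertices(polyline, direction):
    """Interior vertices that are strict local minima (troughs) or
    maxima (peaks) of the height function along `direction`.
    Endpoints are skipped because the polyline is only a piece of a boundary."""
    h = [height(p, direction) for p in polyline]
    found = []
    for i in range(1, len(h) - 1):
        if h[i] < h[i - 1] and h[i] < h[i + 1]:
            found.append((i, "trough"))
        elif h[i] > h[i - 1] and h[i] > h[i + 1]:
            found.append((i, "peak"))
    return found

# Hypothetical W-shaped boundary fragment (coordinates are illustrative only).
w_curve = [(0.0, 2.0), (1.0, 0.0), (2.0, 1.0), (3.0, 0.0), (4.0, 2.0)]

# Critical vertices for the vertical height function.
vertical = critical_vertices(w_curve, (0.0, 1.0))

# A crude directional descriptor: the number of critical vertices seen
# from each of 8 evenly spaced unit directions in the plane (an assumption).
directions = [(math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8))
              for k in range(8)]
counts = [len(critical_vertices(w_curve, d)) for d in directions]
```

Along the vertical direction the toy curve has two troughs and one peak. Collecting such counts over many directions gives a small vector in the spirit of the Morse feature vector, although the 72-dimensional descriptor described above records more than raw counts.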

Paper Structure

This paper contains 27 sections, 9 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Two-dimensional slices of a protein binding pocket along with four candidate ligands, $A, B, C$ and $D$. Here $A$ binds tightly with the target as drawn, while $B$ binds after realignment. Both $C$ and $D$ are geometrically incompatible with the target. Note that the boundary of the binding pocket, drawn here as a $W$-shaped curve, must be (at least approximately) reflected in the surfaces of $A$ and $B$ for tight binding to be possible. Although this $W$-shaped region appears in the boundary of $D$, it is obstructed by the protrusion occurring on the left side.
  • Figure 2: The three critical values $a_1, a_2$ and $a_3$ of the $W$-shaped boundary region from Figure 1 for the vertical height function. These occur precisely at the heights of the critical points $x_1, x_2$ and $x_3$ where the tangent space is horizontal. Note that the boundaries of ligands $A$ and $B$ from Figure 1 would exhibit this critical value pattern along some direction. The ligand $C$ has completely different critical values from all directions, and ligand $D$ exhibits two additional critical values due to the obstructive protrusion in its boundary.
  • Figure 3: The neighbourhood of the vertex $v$ in the illustrated simplicial complex has vertices $\{u_1,\ldots,u_8\}$; its higher-dimensional simplices are $(u_1u_2u_3v), (u_1u_7v), (u_4u_6v), (u_5v), (u_7u_8v)$ plus all their faces. The upper link of $v$ with respect to the vertical height function $f_\xi$ is the subcomplex of this neighbourhood generated by those neighbours which are higher than $v$ -- explicitly, this is the blue region containing the 2-simplex $(u_1u_2u_3)$ plus all its faces along with the isolated vertex $u_4$. Since the reduced Betti number of the upper link is non-trivial in dimension 0, it follows that $v$ is a critical vertex for $f_\xi$. If we considered the same figure rotated clockwise by 90 degrees so that only $\{u_1, u_2, u_7,u_8\}$ were above $v$, then the upper link would have trivial Betti numbers and $v$ would be non-critical for the corresponding height function.
  • Figure 4: The mean AUROC score against depth of our LGBM classifier trained on Morse feature vectors computed using 100 directions (a), 32 pentakis dodecahedral directions (b), 12 icosahedral directions (c), 8 cubic directions (d) and 1 direction (e) for the D8 subset. Error bars are 95% confidence intervals.
  • Figure 5: The mean AUROC score against depth of the LGBM classifier trained on chemically-enhanced Morse feature vectors computed using 32 directions (a) and Morse feature vectors computed using 100 directions (c) for the D8 subset. For comparison, the best-performing external shape and chemistry-based feature UCT (b) and the best-performing external shape-based feature Wu (d) are plotted with dotted lines. Error bars are 95% confidence intervals.
  • ...and 6 more figures
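
The upper-link test described in the caption of Figure 3 can be checked by hand on that example. The sketch below is a minimal illustration, not the authors' code: the function names are hypothetical, the two sets of "higher" neighbours are read off from the caption, and only the dimension-0 reduced Betti number (number of connected components minus one) is computed. A full criticality test would also need the higher-dimensional reduced Betti numbers, and an empty upper link would need separate handling.

```python
from itertools import combinations

def induced_subcomplex(maximal_simplices, allowed):
    """All simplices (as frozensets) whose vertices all lie in `allowed`."""
    simplices = set()
    for simplex in maximal_simplices:
        kept = [u for u in simplex if u in allowed]
        for k in range(1, len(kept) + 1):
            simplices.update(frozenset(c) for c in combinations(kept, k))
    return simplices

def reduced_betti_0(simplices):
    """Connected components minus one, for a non-empty complex."""
    vertices = {u for s in simplices for u in s}
    if not vertices:
        raise ValueError("empty complex: handle this case separately")
    parent = {u: u for u in vertices}

    def find(u):
        # Union-find root lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for s in simplices:
        if len(s) == 2:  # every edge merges the components of its endpoints
            a, b = tuple(s)
            parent[find(a)] = find(b)
    return len({find(u) for u in vertices}) - 1

# Link of v in Figure 3: each listed simplex with v removed.
link_of_v = [("u1", "u2", "u3"), ("u1", "u7"), ("u4", "u6"), ("u5",), ("u7", "u8")]

# Neighbours above v, as stated in the caption (vertical, then rotated).
upper_vertical = {"u1", "u2", "u3", "u4"}
upper_rotated = {"u1", "u2", "u7", "u8"}

b0_vertical = reduced_betti_0(induced_subcomplex(link_of_v, upper_vertical))
b0_rotated = reduced_betti_0(induced_subcomplex(link_of_v, upper_rotated))
```

For the vertical direction the upper link has two components (the filled triangle and the isolated vertex $u_4$), so `b0_vertical` is 1 and $v$ is critical. After rotation the upper link is the path $u_2, u_1, u_7, u_8$, which is connected, so `b0_rotated` is 0, consistent with the caption's statement that $v$ is then non-critical.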