Molecular Embedding-Based Algorithm Selection in Protein-Ligand Docking

Jiabao Brad Wang, Siyuan Cao, Hongxuan Wu, Yiliang Yuan, Mustafa Misir

TL;DR

Protein–ligand docking performance is highly context-dependent, so the paper develops MolAS, a lightweight algorithm selector that predicts per-algorithm performance from pretrained protein and ligand embeddings using a simple attentional pooler and a residual decoder. With only hundreds to a few thousand labelled complexes, MolAS achieves in-domain gains, closing 17-66% of the VBS–SBS gap across five benchmarks, and it doubles as a diagnostic tool for assessing when algorithm selection is feasible. The key finding is that the main bottleneck is instability in solver rankings across pose-generation workflows rather than representational capacity; strong cross-protocol generalization remains challenging. Ablation and comparative analyses show the approach is data-efficient and architecture-agnostic within the small-data regime, but robust cross-protocol performance will require modeling workflow shifts or standardizing evaluation pipelines.

Abstract

Selecting an effective docking algorithm is highly context-dependent, and no single method performs reliably across structural, chemical, or protocol regimes. We introduce MolAS, a lightweight algorithm selection system that predicts per-algorithm performance from pretrained protein-ligand embeddings using attentional pooling and a shallow residual decoder. With only hundreds to a few thousand labelled complexes, MolAS achieves up to 15% absolute improvement over the single-best solver (SBS) and closes 17-66% of the Virtual Best Solver (VBS)-SBS gap across five diverse docking benchmarks. Analyses of reliability, embedding geometry, and solver-selection patterns show that MolAS succeeds when the oracle landscape exhibits low entropy and separable solver behaviour, but collapses under protocol-induced hierarchy shifts. These findings indicate that the main barrier to robust docking AS is not representational capacity but instability in solver rankings across pose-generation regimes, positioning MolAS as both a practical in-domain selector and a diagnostic tool for assessing when AS is feasible.
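
To make the reported metric concrete, the sketch below computes the SBS, the VBS, and the fraction of the VBS-SBS gap closed by a selector on a toy success matrix. It assumes the definition commonly used in algorithm selection, (selector - SBS) / (VBS - SBS); the paper's exact formulation may differ. The data, the function names, and the choice to compute the SBS on the same complexes are all illustrative assumptions.

```python
from typing import List, Sequence


def sbs_rate(success: Sequence[Sequence[int]]) -> float:
    """Success rate of the single best solver: the best column average."""
    n_complexes = len(success)
    n_algorithms = len(success[0])
    return max(
        sum(row[j] for row in success) / n_complexes
        for j in range(n_algorithms)
    )


def vbs_rate(success: Sequence[Sequence[int]]) -> float:
    """Success rate of the virtual best solver: an oracle pick per complex."""
    return sum(max(row) for row in success) / len(success)


def selector_rate(success: Sequence[Sequence[int]], picks: Sequence[int]) -> float:
    """Success rate when algorithm picks[i] is run on complex i."""
    return sum(row[p] for row, p in zip(success, picks)) / len(success)


def gap_closed(selector: float, sbs: float, vbs: float) -> float:
    """Fraction of the VBS-SBS gap closed (assumed definition, see lead-in)."""
    gap = vbs - sbs
    if gap <= 0:
        raise ValueError("VBS must exceed SBS for the gap to be defined")
    return (selector - sbs) / gap


# Toy data (hypothetical): success[i][j] = 1 if algorithm j succeeds on complex i.
success: List[List[int]] = [
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
]
picks = [0, 0, 1, 0]  # hypothetical selector output, one algorithm index per complex

sbs = sbs_rate(success)
vbs = vbs_rate(success)
sel = selector_rate(success, picks)
closed = gap_closed(sel, sbs, vbs)
print(f"SBS={sbs:.2f} VBS={vbs:.2f} selector={sel:.2f} gap closed={closed:.0%}")
```

A value of 0 means the selector is no better than always running the SBS, 1 means it matches the oracle, and negative values mean it is worse than the SBS.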

Paper Structure

This paper contains 27 sections, 7 equations, 14 figures, 4 tables, 1 algorithm.

Figures (14)

  • Figure 2: Overview of MolAS for workflow-specific molecular docking algorithm selection. (Left) Label data acquisition and scoring. (Right) MolAS model.
  • Figure 3: Schematic of the algorithm selection mapping in the sense of Rice (1976). Each instance $\mathbf{x}$ is mapped to a feature vector $f(\mathbf{x})$, from which a predictor $g$ produces a vector $\hat{\mathbf s}$ of estimated performances for all algorithms in the portfolio $\mathcal{A}$. The selector chooses the algorithm with maximal predicted performance.
  • Figure 4: Candidate docking algorithms for algorithm selection
  • Figure 5: $s_{RMSD}(x;\lambda)$ curves when $\lambda\in\{1, 3, 5\}$.
  • Figure 6: Selection frequencies by MolAS (left bar chart) and by the VBS (middle bar chart), and success rates under the relaxed ($\text{RMSD}\leq x~\text{\AA}\ \&\ \text{PB-valid}$) criterion (right bar chart), for the top-3 algorithms selected by MolAS in the concatenated 5-fold results over benchmarks. The benchmarks are listed in descending order of the VBS-SBS gap closed by MolAS under the relaxed criterion. The correct picks by MolAS (those that accord with the VBS) are shown in deeper colours, and the SBS method for each benchmark is in bold.
  • ...and 9 more figures
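
The selection mapping described in the Figure 3 caption can be sketched in a few lines: featurize the instance, predict one score per algorithm, and take the argmax. This is a minimal illustration, not the MolAS model; the featurizer, the predictor, and the solver names below are hypothetical stand-ins.

```python
from typing import Callable, List, Sequence


def select_algorithm(
    x: str,
    featurize: Callable[[str], List[float]],
    predict: Callable[[List[float]], List[float]],
    portfolio: Sequence[str],
) -> str:
    """Return the portfolio member with the highest predicted performance."""
    features = featurize(x)    # f(x)
    scores = predict(features)  # s_hat = g(f(x)), one entry per algorithm
    if len(scores) != len(portfolio):
        raise ValueError("predictor must return one score per algorithm")
    best = max(range(len(portfolio)), key=lambda j: scores[j])
    return portfolio[best]


# Hypothetical portfolio and toy components, for illustration only.
portfolio = ["solver_a", "solver_b", "solver_c"]


def toy_featurize(x: str) -> List[float]:
    # Stand-in for a pretrained embedding: just the string length.
    return [float(len(x))]


def toy_predict(features: List[float]) -> List[float]:
    # Stand-in for the learned predictor g.
    size = features[0]
    return [0.1 * size, 0.5, 1.0 - 0.1 * size]


small_choice = select_algorithm("CCO", toy_featurize, toy_predict, portfolio)
large_choice = select_algorithm("CCCCCCCC", toy_featurize, toy_predict, portfolio)
print(small_choice, large_choice)
```

Predicting a full score vector rather than a single class label keeps the per-algorithm estimates available for analysis, for example when comparing selector picks against the oracle.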