Empowering Chemical Structures with Biological Insights for Scalable Phenotypic Virtual Screening

Xiaoqing Lian, Pengsen Ma, Tengfeng Ma, Zhonghao Ren, Xibao Cai, Zhixiang Cheng, Bosheng Song, He Wang, Xiang Pan, Yangyang Chen, Sisi Yuan, Chen Lin

Abstract

Motivation: The scalable identification of bioactive compounds is essential for contemporary drug discovery. This process faces a key trade-off: structural screening offers scalability but lacks biological context, whereas high-content phenotypic profiling provides deep biological insights but is resource-intensive. The primary challenge is to extract robust biological signals from noisy data and encode them into representations that do not require biological data at inference.

Results: This study presents DECODE (DEcomposing Cellular Observations of Drug Effects), a framework that bridges this gap by enriching chemical representations with intrinsic biological semantics, enabling structure-based in silico biological profiling. DECODE uses limited paired transcriptomic and morphological data as supervisory signals during training, which allows it to extract a measurement-invariant biological fingerprint from chemical structures and to explicitly filter out experimental noise. Our evaluations demonstrate that DECODE retrieves functionally similar drugs in zero-shot settings and achieves over 20% relative improvement over chemical baselines in mechanism-of-action (MOA) prediction. Furthermore, the framework achieves a 6-fold increase in hit rates for novel anti-cancer agents during external validation.

Availability and implementation: The code and datasets of DECODE are available at https://github.com/lian-xiao/DECODE.

Paper Structure (14 sections, 5 equations, 5 figures)

Figures (5)

  • Figure 1: (a) The DECODE Framework for Modal Augmentation in Drug Discovery. Constructing a Unified Biological Consensus: The architecture integrates chemical structures with high-content transcriptomic and morphological profiles. It uses Contrastive Learning to align heterogeneous views into a shared latent space and Orthogonal Constraints to separate the measurement-invariant biological signal from modality-specific artifacts. A self-reconstruction task with modality masking ensures the learned fingerprint is robust to missing data. (b) Structure-Only Inference and Applications: The trained model enables high-fidelity in silico biological profiling using only chemical inputs. It supports Zero-Shot Retrieval, identifying functionally similar drugs despite structural diversity. For virtual screening, a 'Generate-Refine-Enhance' pipeline integrates biological context, achieving a 6-fold increase in hit rates for novel active compounds compared to standard methods.
  • Figure 2: Geometric Analysis of Latent Disentanglement: t-SNE visualizations of the learned feature spaces. The plots reveal that the Shared Encoder successfully aligns heterogeneous modalities into a unified biological consensus (overlapping clusters in Shared Features), while the Orthogonal Constraints force modality-specific artifacts into distinct, non-overlapping subspaces (Unique Features), confirming effective signal purification.
  • Figure 3: Zero-Shot Functional Retrieval and Generalization to Novel Chemical Spaces. (a) Quantitative Retrieval Evaluation: Comparative analysis of retrieval metrics (Recall, Precision, and Mean Average Precision vs. Enrichment) in Novel Chemical Space. The DECODE-BM variant (Dual-Profiles Missing, or Structure-Only Inference) consistently outperforms single-modality baselines, showing the model's ability to generalize biological insights to unseen chemical entities without wet-lab data. (b) Visualizing Functional Alignment: t-SNE projections highlight the 'Chelating Agent' class (e.g., Clioquinol and Dipyrocetyl). Despite significant structural dissimilarity, as shown by the dispersed Drug Embedding, DECODE's biological fingerprint clusters these functionally related drugs. This semantic grouping remains robust even in the Dual-Profiles Missing scenario, confirming that the model has disentangled the shared Mechanism of Action signal from structural and experimental variations.
  • Figure 4: Comparative Mechanism of Action (MOA) Prediction and Geometric Disentanglement Analysis. Performance is compared using Macro F1-Score, Precision, and Recall across the LINCS (a) and CDRP (b) datasets. DECODE consistently outperforms Single-View (Structure or Bio Only) and standard Fusion baselines (Early or Late Fusion).
  • Figure 5: (a) DECODE outperforms a structure-only model (Molformer) in MOA prediction on the NOCA dataset. (b) t-SNE visualization reveals that DECODE learns a more coherent latent space, grouping functionally related drugs, such as the highlighted sodium channel blockers, more effectively. (c) In external anti-cancer screening, DECODE achieves a higher AUC than the chemical baseline. (d) The structural visualization of sodium channel blockers. DECODE achieves a higher macro-AUC in predicting drug pathways on the MCELC dataset. The shaded area represents the 95% confidence interval derived from the distribution of AUCs across all pathway classes.
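
The Figure 1 caption names two training ingredients: contrastive learning that aligns views in a shared latent space, and orthogonal constraints that separate the shared signal from modality-specific features. The sketch below shows one common way such terms are written, using NumPy. It is an illustration under assumptions, not the DECODE implementation: the symmetric InfoNCE form, the cross-covariance penalty, the temperature value, and the function names `info_nce` and `orthogonality_penalty` are all hypothetical choices, since the paper's equations are not reproduced on this page.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE loss (assumed form, not taken from the paper).

    Row i of `a` (e.g. a structure embedding) and row i of `b`
    (e.g. a transcriptomic or morphological embedding) form a positive
    pair; every other row in the batch acts as a negative.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (batch, batch) scaled cosine similarities

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def orthogonality_penalty(shared, unique):
    """Squared Frobenius norm of the cross-covariance between shared and
    unique features (assumed form). It is zero when every shared
    dimension is uncorrelated with every unique dimension over the batch.
    """
    s = shared - shared.mean(axis=0)
    u = unique - unique.mean(axis=0)
    n = shared.shape[0]
    return float(np.sum((s.T @ u) ** 2)) / n ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    chem = rng.normal(size=(8, 4))                      # toy structure embeddings
    bio = chem + 0.05 * rng.normal(size=(8, 4))         # toy paired bio embeddings
    unique = rng.normal(size=(8, 4))                    # toy modality-specific features
    total = info_nce(chem, bio) + 0.1 * orthogonality_penalty(bio, unique)
    print(f"toy combined loss: {total:.4f}")
```

In this toy setup the contrastive term falls as paired rows become more similar than unpaired ones, and the penalty falls as shared and unique features decorrelate. The 0.1 weighting between the two terms is arbitrary and only for illustration.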