Deep Active Learning based Experimental Design to Uncover Synergistic Genetic Interactions for Host Targeted Therapeutics

Haonan Zhu, Mary Silva, Jose Cadena, Braden Soper, Michał Lisicki, Braian Peetoom, Sergio E. Baranzini, Shivshankar Sundaram, Priyadip Ray, Jeff Drocco

TL;DR

This work tackles the slow exploration of synergistic host gene pairs for HIV inhibition by introducing a Deep Active Learning framework that leverages the SPOKE knowledge graph to efficiently navigate a 356×356 double-knockdown space. The method combines a Relational Graph Convolutional Network, DistMult edge prediction, and a bilinear regression predictor within an ensemble-based uncertainty framework, guided by multiple acquisition strategies. Empirical results show that the approach identifies the top gene-pair candidates with minimal experimental data (92% of the top 400 with <6.3% of the matrix observed), and pathway analyses show that the selected pairs capture biologically meaningful processes such as translation. The framework yields interpretable gene representations, and the authors suggest future enhancements such as warm-starting with large language models and incorporating richer node features for improved efficiency and biological insight.
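As a rough illustration of two of the components named above, the sketch below shows the standard DistMult triple score and a bilinear pair predictor over gene embeddings. The random embeddings stand in for R-GCN outputs, and the toy sizes, the names `distmult_score`, `bilinear_predict`, and `W_sym`, and the symmetrization step are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, dim = 5, 4                      # toy sizes, not the paper's 356 genes
emb = rng.normal(size=(n_genes, dim))    # stand-in for learned gene embeddings


def distmult_score(e_head, rel_diag, e_tail):
    """DistMult score of a triple: sum_k e_head[k] * rel_diag[k] * e_tail[k]."""
    return float(np.sum(e_head * rel_diag * e_tail))


def bilinear_predict(embeddings, weight):
    """Predict one value per gene pair (i, j) as embeddings[i]^T W embeddings[j]."""
    return embeddings @ weight @ embeddings.T


rel_diag = rng.normal(size=dim)          # one diagonal relation vector
W = rng.normal(size=(dim, dim))
# Assumption: symmetrize W so that pair (i, j) and pair (j, i) get the same value.
W_sym = 0.5 * (W + W.T)

s = distmult_score(emb[0], rel_diag, emb[1])
pred = bilinear_predict(emb, W_sym)      # shape (n_genes, n_genes)
```

DistMult is the special case of a bilinear form whose weight matrix is diagonal, which is why the two scoring functions look so similar.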

Abstract

Recent technological advances have introduced new high-throughput methods for studying host-virus interactions, but testing synergistic interactions between host gene pairs during infection remains relatively slow and labor intensive. Identification of multiple gene knockdowns that effectively inhibit viral replication requires a search over the combinatorial space of all possible target gene pairs and is infeasible via brute-force experiments. Although active learning methods for sequential experimental design have shown promise, existing approaches have generally been restricted to single-gene knockdowns or small-scale double knockdown datasets. In this study, we present an integrated Deep Active Learning (DeepAL) framework that incorporates information from a biological knowledge graph (SPOKE, the Scalable Precision Medicine Open Knowledge Engine) to efficiently search the configuration space of a large dataset of all pairwise knockdowns of 356 human genes in HIV infection. Through graph representation learning, the framework is able to generate task-specific representations of genes while also balancing the exploration-exploitation trade-off to pinpoint highly effective double-knockdown pairs. We additionally present an ensemble method for uncertainty quantification and an interpretation of the gene pairs selected by our algorithm via pathway analysis. To our knowledge, this is the first work to show promising results on double-gene knockdown experimental data of appreciable scale (356 by 356 matrix).
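The exploration-exploitation trade-off and the ensemble-based uncertainty mentioned above can be sketched as an acquisition step over ensemble predictions. This is a minimal sketch under stated assumptions: the function name `acquire`, the convention that lower predicted viral replication is better, and the reading of "optimism" as a low quantile of the ensemble predictions are illustrative choices, not the paper's confirmed definitions.

```python
import numpy as np


def acquire(ensemble_preds, observed_mask, batch_size, strategy="optimism", q=0.10):
    """Select indices of unobserved gene pairs to test next.

    ensemble_preds: array (n_models, n_pairs) of predicted viral replication.
    observed_mask:  boolean array (n_pairs,), True where a pair was already measured.
    """
    if strategy == "max_variance":
        # Pure exploration: prefer pairs on which the ensemble disagrees most.
        score = ensemble_preds.var(axis=0)
    elif strategy == "optimism":
        # Assumption: lower is better, so the optimistic estimate is a low quantile.
        score = -np.quantile(ensemble_preds, q, axis=0)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Observed pairs get the worst possible score so they are chosen last.
    score = np.where(observed_mask, -np.inf, score)
    return np.argsort(-score)[:batch_size]


preds = np.array([[1.0, 5.0, 3.0, 0.0],
                  [1.0, 1.0, 3.0, 0.0],
                  [1.0, 3.0, 3.0, 0.0]])
observed = np.array([False, False, False, True])
explore = acquire(preds, observed, batch_size=1, strategy="max_variance")
exploit = acquire(preds, observed, batch_size=1, strategy="optimism")
```

In this toy input the variance strategy picks the pair the ensemble disagrees on, while the optimism strategy picks the unobserved pair with the lowest optimistic prediction.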

Paper Structure

This paper contains 21 sections, 5 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Flowchart of the proposed deep active learning framework.
  • Figure 2: Comparison of the different acquisition strategies; results are summarized over $20$ replicates. Optimism with $10\%$ quantiles achieves the best coverage at the terminal phase, uncovering 92% of the top $400$ gene pairs while observing $<6.3\%$ of the entire matrix. The maximum variance strategy performs best at learning a generalizable model that predicts viral replication on unseen data, owing to its emphasis on exploration.
  • Figure 3: Ablation studies to validate our proposed approach; results are summarized over $20$ replicates. The comparison between the base model and the ensemble approaches shows that the ensemble method provides a meaningful uncertainty quantification that benefits coverage in the long term; the comparison between DeepAL-Ensemble and UF-Ensemble shows that the graph representation from SPOKE offers a more effective representation of the genes for downstream tasks; the comparison between DeepAL-Ensemble and DeepAL-Ensemble with random initialization shows that there is meaningful information in the SPOKE graph that can benefit the search for the best genes in the early rounds, but the benefit diminishes rapidly as more experimental data is collected; the comparison between DeepAL-Ensemble and FF-Ensemble shows that there are significant benefits to fine-tuning the embeddings as more data is collected.
  • Figure 4: Pathway enrichment analysis of the gene pairs selected by DeepAL. The left panel shows the biological process terms enriched when optimizing by maximum variance, and the right panel shows the terms enriched when optimizing by maximum optimism. The $x$-axis represents the rounds, while the $y$-axis shows the total number of Gene Ontology (GO) biological process terms that are most frequently selected by the algorithm.
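The coverage figure quoted in the Figure 2 caption (92% of the top $400$ gene pairs, which is about 368 pairs) can be read as a top-$k$ recall over the pairs observed so far. The sketch below assumes that reading, and also assumes that the best pairs are those with the lowest measured values; the function name `top_k_coverage` is hypothetical.

```python
import numpy as np


def top_k_coverage(true_values, observed_mask, k):
    """Fraction of the k best pairs (lowest true value) that have been observed."""
    top_k = np.argsort(true_values)[:k]          # indices of the k lowest values
    return float(observed_mask[top_k].mean())


# Toy example with four pairs; the two best pairs are indices 1 and 3.
true_values = np.array([0.5, 0.1, 0.9, 0.2])
observed = np.array([False, True, True, False])
coverage = top_k_coverage(true_values, observed, k=2)
```

Here only one of the two best pairs has been observed, so the coverage is one half.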