Looking for (Genomic) Needles in a Haystack: Sparsity-Driven Search for Identifying Correlated Genetic Mutations in Cancer

Ritvik Prabhu; Emil Vatai; Bernard Moussad; Emmanuel Jeannot; Ramu Anandakrishnan; Wu-chun Feng; Mohamed Wahib

Abstract

Cancer typically arises not from a single genetic mutation (i.e., hit) but from multi-hit combinations that accumulate within cells. However, enumerating multi-hit combinations becomes exponentially more expensive as the number of hits grows, because the number of candidate gene combinations is on the order of 20,000 choose h, where 20,000 is the number of genes in the human genome and h is the number of hits. To address this challenge, we present an algorithmic framework called Pruned Depth-First Search (P-DFS), which leverages the high sparsity in tumor mutation data to prune large portions of the search space. P-DFS, the main contribution of this paper, is a pruning technique grounded in depth-first search with backtracking. It exploits sparsity to drastically reduce the otherwise exponential h-hit search space of candidate combinations used by Weighted Set Cover: infeasible gene subsets are pruned early, while a weighted set cover formulation systematically scores and selects the most discriminative combinations. By combining these ideas with optimized bitwise operations and a scalable distributed algorithm on high-performance computing clusters, our algorithm achieves an approximately 90-98% reduction in visited combinations for 4-hits and roughly a 183x speedup over the exhaustive set cover approach (which is NP-complete), measured on 147,456 ranks. As a result, our method can feasibly handle four-hit and even higher-order gene combinations while remaining fast and resource-efficient.
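To make the "20,000 choose h" growth concrete, the following minimal sketch computes the size of the exhaustive search space for small h. The function name `search_space_size` and the default of 20,000 genes are illustrative assumptions taken from the figure quoted in the abstract, not code from the paper.

```python
from math import comb


def search_space_size(h: int, n_genes: int = 20_000) -> int:
    """Number of distinct h-gene combinations out of n_genes (n choose h).

    n_genes defaults to the 20,000 figure quoted in the abstract; the
    function name itself is a hypothetical helper for illustration.
    """
    return comb(n_genes, h)


if __name__ == "__main__":
    for h in range(2, 6):
        # Scientific notation makes the growth from one h to the next visible.
        print(f"h={h}: {search_space_size(h):.3e} combinations")
```

Under these assumptions, each additional hit multiplies the number of combinations by several thousand, which is why exhaustive enumeration is already on the order of 10^15 combinations at h = 4.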

Paper Structure (44 sections, 3 equations, 10 figures, 4 tables, 2 algorithms)

Introduction
Background and Related Work
Historical Foundations of the Multi-Hit Concept
Approaches for Identifying Multi-Hit Combinations
Weighted Set Cover (WSC) Paradigm
Scoring and Algorithmic Steps
Computational Complexity
Pruned and Parallel DFS for Candidate Generation
Graph-Based Representations
Speed vs. Coverage
Pruned Depth-First Search (P-DFS): Pruning Algorithm for $h$-hit Gene Mutations
Motivation
Overview
Preprocessing
Core Algorithm
...and 29 more sections

Figures (10)

Figure 1: Overview of workflow. Input: tumor and normal mutation matrices, represented as per-gene bitsets over samples. Previous method (exhaustive enumeration): for every $h$-hit gene combination, the prior pipeline performs a bitwise AND across the corresponding gene rows and evaluates the full candidate table, which quickly becomes intractable as $h$ grows. This paper (prune-then-evaluate): our main contribution is P-DFS, a sparsity-driven pruned depth-first search that massively prunes the $h$-hit search space before scoring (see §\ref{sec:algorithm} and Fig. \ref{fig:new-workflow}). This produces a substantially smaller candidate table to evaluate, yielding significant compute-time savings by avoiding bitwise-AND intersections for the vast majority of combinations. Selection step (common to both): choose the candidate combination with the highest tumor coverage and lowest normal coverage, and pass it into the weighted set cover procedure. The table at the top summarizes the resulting reduction in combinations explored, which becomes more pronounced as the number of hits $h$ increases.
Figure 2: Visualization of estimated hit numbers for each cancer type, from \cite{anandakrishnan2019estimating}. Derived from the public domain image by M. Haggstrom (2014) \cite{wiki:Hag}.
Figure 3: Measured sparsity of all the carcinogenesis mutation matrix datasets available through TCGA. Median: 95.61%. High sparsity underscores the rarity of mutations in all of the cancer datasets.
Figure 4: Work Distribution Overview: We use one MPI rank per core and two communicators: a global communicator among node leaders and a local communicator within each node. The local Rank 0 of each node is the node leader and also belongs to the global communicator. Because the database is only a few MB, Rank 0 reads it from disk and broadcasts the entire database to all node leaders. Each node leader then broadcasts the data within its node via the local communicator. Each node computes a distinct subset of $\lambda$ values, i.e., the parallelism is in the computation rather than in the data. The leader tracks outstanding $\lambda$ values and hands out fixed-size work chunks to the workers. When a worker finishes its portion of the computation, it sends a short request to the leader. If work remains, the leader issues another chunk; otherwise, the worker becomes idle. If a leader exhausts its local queue, it randomly selects a peer to steal from. A busy victim donates roughly half of its remaining range to the thief, while an idle peer triggers a retry with another node. Termination is decided with a circulating token that starts at Rank 0 in the white state (no work observed) and moves around the leaders in ring order. Any node that performs work or donates work flips the token to black (work observed). If the token returns to Rank 0 as black, it is reset to white and circulated again. If it returns white twice in a row, all the work has been distributed, and a termination signal is sent.
Figure 5: Pruned depth-first search (P-DFS): pruning algorithm for $h$-hit gene mutations (with $h=4$). In the preprocessing step, the input data is sorted by row sparsity to improve pruning efficiency. The first two of the four nested loops, with iterators $g_1$ and $g_2$, are flattened into a single $\lambda$-loop to enable work distribution across a large number of workers. In each iteration of the $\lambda$-loop, the partial combination using only rows $g_1$ and $g_2$ is computed. If all the tumor samples are zero/false, the combination cannot contribute to the set cover (empty cover), nor can any higher-hit combination that extends it, so the entire sub-tree can be skipped. Otherwise, the algorithm proceeds to the depth-3 nested loop of the search tree, computes the 3-hit combination (using the 2-hit partial results), and similarly either eliminates empty covers or proceeds to depth 4 as needed. The combination $\gamma^{*}$ with the best $F(\gamma, X)$ value is saved, and the samples (columns) covered by it are removed from the input database for the next iteration.
...and 5 more figures
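The pruning idea described in the Figure 5 caption can be sketched in a few lines. This is a minimal single-process illustration under several assumptions, not the authors' implementation: each gene row is a Python integer used as a bitset over samples, rows are sorted sparsest-first (the caption does not state the sort direction), the flattened $\lambda$-loop is modeled by iterating over gene pairs, and the score (tumor samples covered minus normal samples covered) is a hypothetical stand-in for $F(\gamma, X)$, whose definition is not given here. The names `pdfs` and `popcount` are likewise illustrative.

```python
from itertools import combinations


def popcount(bits: int) -> int:
    """Number of set bits, i.e. samples in which the mutation is present."""
    return bin(bits).count("1")


def pdfs(tumor, normal, h=4):
    """One pruned DFS pass over h-gene combinations (requires h >= 2).

    tumor[g] / normal[g] are integer bitsets over samples for gene g.
    Returns (best_genes, best_score, visited_nodes).
    """
    n = len(tumor)
    # Assumption: visit the sparsest tumor rows first so that partial
    # intersections become empty as early as possible.
    order = sorted(range(n), key=lambda g: popcount(tumor[g]))
    state = {"score": None, "genes": None, "visited": 0}

    def visit(t_mask, n_mask, genes, next_k):
        state["visited"] += 1
        if t_mask == 0:
            return  # empty tumor cover: no extension can cover anything
        if len(genes) == h:
            # Hypothetical score standing in for F(gamma, X).
            score = popcount(t_mask) - popcount(n_mask)
            if state["score"] is None or score > state["score"]:
                state["score"], state["genes"] = score, tuple(sorted(genes))
            return
        for k in range(next_k, n):
            g = order[k]
            # Reuse the partial intersection instead of recomputing it.
            visit(t_mask & tumor[g], n_mask & normal[g], genes + (g,), k + 1)

    # Flattened "lambda loop": one iteration per (g1, g2) pair.
    for i, j in combinations(range(n), 2):
        a, b = order[i], order[j]
        visit(tumor[a] & tumor[b], normal[a] & normal[b], (a, b), j + 1)

    return state["genes"], state["score"], state["visited"]


if __name__ == "__main__":
    # Toy data: 6 genes, 8 tumor samples, 4 normal samples.
    tumor = [0b00001111, 0b00001110, 0b00011100,
             0b00001100, 0b11000000, 0b00100000]
    normal = [0b0001, 0b0010, 0b0001, 0b0000, 0b0000, 0b0000]
    print(pdfs(tumor, normal, h=4))
```

In the full procedure described in the caption, the samples covered by the selected combination would then be removed and the search repeated. The sketch covers only a single pass, and the per-pair iterations of the outer loop are the units that would be distributed across workers.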
