Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows

Daniel Mas Montserrat, Ray Verma, Míriam Barrabés, Francisco M. de la Vega, Carlos D. Bustamante, Alexander G. Ioannidis

TL;DR

Large-scale precision medicine genomic pipelines process tens to hundreds of gigabytes per sample, causing memory spikes and failures under naive allocations. The authors introduce three RAM-aware chromosome-parallelization mechanisms: a static ordering to minimize peak memory under fixed concurrency, a dynamic knapsack-style packing with online RAM predictions, and a compact symbolic regression RAM predictor distilled from an ensemble teacher, augmented by a conformal bound for safety. In simulations and real-world Beagle-based workflows, the dynamic scheduler reduces makespan and overcommits, while the symbolic predictor provides strong, deployable priors enabling immediate concurrency; integration into the StrataRisk PRS pipeline yields substantial speedups and cost savings. Together, these methods enable memory-safe, scalable chromosome-level genomics across heterogeneous compute environments, with practical impact on turnaround times and clinical decision support.

Abstract

Large-scale genomic workflows used in precision medicine can process datasets spanning tens to hundreds of gigabytes per sample, leading to high memory spikes, intensive disk I/O, and task failures due to out-of-memory errors. Simple static resource allocation methods struggle to handle the variability in per-chromosome RAM demands, resulting in poor resource utilization and long runtimes. In this work, we propose multiple mechanisms for adaptive, RAM-efficient parallelization of chromosome-level bioinformatics workflows. First, we develop a symbolic regression model that estimates per-chromosome memory consumption for a given task and introduces an interpolating bias to conservatively minimize over-allocation. Second, we present a dynamic scheduler that adaptively predicts RAM usage with a polynomial regression model, treating task packing as a Knapsack problem to optimally batch jobs based on predicted memory requirements. Additionally, we present a static scheduler that optimizes chromosome processing order to minimize peak memory while preserving throughput. Our proposed methods, evaluated on simulations and real-world genomic pipelines, provide new mechanisms to reduce memory overruns and balance load across threads. We thereby achieve faster end-to-end execution, showcasing the potential to optimize large-scale genomic workflows.
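As a rough illustration of the knapsack-style packing described in the abstract, the sketch below selects, from the pending chromosome tasks, the subset whose predicted RAM best fills the currently free memory. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `pack_batch`, the integer-gigabyte rounding, and the example predictions are all hypothetical, and each task's value is simply set equal to its predicted RAM.

```python
from typing import Dict, List, Tuple


def pack_batch(predicted_gb: Dict[str, int], free_gb: int) -> List[str]:
    """Choose pending tasks whose predicted RAM best fills free_gb.

    Hypothetical helper: a 0/1 knapsack in which each task's value equals
    its weight (predicted RAM in whole GB), solved by dynamic programming.
    """
    # best[c] = (RAM used, chosen task names) when at most c GB are available
    best: List[Tuple[int, List[str]]] = [(0, []) for _ in range(free_gb + 1)]
    for name, need in predicted_gb.items():
        if need <= 0 or need > free_gb:
            continue  # this task cannot be placed in the current budget
        # Walk capacities downward so each task is selected at most once.
        for cap in range(free_gb, need - 1, -1):
            used, chosen = best[cap - need]
            if used + need > best[cap][0]:
                best[cap] = (used + need, chosen + [name])
    return best[free_gb][1]


# Illustrative (made-up) RAM predictions in GB for five pending chromosomes.
pending = {"chr1": 12, "chr2": 11, "chr21": 3, "chr22": 3, "chrX": 10}
batch = pack_batch(pending, free_gb=16)
print(batch, sum(pending[c] for c in batch))
```

In a dynamic scheduler, a routine like this would be called again whenever a task finishes and memory is released, with the predictions refreshed from the online model.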

Paper Structure

This paper contains 38 sections, 15 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Relationship between human chromosome number and its size.
  • Figure 2: Optimized order of chromosomes for the static scheduler for $K=2,3,5$. Each cross marks the chromosome being processed at a given step. A moving average of the chromosome number (orange line) indicates a balance between long and short chromosomes.
  • Figure 3: Scheduler Module Evaluation: (Packer Comparison) the knapsack packing produces the closest results to the theoretical limit. (With and Without LR Bias) Including LR Bias showcases a decrease in overcommits, without affecting the makespan. (Initialization type vs Random Order) the Smallest First initialization order produces the lowest makespan compared to the random initialization set. (Effect of Priors) Effect of incorporating priors given a task size.
  • Figure 4: RAM prediction results: (Predicted RAM vs. variants and samples) heatmap of $\widehat{y}$ for fixed $(V_{\mathrm{ref}},S_{\mathrm{ref}},\mathrm{Thr})$; (Pearson correlation) teacher ensemble vs. symbolic regressor (with/without distillation); (Mean absolute error) MAE on the test set; (Tree Ensembles Scatter) predicted vs. true RAM for the ensemble teacher; (Symbolic w/ distillation Scatter) predicted vs. true RAM for the distilled symbolic model; dashed curve shows the 80th-percentile adjustment.
  • Figure 5: Deployed impact of conservative priors. Beagle makespan and overcommits in StrataRisk™. The calibrated priors nearly halve the wall-clock time of the dynamic knapsack scheduler relative to the no-prior baseline.
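
The conservative adjustment mentioned in the Figure 4 and Figure 5 captions can be illustrated with a generic one-sided split-conformal margin: compute the residuals (true minus predicted RAM) on a held-out calibration set, take a high quantile, and add it to every new prediction. The sketch below is an assumption-laden example rather than the paper's procedure; the function names, the 0.8 coverage level, and the calibration numbers are hypothetical.

```python
import math
from typing import Sequence


def conformal_margin(y_true: Sequence[float], y_pred: Sequence[float],
                     coverage: float = 0.8) -> float:
    """One-sided split-conformal margin from calibration residuals.

    Returns the ceil((n + 1) * coverage)-th smallest residual, or infinity
    when the calibration set is too small to support the requested coverage.
    """
    residuals = sorted(t - p for t, p in zip(y_true, y_pred))
    n = len(residuals)
    k = math.ceil((n + 1) * coverage)
    if k > n:
        return math.inf
    return residuals[k - 1]


def conservative_ram(prediction_gb: float, margin_gb: float) -> float:
    """Inflate a point prediction by the calibrated margin."""
    return prediction_gb + margin_gb


# Made-up calibration data: measured vs. predicted peak RAM in GB.
measured = [10, 12, 9, 15, 11, 14, 13, 10, 16, 12]
predicted = [9, 12, 10, 13, 11, 13, 14, 10, 14, 9]
margin = conformal_margin(measured, predicted, coverage=0.8)
print(margin, conservative_ram(11.0, margin))
```

A larger coverage level yields a larger margin, trading some concurrency for fewer overcommits.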