Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows
Daniel Mas Montserrat, Ray Verma, Míriam Barrabés, Francisco M. de la Vega, Carlos D. Bustamante, Alexander G. Ioannidis
TL;DR
Large-scale precision medicine genomic pipelines process tens to hundreds of gigabytes per sample, causing memory spikes and out-of-memory failures under naive resource allocation. The authors introduce three RAM-aware chromosome-parallelization mechanisms: a static ordering that minimizes peak memory under fixed concurrency, a dynamic knapsack-style packing scheduler with online RAM predictions, and a compact symbolic regression RAM predictor distilled from an ensemble teacher and augmented by a conformal bound for safety. In simulations and real-world Beagle-based workflows, the dynamic scheduler reduces both makespan and memory overcommits, while the symbolic predictor provides strong, deployable priors that enable immediate concurrency; integration into the StrataRisk PRS pipeline yields substantial speedups and cost savings. Together, these methods enable memory-safe, scalable chromosome-level genomics across heterogeneous compute environments, with practical impact on turnaround times and clinical decision support.
Abstract
Large-scale genomic workflows used in precision medicine can process datasets spanning tens to hundreds of gigabytes per sample, leading to large memory spikes, intensive disk I/O, and task failures due to out-of-memory errors. Simple static resource allocation methods struggle to handle the variability in per-chromosome RAM demands, resulting in poor resource utilization and long runtimes. In this work, we propose multiple mechanisms for adaptive, RAM-efficient parallelization of chromosome-level bioinformatics workflows. First, we develop a symbolic regression model that estimates per-chromosome memory consumption for a given task and introduces an interpolating bias to conservatively minimize over-allocation. Second, we present a dynamic scheduler that adaptively predicts RAM usage with a polynomial regression model and treats task packing as a knapsack problem, batching jobs optimally with respect to their predicted memory requirements. Third, we present a static scheduler that optimizes the chromosome processing order to minimize peak memory while preserving throughput. Our proposed methods, evaluated on simulations and real-world genomic pipelines, provide new mechanisms to reduce memory overruns and balance load across threads. We thereby achieve faster end-to-end execution, showcasing the potential to optimize large-scale genomic workflows.
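
The knapsack view of task packing mentioned above can be illustrated with a minimal sketch. It is not the authors' scheduler: it assumes that predicted RAM is given in whole gigabytes, that the objective is simply to fill the currently free RAM as fully as possible (value equals weight), and that the function name `pack_batch` and the example figures are hypothetical.

```python
from typing import Dict, List, Tuple


def pack_batch(predicted_ram_gb: Dict[str, int], free_ram_gb: int) -> List[str]:
    """Choose pending chromosome tasks that best fill the free RAM.

    Illustrative 0/1 knapsack where each task's value equals its predicted
    RAM (an assumption of this sketch, not a detail taken from the paper).
    """
    # best[c] = (RAM used, chosen tasks) for the best subset within capacity c
    best: List[Tuple[int, List[str]]] = [(0, [])] * (free_ram_gb + 1)
    for name, ram in predicted_ram_gb.items():
        # Iterate capacities downward so each task is used at most once
        for cap in range(free_ram_gb, ram - 1, -1):
            used = best[cap - ram][0] + ram
            if used > best[cap][0]:
                best[cap] = (used, best[cap - ram][1] + [name])
    return best[free_ram_gb][1]


# Hypothetical per-chromosome predictions (GB) and a 16 GB budget
pending = {"chr1": 12, "chr2": 11, "chr21": 3, "chr22": 3}
batch = pack_batch(pending, free_ram_gb=16)
print(batch, sum(pending[c] for c in batch))
```

In a dynamic setting, a routine like this would be re-run whenever a task finishes and RAM is released, using the latest predictions for the tasks still pending.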
