Rapid GPU-Based Pangenome Graph Layout
Jiajie Li, Jan-Niklas Schmelzle, Yixiao Du, Simon Heumos, Andrea Guarracino, Giulia Guidi, Pjotr Prins, Erik Garrison, Zhiru Zhang
TL;DR
This work tackles the costly problem of pangenome graph layout by introducing a GPU-accelerated solution for the path-guided SGD layout algorithm. It identifies the workload's data-level parallelism and memory-bound nature, and delivers three targeted optimizations (a cache-friendly data layout, coalesced random states, and warp merging) along with a scalable quality metric called sampled path stress. On 24 human chromosomal pangenomes, the approach achieves an average speedup of $57.3\times$ over a 32-core CPU baseline, reducing runtime from hours to minutes while preserving layout quality, as indicated by the correlation between $path\_stress$ and $sampled\_path\_stress$ across test cases. The work includes an ablation study and a case study of performance-quality trade-offs, and it is designed to be integrated into the ODGI framework for broad adoption and interactive pangenome visualization.
Abstract
Computational pangenomics is an emerging field that studies genetic variation using a graph structure encompassing multiple genomes. Visualizing pangenome graphs is vital for understanding genome diversity, yet large graphs are challenging to handle because the graph layout process is computationally demanding. In this work, we conduct a thorough performance characterization of a state-of-the-art pangenome graph layout algorithm, revealing significant data-level parallelism, which makes GPUs a promising option for compute acceleration. However, irregular data access and the algorithm's memory-bound nature present significant hurdles. To overcome these challenges, we develop a solution implementing three key optimizations: a cache-friendly data layout, coalesced random states, and warp merging. Additionally, we propose a quantitative metric for scalable evaluation of pangenome layout quality. Evaluated on 24 human whole-chromosome pangenomes, our GPU-based solution achieves a $57.3\times$ speedup over the state-of-the-art multithreaded CPU baseline without layout quality loss, reducing execution time from hours to minutes.
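
To make the idea of a sampled, path-based quality metric concrete, the following is a minimal sketch and not the authors' implementation. It assumes that stress compares the Euclidean distance between two nodes in the layout with their nucleotide distance along a shared path, and that the sampled variant averages this term over randomly drawn step pairs instead of all pairs. The function names, the one-point-per-node simplification, the uniform sampling scheme, and the `samples_per_path` parameter are all hypothetical choices made for illustration.

```python
import math
import random


def path_offsets(path, node_len):
    """Cumulative nucleotide offset of each step along one path."""
    offsets, pos = [], 0
    for node in path:
        offsets.append(pos)
        pos += node_len[node]
    return offsets


def sampled_path_stress(paths, node_len, coords, samples_per_path=100, seed=0):
    """Mean of ((layout_dist - path_dist) / path_dist)^2 over sampled step pairs.

    Assumption: each node is a single 2D point; a real layout may place
    both ends of a node separately.
    """
    rng = random.Random(seed)
    total, count = 0.0, 0
    for path in paths:
        if len(path) < 2:
            continue  # no pair of steps to compare
        offsets = path_offsets(path, node_len)
        for _ in range(samples_per_path):
            # Draw two distinct steps of the same path, in path order.
            i, j = sorted(rng.sample(range(len(path)), 2))
            d_path = offsets[j] - offsets[i]
            if d_path == 0:
                continue  # skip degenerate pairs (zero-length nodes)
            (x1, y1), (x2, y2) = coords[path[i]], coords[path[j]]
            d_layout = math.hypot(x2 - x1, y2 - y1)
            total += ((d_layout - d_path) / d_path) ** 2
            count += 1
    return total / count if count else 0.0


# Toy example: one path over three nodes with lengths 3, 2, and 5.
node_len = {0: 3, 1: 2, 2: 5}
paths = [[0, 1, 2]]
ideal = {0: (0.0, 0.0), 1: (3.0, 0.0), 2: (5.0, 0.0)}      # matches path offsets
collapsed = {0: (0.0, 0.0), 1: (0.0, 0.0), 2: (0.0, 0.0)}  # all nodes stacked
print(sampled_path_stress(paths, node_len, ideal))
print(sampled_path_stress(paths, node_len, collapsed))
```

Under these assumptions, a layout that places nodes exactly at their path offsets scores zero, and the score grows as layout distances drift from path distances. Because the cost depends on the number of samples rather than on the number of node pairs, a metric of this shape stays tractable on large graphs.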
