High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism
Yifan Li, Giulia Guidi
TL;DR
HySortK tackles the bottleneck of distributed $k$-mer counting by replacing hash-table-based approaches with a sorting-based pipeline that uses in-place radix sorting and a one-pass exchange. The method introduces a domain-specific supermer strategy and a task-based abstraction layer that enable flexible hybrid MPI+OpenMP parallelism, improving load balancing and reducing communication. It achieves 2-10× speedups over the GPU baseline and up to 2× over state-of-the-art CPU software, while reducing peak memory usage by up to 30% and delivering end-to-end pipeline speedups (e.g., up to 1.8× in ELBA). Experiments across diverse datasets and hardware show robust strong and weak scaling and practical applicability in real-world genome assembly workflows.
Abstract
High-throughput sequencing technologies generate large quantities of DNA data and therefore require advanced bioinformatics infrastructures for efficient data analysis. k-mer counting, the process of quantifying the frequency of DNA subsequences of fixed length k, is a fundamental step in various bioinformatics pipelines, including genome assembly and protein prediction. Due to the growing volume of data, scaling the counting process is critical. Existing distributed memory software relies on hash tables, which exhibit poor cache friendliness and consume excessive memory. These tools often also lack support for flexible parallelism, which makes integration into existing bioinformatics pipelines difficult. In this work, we propose HySortK, a highly efficient sorting-based distributed memory k-mer counter. HySortK reduces the communication volume through a carefully designed communication scheme and domain-specific optimization strategies. Furthermore, we introduce an abstract task layer for flexible hybrid parallelism to address load imbalances in different scenarios. HySortK achieves a 2-10× speedup compared to the GPU baseline on 4 and 8 nodes. Compared to state-of-the-art CPU software, HySortK achieves up to 2× speedup while reducing peak memory usage by 30% on 16 nodes. Finally, we integrated HySortK into an existing genome assembly pipeline and achieved up to 1.8× speedup, demonstrating its flexibility and practicality in real-world scenarios.
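
To make the idea of sorting-based counting concrete, the following is a minimal single-process sketch, not HySortK's implementation. It assumes plain string k-mers, uses Python's built-in comparison sort in place of a radix sort, and omits the distributed exchange, supermers, and canonical (reverse-complement) k-mers. The function names `kmers` and `count_kmers_by_sorting` are hypothetical and chosen for illustration only.

```python
from itertools import groupby


def kmers(read, k):
    """Yield every length-k substring of a read (illustrative helper)."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers_by_sorting(reads, k):
    """Count k-mers by sorting them and measuring runs of equal neighbours.

    Assumption: Python's built-in sort stands in for the radix sort
    described in the text; no hash table is used at any point.
    """
    all_kmers = sorted(km for read in reads for km in kmers(read, k))
    # After sorting, identical k-mers are adjacent, so one linear scan
    # over the sorted list yields every (k-mer, count) pair.
    return [(km, sum(1 for _ in run)) for km, run in groupby(all_kmers)]


counts = count_kmers_by_sorting(["ACGTACG"], k=3)
print(counts)  # [('ACG', 2), ('CGT', 1), ('GTA', 1), ('TAC', 1)]
```

Because equal k-mers end up next to each other after the sort, counting reduces to a sequential scan over contiguous memory, which is the cache-friendliness argument made against hash tables above.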
