High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism

Yifan Li; Giulia Guidi

TL;DR

HySortK tackles the bottleneck of distributed $k$-mer counting by replacing hash-table-based approaches with a sorting-based pipeline that uses in-place radix sorting and a one-pass exchange. It introduces a domain-specific supermer strategy and a task-based abstraction layer that enables flexible hybrid MPI+OpenMP parallelism, improving load balance and reducing communication. HySortK achieves a 2-10× speedup over the GPU baseline and up to 2× over state-of-the-art CPU software, while reducing peak memory usage by up to 30%. Integrated into the ELBA genome assembly pipeline, it delivers an end-to-end speedup of up to $1.8\times$. Experiments across diverse datasets and hardware show good strong and weak scaling and practical applicability in real-world genome assembly workflows.

Abstract

High-throughput sequencing technologies generate large quantities of DNA data and therefore require advanced bioinformatics infrastructures for efficient data analysis. k-mer counting, the process of quantifying the frequency of DNA subsequences of fixed length k, is a fundamental step in various bioinformatics pipelines, including genome assembly and protein prediction. Due to the growing volume of data, scaling the counting process is critical. Existing distributed memory software uses hash tables, which exhibit poor cache friendliness and consume excessive memory. These tools often also lack support for flexible parallelism, which makes integration into existing bioinformatics pipelines difficult. In this work, we propose HySortK, a highly efficient sorting-based distributed memory k-mer counter. HySortK reduces the communication volume through a carefully designed communication scheme and domain-specific optimization strategies. Furthermore, we introduce an abstract task layer for flexible hybrid parallelism to address load imbalances in different scenarios. HySortK achieves a 2-10x speedup compared to the GPU baseline on 4 and 8 nodes. Compared to state-of-the-art CPU software, HySortK achieves up to 2x speedup while reducing peak memory usage by 30% on 16 nodes. Finally, we integrated HySortK into an existing genome assembly pipeline and achieved up to 1.8x speedup, demonstrating its flexibility and practicality in real-world scenarios.
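
To make the core idea concrete, the following is a minimal single-process sketch of sorting-based k-mer counting: extract every length-k window, sort so that identical k-mers become adjacent, then count each run in one linear scan. It is an illustration only, not HySortK's implementation. The function name `count_kmers_by_sorting` is hypothetical, Python's built-in sort stands in for the in-place radix sort, and the distributed exchange, supermers, and canonical (reverse-complement) k-mers are all omitted.

```python
from itertools import groupby


def count_kmers_by_sorting(reads, k):
    """Count k-mers by sorting them and scanning runs of equal neighbours.

    Illustrative sketch only: uses Python's built-in sort rather than a
    radix sort, runs in a single process, and does not canonicalize k-mers.
    """
    # Extract every length-k window from every read (reads shorter than k
    # contribute nothing because the range is empty).
    kmers = [read[i:i + k] for read in reads for i in range(len(read) - k + 1)]

    # Sorting places identical k-mers next to each other ...
    kmers.sort()

    # ... so one linear pass over the sorted list yields all counts,
    # with no hash table involved.
    return {kmer: sum(1 for _ in run) for kmer, run in groupby(kmers)}


counts = count_kmers_by_sorting(["ACGTAC", "CGTACG"], 3)
```

Because the counting step is a sequential scan over sorted data, its memory access pattern is contiguous, which is the cache-friendliness advantage over hash tables that the abstract refers to.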

Paper Structure (25 sections, 10 figures, 4 tables)

Introduction
Background
$k$-mer Counting
Parallel $k$-mer Counting
Parallel Sorting
Supermer
Methodologies
Sorting-Based $k$-mer Counting
Optimized Supermer Strategy
Communication Optimization
Communication and computation overlap
Data compression
Task Abstraction Layer
Optimized Load Balance
Experimental Results
...and 10 more sections

Figures (10)

Figure 1: Overview of the common paradigm for distributed memory $k$-mer counting pipelines using hash-tables.
Figure 2: Overview of sorting-based $k$-mer counting using the supermer strategy.
Figure 3: Overview of HySortK paradigm for distributed memory $k$-mer counting with hybrid task parallelism.
Figure 4: HySortK strong scaling performance on the H. sapiens 10x dataset using $k=31$.
Figure 5: HySortK weak scaling performance using $k=31$.
...and 5 more figures
