Parallel Integer Sort: Theory and Practice

Xiaojun Dong; Laxman Dhulipala; Yan Gu; Yihan Sun

TL;DR

The paper tackles parallel integer sorting (IS) by bridging theory and practice. It introduces DovetailSort (DTSort), a stable, most-significant-digit (MSD) algorithm that detects duplicate keys by sampling, separates heavy keys from light ones, and then interleaves the heavy and light buckets with a dovetail merge. The authors prove that a broad class of practical MSD integer sorting algorithms achieves $O(n\sqrt{\log r})$ work, with span between polylogarithmic and $\tilde{O}(2^{\sqrt{\log r}})$; DTSort attains $O(n\sqrt{\log r})$ work and $\tilde{O}(2^{\sqrt{\log r}})$ span, and $O(n)$ work for certain input distributions. Empirically, DTSort matches or outperforms state-of-the-art parallel integer and comparison sorts on synthetic and real-world data, especially when there are many duplicate keys, while scaling up to hundreds of cores. Overall, the work offers both a theoretical foundation for MSD integer sorting and a practical, robust sorter for large-scale integer data.
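The pipeline summarized above (sample, split heavy from light keys, recurse on light buckets, then interleave) can be illustrated with a minimal sequential sketch. It is not the authors' implementation: the function names, the $\sqrt{n}$ sample count, the base-case size, and the plain sequential interleave standing in for the parallel dovetail merge are all assumptions made for illustration. The only detail borrowed from the source is the simplified rule in the Figure 2 caption, where a key with 2 or more samples is treated as heavy.

```python
import random
from collections import Counter


def interleave(sorted_light, heavy_runs):
    """Merge a sorted light bucket with heavy runs (one key per run, runs in key order).

    Heavy keys never occur in the light bucket, so this is a pure interleave.
    A simple sequential stand-in for the paper's dovetail merge.
    """
    out, i = [], 0
    for run in heavy_runs:
        heavy_key = run[0][0]
        while i < len(sorted_light) and sorted_light[i][0] < heavy_key:
            out.append(sorted_light[i])
            i += 1
        out.extend(run)
    out.extend(sorted_light[i:])
    return out


def dtsort_sketch(records, key_bits, gamma=2, base_size=8, rng=None):
    """Stable sort of (key, value) pairs whose keys agree above the low `key_bits` bits.

    gamma is the number of bits consumed per level (hypothetical default).
    """
    rng = rng or random.Random(0)
    n = len(records)
    if n <= base_size or key_bits == 0:
        return sorted(records, key=lambda kv: kv[0])  # stable base case

    # Step 1: sampling. Keys seen at least twice among ~sqrt(n) samples are heavy
    # (sample count is an assumption, not the paper's scheme).
    sample_count = max(1, int(n ** 0.5))
    samples = Counter(rng.choice(records)[0] for _ in range(sample_count))
    heavy = {k for k, c in samples.items() if c >= 2}

    # Step 2: distributing. Each MSD zone gets one light bucket; each heavy key
    # gets its own bucket. Appending in input order keeps the sort stable.
    digit_bits = min(gamma, key_bits)
    shift = key_bits - digit_bits
    zones = 1 << digit_bits
    zone_of = lambda k: (k >> shift) & (zones - 1)
    light = [[] for _ in range(zones)]
    heavy_buckets = {k: [] for k in heavy}
    for rec in records:
        if rec[0] in heavy:
            heavy_buckets[rec[0]].append(rec)
        else:
            light[zone_of(rec[0])].append(rec)

    # Steps 3 and 4: recurse on light buckets only (heavy buckets hold a single
    # key and need no sorting), then interleave within each zone.
    out = []
    for z in range(zones):
        sorted_light = dtsort_sketch(light[z], shift, gamma, base_size, rng)
        runs = [heavy_buckets[k] for k in sorted(heavy) if zone_of(k) == z]
        out.extend(interleave(sorted_light, runs))
    return out
```

The sketch skips recursion for heavy buckets entirely, which is where duplicate-heavy inputs would save work; a real implementation would also distribute and merge in parallel rather than with Python lists.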

Abstract

Integer sorting is a fundamental problem in computer science. This paper studies parallel integer sort both in theory and in practice. In theory, we show tighter bounds for a class of existing practical integer sort algorithms, which provides a solid theoretical foundation for their widespread usage in practice and strong performance. In practice, we design a new integer sorting algorithm, \textsf{DovetailSort}, that is theoretically-efficient and has good practical performance. In particular, \textsf{DovetailSort} overcomes a common challenge in existing parallel integer sorting algorithms, which is the difficulty of detecting and taking advantage of duplicate keys. The key insight in \textsf{DovetailSort} is to combine algorithmic ideas from both integer- and comparison-sorting algorithms. In our experiments, \textsf{DovetailSort} achieves competitive or better performance than existing state-of-the-art parallel integer and comparison sorting algorithms on various synthetic and real-world datasets.

Paper Structure (25 sections, 7 theorems, 36 figures, 4 tables, 3 algorithms)

Introduction
Preliminaries
Notations
Computational Models
Sorting and Integer Sorting
Counting Sort (aka. Distribution)
Comparison Sort, Semisort, and Sampling
The DovetailSort Algorithm
Step 1: Sampling
Step 2: Distributing
Step 3: Recursing
Step 4: Dovetail Merging
Base Cases
The Theory of Parallel MSD Sort
The Analysis of the General MSD Sort
...and 10 more sections

Key Result

Theorem 4.1

There exists an unstable parallel MSD sorting algorithm with $O(n\sqrt{\log r})$ work and $O(\log r + \sqrt{\log r}\log n)$ span with high probability (whp).
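
As a rough worked instance of these bounds, take the setting of Figure 1 ($10^9$ records with 32-bit keys), assuming $n$ denotes the number of records, $r = 2^{32}$ the key range, and base-2 logarithms. Constants hidden by the $O(\cdot)$ notation are ignored, so the numbers indicate scale only:

$$
\sqrt{\log_2 r} = \sqrt{32} \approx 5.66, \qquad \log_2 n = \log_2 10^9 \approx 29.9, \qquad \log_2 r + \sqrt{\log_2 r}\,\log_2 n \approx 32 + 5.66 \times 29.9 \approx 201.
$$

Under these assumptions the work term scales like $5.66\,n$, compared with roughly $29.9\,n$ for an $O(n \log n)$ comparison sort, while the span term stays in the low hundreds.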

Figures (36)

Figure 1: Heatmap comparing sorting algorithms on $10^9$ records with 32-bit keys and 32-bit values. All numbers are running times relative to the best for each input. Raw data are in Table \ref{tab:synthetic}. The baseline algorithms are described in Table \ref{tab:baseline}.
Figure 2: An overview of DTSort. Here $r=16, \gamma=2$. For simplicity and to save space, the sampling scheme in the figure does not exactly follow the one described in the algorithm; keys with 2 or more samples are simply treated as heavy keys.
Figure 3: Illustration of the dovetail merging step. The example merges the buckets in MSD zone 01 in Figure \ref{fig:sort}. A letter subscript distinguishes different records with the same key.
Figure 4: (a) and (b): Performance of heavy-key detection. Numbers are running times (lower is better) with and without heavy-key detection; (a) is for 32-bit keys and (b) is for 64-bit keys. (c) and (d): Performance of dovetail merging. Numbers are running times (lower is better) using our dovetail merging algorithm or a baseline merging algorithm; (c) is for 32-bit keys and (d) is for 64-bit keys. (e) and (f): Scalability (higher is better) with varying numbers of threads, and running time (lower is better) with varying input sizes, for 32-bit key and 32-bit value pairs on one instance: Zipf-0.8. Full analysis is given in the full paper [dong2024parallelfull], \ref{sec:app-scalability}. Discussions are in \ref{sec:exp-study}.
Figure 5: Self-speedup with varying thread counts of all tested implementations on Unif-$10^7$.
...and 31 more figures

Theorems & Definitions (7)

Theorem 4.1
Lemma 4.2
Lemma 4.3
Theorem 4.4
Theorem 4.5
Theorem 4.6
Theorem 4.7
