Scalable Distributed String Sorting

Florian Kurpicz; Pascal Mehnert; Peter Sanders; Matthias Schimek

TL;DR

String sorting on distributed-memory machines faces scalability limits due to latency and communication overhead. The authors propose a multi-level string sorting framework that recursively partitions processors into groups and performs LCP-aware merging (LCP: longest common prefix), with a prefix-doubling variant that drives per-level work toward $D$, the total length of the distinguishing prefixes. They introduce building blocks such as an $r$-way LCP loser-tree merge, an improved robust hypercube quicksort for strings, and a multi-level Bloom filter, along with analyses showing per-level work and communication close to $N$ (or close to $D$ for the prefix-doubling variant). Empirically, the approach scales to $p=49152$ cores and achieves speedups of up to 5x over prior distributed string sorting methods, enabling efficient large-scale indexing and text-processing pipelines on HPC systems.
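To make the quantities $N$ and $D$ concrete, the following is a minimal single-machine sketch, not the authors' implementation. It assumes $N$ is the total number of characters and takes the distinguishing prefix of a string to be its longest common prefix with any other input string plus one character, capped at the string's own length (an assumption for inputs without an end-of-string terminator). The function names are hypothetical.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def lcp_array(sorted_strings: list[str]) -> list[int]:
    """LCP of each string with its predecessor in sorted order (0 for the first)."""
    return [0] + [
        lcp(sorted_strings[i - 1], sorted_strings[i])
        for i in range(1, len(sorted_strings))
    ]


def distinguishing_prefix_sum(strings: list[str]) -> int:
    """Sum of distinguishing prefix lengths (D) under the assumptions above."""
    s = sorted(strings)
    h = lcp_array(s)
    total = 0
    for i, x in enumerate(s):
        left = h[i]
        right = h[i + 1] if i + 1 < len(s) else 0
        # In sorted order, the largest LCP with any other string is attained
        # by one of the two neighbours.
        total += min(len(x), max(left, right) + 1)
    return total


strings = ["alpha", "alps", "beta", "bet"]
N = sum(len(x) for x in strings)          # total characters
D = distinguishing_prefix_sum(strings)    # characters needed to establish the order
```

For this toy input the two values are close; on inputs with short shared prefixes and long strings, $D$ can be much smaller than $N$, which is the case a prefix-doubling variant is meant to exploit.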

Abstract

String sorting is an important part of tasks such as building index data structures. Unfortunately, current string sorting algorithms do not scale to massively parallel distributed-memory machines since they either have latency (at least) proportional to the number of processors $p$ or communicate the data a large number of times (at least logarithmic). We present practical and efficient algorithms for distributed-memory string sorting that scale to large $p$. Similar to state-of-the-art sorters for atomic objects, the algorithms have latency of about $p^{1/k}$ when allowing the data to be communicated $k$ times. Experiments indicate good scaling behavior on a wide range of inputs on up to 49152 cores. Overall, we achieve speedups of up to 5 over the current state-of-the-art distributed string sorting algorithms.
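The trade-off between latency of about $p^{1/k}$ and communicating the data $k$ times can be illustrated with a small calculation. This is a sketch under the assumption that each of the $k$ levels uses the same fan-out $r$, chosen as the smallest integer with $r^k \ge p$; the function name is hypothetical and the paper's actual group sizes may differ.

```python
def per_level_fanout(p: int, k: int) -> int:
    """Smallest integer r with r**k >= p (integer search avoids float rounding)."""
    r = 1
    while r ** k < p:
        r += 1
    return r


p = 49152
for k in (1, 2, 3):
    r = per_level_fanout(p, k)
    print(f"k={k}: about {r} communication partners per level, data moved {k} time(s)")
```

With a single level, each processor may exchange messages with all $p$ others; with two levels the number of partners per level drops to roughly $\sqrt{p}$, at the price of moving the data twice.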

Paper Structure (10 sections, 1 theorem, 1 figure, 1 table)

This paper contains 10 sections, 1 theorem, 1 figure, 1 table.

Introduction
Related Work.
Our Contribution.
Preliminaries
Machine Model and Communication Primitives.
String Properties and Input Format.
Algorithmic Building Blocks.
Multi-Level String Sorting
Distributed Partitioning
String-Based Partitioning

Key Result

Theorem 1

If all input strings are unique, RQuick runs in time $\mathcal{O}\left( \hat{\ell} \frac{n}{p} \log n + \alpha \log^2p + \beta \left(\frac{n}{p} \hat{\ell} \log p + \hat{\ell} \log^2p \right)\log{\sigma} \right)$ with probability $\ge 1 - p^{-c}$ for any constant $c>0$.
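To see how the three groups of terms in this bound compare for given parameters, the sketch below evaluates each of them with all hidden constants set to 1. It is only an illustration of the asymptotic expression, not a performance model. The parameter meanings are assumptions not stated in this summary: `ell_hat` for $\hat{\ell}$ (taken to be the maximum string length), `alpha` for the message startup cost, `beta` for the per-bit communication cost, and `sigma` for the alphabet size; logarithms are taken to base 2.

```python
import math


def rquick_cost_terms(n, p, ell_hat, sigma, alpha, beta):
    """Evaluate the terms of the Theorem 1 bound with all constants set to 1."""
    log_n = math.log2(n)
    log_p = math.log2(p)
    # Local work term: ell_hat * (n/p) * log n
    local_work = ell_hat * (n / p) * log_n
    # Latency term: alpha * log^2 p
    latency = alpha * log_p ** 2
    # Bandwidth term: beta * ((n/p) * ell_hat * log p + ell_hat * log^2 p) * log sigma
    bandwidth = beta * ((n / p) * ell_hat * log_p + ell_hat * log_p ** 2) * math.log2(sigma)
    return {"local_work": local_work, "latency": latency, "bandwidth": bandwidth}


terms = rquick_cost_terms(n=1024, p=16, ell_hat=8, sigma=256, alpha=1.0, beta=1.0)
```

Varying `p` while holding `n / p` fixed shows the latency term growing only as $\log^2 p$, independent of the input size.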

Figures (1)

Figure 1: Overview of the main steps in the multi-level string sorting scheme with $k=2$ levels.

Theorems & Definitions (1)

Theorem 1 (String RQuick; cf. DBLP:conf/ipps/Bingmann0S20)
