A Quick and Exact Method for Distributed Quantile Computation

Ivan Cao, Jaromir J. Saloni, David A. G. Harrison

TL;DR

This work tackles exact quantile computation in distributed data processing by introducing GK Select, a method that uses an approximate GK pivot to guide an exact, partition-based selection in Spark. GK Select achieves exact results while constraining communication to a constant number of broadcast/reduce stages, avoiding the full data shuffle of global sorts. The approach combines a GK Sketch-derived pivot with partition-level QuickPartition and a tree-reduced merge of candidate values, yielding time complexity comparable to GK Sketch on executors and favorable driver costs. Empirically, GK Select matches the latency of GK Sketch and outperforms Spark’s full sort by roughly 10.5x on 10^9 values across 120 partitions on a 30-core AWS EMR cluster, illustrating practical impact for large-scale exact quantile queries.

Abstract

Quantile computation is a core primitive in large-scale data analytics. In Spark, practitioners typically rely on the Greenwald-Khanna (GK) Sketch, an approximate method. When exact quantiles are required, the default option is an expensive global sort. We present GK Select, an exact Spark algorithm that avoids full-data shuffles and completes in a constant number of actions. GK Select leverages GK Sketch to identify a near-target pivot, extracts all values within the error bound around this pivot in each partition in linear time, and then tree-reduces the resulting candidate sets. We show analytically that GK Select matches the executor-side time complexity of GK Sketch while returning the exact quantile. Empirically, GK Select achieves sketch-level latency and outperforms Spark's full sort by approximately 10.5x on 10^9 values across 120 partitions on a 30-core AWS EMR cluster.
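The pivot-then-select idea described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' Spark implementation, and it rests on several assumptions: partitions are modeled as plain Python lists, the hypothetical helper `approx_pivot` uses a sorted random sample as a stand-in for GK Sketch, `heapq.nsmallest`/`heapq.nlargest` stand in for the linear-time per-partition extraction, and a flat merge stands in for the tree reduction. The names `approx_pivot` and `exact_quantile` are illustrative only.

```python
import heapq
import random


def approx_pivot(partitions, q, sample_size=1000, seed=0):
    """Stand-in for GK Sketch (assumption): the q-quantile of a random sample."""
    rng = random.Random(seed)
    flat = [x for part in partitions for x in part]
    sample = sorted(rng.sample(flat, min(sample_size, len(flat))))
    return sample[min(int(q * len(sample)), len(sample) - 1)]


def exact_quantile(partitions, q):
    """Return the value at 0-based rank min(int(q * n), n - 1) of the full data."""
    n = sum(len(part) for part in partitions)
    k = min(int(q * n), n - 1)
    pivot = approx_pivot(partitions, q)

    # Pass 1: each partition counts values below / equal to the pivot;
    # summing the counts gives the pivot's exact global rank range.
    below = sum(sum(1 for x in part if x < pivot) for part in partitions)
    equal = sum(sum(1 for x in part if x == pivot) for part in partitions)
    if below <= k < below + equal:
        return pivot  # the approximate pivot already sits at the target rank

    if k < below:
        # Target is the d-th largest value strictly below the pivot.
        d = below - k
        candidates = [heapq.nlargest(d, (x for x in part if x < pivot))
                      for part in partitions]
        merged = heapq.nlargest(d, (x for c in candidates for x in c))
    else:
        # Target is the d-th smallest value strictly above the pivot.
        d = k - (below + equal) + 1
        candidates = [heapq.nsmallest(d, (x for x in part if x > pivot))
                      for part in partitions]
        merged = heapq.nsmallest(d, (x for c in candidates for x in c))
    return merged[-1]


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.randint(0, 10**6) for _ in range(10_000)]
    parts = [data[i::8] for i in range(8)]  # 8 simulated partitions
    print(exact_quantile(parts, 0.5) == sorted(data)[len(data) // 2])
```

In this sketch, the closer the pivot lands to the target rank, the smaller `d` is and the fewer candidates each partition has to hand back, which is why a sketch with a bounded rank error keeps the merged candidate sets small.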

Paper Structure

This paper contains 27 sections, 56 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Performance with 10 cores
  • Figure 2: Performance with 30 cores