Workload-Aware Incremental Reclustering in Cloud Data Warehouses

Yipeng Liu, Renfei Zhou, Jiaqi Yan, Huanchen Zhang

TL;DR

This paper presents WAIR, a workload-aware algorithm that identifies and reclusters only the boundary micro-partitions most critical for pruning efficiency. It achieves near-optimal query performance while incurring significantly lower reclustering cost, with a theoretical upper bound.

Abstract

Modern cloud data warehouses store data in micro-partitions and rely on metadata (e.g., zonemaps) for efficient data pruning during query processing. Maintaining data clustering in a large-scale table is crucial for effective data pruning. Existing automatic clustering approaches lack the flexibility required in dynamic cloud environments with continuous data ingestion and evolving workloads. This paper advocates a clean separation between reclustering policy and clustering-key selection. We introduce the concept of boundary micro-partitions, which sit on the boundary of query ranges. We then present WAIR, a workload-aware algorithm that identifies and reclusters only the boundary micro-partitions most critical for pruning efficiency. WAIR achieves near-optimal query performance (with respect to fully sorted table layouts) while incurring significantly lower reclustering cost, with a theoretical upper bound. We further implement the algorithm in a prototype reclustering service and evaluate it on standard benchmarks (TPC-H, DSB) and a real-world workload. Results show that WAIR improves query performance and reduces the overall cost compared to existing solutions.
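
The pruning and boundary notions in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `MicroPartition` class, the integer clustering key, and the exact boundary test (a partition that must be scanned but whose value range extends past a query edge) are assumptions made here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroPartition:
    """Hypothetical stand-in: only the zonemap (min/max) of one clustering key."""
    name: str
    lo: int
    hi: int


def overlaps(p: MicroPartition, q_lo: int, q_hi: int) -> bool:
    # Zonemap pruning: a partition can be skipped only if its [lo, hi]
    # range is disjoint from the query range [q_lo, q_hi].
    return p.lo <= q_hi and q_lo <= p.hi


def is_boundary(p: MicroPartition, q_lo: int, q_hi: int) -> bool:
    # Assumed definition: the partition must be scanned, but part of its
    # value range lies outside the query, i.e. it straddles a query edge.
    return overlaps(p, q_lo, q_hi) and (p.lo < q_lo or p.hi > q_hi)


partitions = [
    MicroPartition("p0", 0, 9),
    MicroPartition("p1", 5, 25),
    MicroPartition("p2", 20, 30),
    MicroPartition("p3", 28, 45),
    MicroPartition("p4", 50, 60),
]
q_lo, q_hi = 18, 35

scanned = [p.name for p in partitions if overlaps(p, q_lo, q_hi)]
boundary = [p.name for p in partitions if is_boundary(p, q_lo, q_hi)]
print(scanned)   # ['p1', 'p2', 'p3']
print(boundary)  # ['p1', 'p3']
```

In this toy layout, `p2` lies entirely inside the query range, so reclustering it cannot reduce the data scanned for this query. Under the assumed definition, only `p1` and `p3` carry rows that a better layout could prune.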

Paper Structure

This paper contains 34 sections, 5 theorems, 9 equations, 18 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

We say an output $P[c_j, d_j]$ is matched to an input $P[a_i, b_i]$ if $[c_j, d_j] \subseteq [a_i, b_i]$. Then there exists a matching of size $k - 3$ between the input and output micro-partitions (i.e., there are at most 3 unmatched output micro-partitions).
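
The containment-based matching in Lemma 1 can be made concrete with a small Python sketch. This excerpt defines neither $k$ nor how the output micro-partitions are produced, so the code only illustrates the matching notion itself. It assumes the matching is one-to-one and uses a standard augmenting-path search; the function names and example intervals are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def contained(inner: Interval, outer: Interval) -> bool:
    # [c, d] is a subset of [a, b] exactly when a <= c and d <= b.
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def max_matching(inputs: List[Interval], outputs: List[Interval]) -> int:
    """Size of a maximum matching in which each output is paired with at most
    one input that contains it, and each input is used at most once."""
    owner = [-1] * len(inputs)  # owner[i] = output currently matched to input i

    def try_assign(j: int, seen: List[bool]) -> bool:
        # Augmenting-path step: find an input for output j, re-routing
        # previously matched outputs if necessary.
        for i, inp in enumerate(inputs):
            if contained(outputs[j], inp) and not seen[i]:
                seen[i] = True
                if owner[i] == -1 or try_assign(owner[i], seen):
                    owner[i] = j
                    return True
        return False

    return sum(try_assign(j, [False] * len(inputs)) for j in range(len(outputs)))


inputs = [(0, 10), (5, 20), (15, 30)]
outputs = [(0, 4), (5, 12), (13, 30)]
print(max_matching(inputs, outputs))  # 2
```

Here `(13, 30)` is contained in no input range, so it stays unmatched. The lemma bounds the number of such unmatched outputs by three.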

Figures (18)

  • Figure 1: Micro-partitions in Cloud Data Warehouses. A table is structured as multiple micro-partitions stored in the object storage layer. Each micro-partition comprises both metadata (e.g., zonemaps) and data blocks. During query execution, the metadata is used to prune irrelevant partitions.
  • Figure 2: Benefits of Clustering. The query predicated on Age must scan all partitions in a naturally ordered table, while clustering by Age allows pruning all but the first partition.
  • Figure 3: Overlapping Depth and Width. Number of micro-partitions overlapping a given micro-partition (width) and a given point (depth). By reclustering, both the overlapping width and depth decrease from three to one.
  • Figure 4: System Model with Reclustering. (a) An existing data pool containing previously stored data. (b) Incoming data is partitioned and ingested in its natural order. (c) Query workloads use metadata to fetch only relevant partitions for processing. (d) Partitions are fetched, sorted, and then persisted back into storage.
  • Figure 5: Partitions on the Boundary. Boundary partitions overlap with query range edges and primarily affect pruning efficiency.
  • ...and 13 more figures
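
The overlapping width and depth described in the Figure 3 caption can be computed from zonemap ranges alone. The following Python sketch is illustrative only: the interval values are made up, and it assumes that a partition counts itself toward its own width, which is one reading consistent with both metrics dropping "from three to one".

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def intersects(a: Interval, b: Interval) -> bool:
    # Two closed ranges intersect unless one ends before the other starts.
    return a[0] <= b[1] and b[0] <= a[1]


def overlap_width(target: Interval, parts: List[Interval]) -> int:
    # Number of partitions whose range intersects the target partition's range
    # (the target itself is included when it is part of `parts`).
    return sum(intersects(target, p) for p in parts)


def overlap_depth(point: int, parts: List[Interval]) -> int:
    # Number of partitions whose range contains the given key value.
    return sum(lo <= point <= hi for lo, hi in parts)


before = [(0, 10), (2, 8), (5, 12)]   # three mutually overlapping partitions
after = [(0, 4), (5, 8), (9, 12)]     # same key span after reclustering

print(overlap_width(before[0], before), overlap_depth(6, before))  # 3 3
print(overlap_width(after[1], after), overlap_depth(6, after))     # 1 1
```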

Theorems & Definitions (8)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Theorem 4
  • Theorem 5