DriftBench: Defining and Generating Data and Query Workload Drift for Benchmarking

Guanli Liu, Renata Borovica-Gajic

TL;DR

The paper tackles the lack of principled, reproducible drift in benchmarks by formalizing data drift and workload drift, introducing four representative operations for each, plus four temporal drift patterns. It then presents DriftBench, a modular framework that ingests tabular data (CSV or PostgreSQL), extracts the schema, and generates drifted data and workload templates along with timestamped drift scenarios. Through case studies on a census dataset, it demonstrates drift injection for data and query workloads and assesses drift-aware cardinality estimation using multiple estimators. The work provides a practical, extensible foundation for evaluating database components under dynamic, time-evolving conditions, with potential impact on caching, indexing, and learned components across systems.

Abstract

Data and workload drift are key to evaluating database components such as caching, cardinality estimation, indexing, and query optimization. Yet, existing benchmarks are static, offering little to no support for modeling drift. This limitation stems from the lack of clear definitions and tools for generating data and workload drift. Motivated by this gap, we propose a unified taxonomy for data and workload drift, grounded in observations from both academia and industry. Building on this foundation, we introduce DriftBench, a lightweight and extensible framework for generating data and workload drift in benchmark inputs. Together, the taxonomy and DriftBench provide a standardized vocabulary and mechanism for modeling and generating drift in benchmarking. We demonstrate their effectiveness through case studies involving data drift, workload drift, and drift-aware cardinality estimation.
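To make the idea of timestamped data drift concrete, the following is a minimal sketch, not DriftBench's actual API. It infers a simple schema from tabular rows, then produces a sequence of timestamped snapshots in which a growing fraction of rows has one numeric column shifted. All names here (`infer_schema`, `drift_rows`, `linear_schedule`, `make_scenario`) are hypothetical, and the value shift and linear ramp are chosen purely for illustration; this summary does not enumerate the paper's four drift operations or four temporal patterns, so the sketch should not be read as one of them.

```python
import random


def infer_schema(rows):
    """Label each column 'numeric' or 'categorical' based on its values."""
    schema = {}
    for col in rows[0]:
        try:
            for r in rows:
                float(r[col])
            schema[col] = "numeric"
        except (TypeError, ValueError):
            schema[col] = "categorical"
    return schema


def drift_rows(rows, column, intensity, shift, rng):
    """Copy rows, shifting `column` by `shift` in roughly an `intensity` fraction of them."""
    out = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        delta = shift if rng.random() < intensity else 0.0
        r[column] = float(r[column]) + delta
        out.append(r)
    return out


def linear_schedule(steps):
    """Illustrative temporal pattern: intensity ramps linearly from 0 to 1."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [t / (steps - 1) for t in range(steps)]


def make_scenario(rows, column, shift, steps, seed=0):
    """Return a list of (timestamp, drifted_rows) pairs; seeded for reproducibility."""
    if infer_schema(rows).get(column) != "numeric":
        raise ValueError(f"column {column!r} is not numeric")
    rng = random.Random(seed)
    return [
        (t, drift_rows(rows, column, w, shift, rng))
        for t, w in enumerate(linear_schedule(steps))
    ]
```

Fixing the seed keeps the generated scenario reproducible, which matches the summary's emphasis on reproducible drift; a real generator would also need to cover workload drift, which this sketch omits.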

Paper Structure

This paper contains 21 sections, 12 figures, 1 table.

Figures (12)

  • Figure 1: Illustration of data drift.
  • Figure 2: Illustration of workload drift across time.
  • Figure 3: Four temporal drift patterns.
  • Figure 4: Architecture of DriftBench.
  • Figure 5: Data distributions under cardinality variation.
  • ...and 7 more figures

Theorems & Definitions (2)

  • Definition 1: Data Drift
  • Definition 2: Workload Drift