DriftBench: Defining and Generating Data and Query Workload Drift for Benchmarking
Guanli Liu, Renata Borovica-Gajic
TL;DR
The paper addresses the lack of principled, reproducible drift modeling in benchmarks by formalizing data drift and workload drift, defining four representative operations for each, and identifying four temporal drift patterns. It then presents DriftBench, a modular framework that ingests tabular data (CSV or PostgreSQL), extracts the schema, and generates drifted data and workload templates along with timestamped drift scenarios. Case studies on a census dataset demonstrate drift injection for both data and query workloads and evaluate drift-aware cardinality estimation with multiple estimators. The work provides a practical, extensible foundation for evaluating caching, indexing, and learned database components under dynamic, time-evolving conditions.
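To make the ingest, schema-extraction, and timestamped-scenario steps concrete, here is a minimal Python sketch. It is not DriftBench's API: the summary above does not name the framework's functions, its four drift operations, or its four temporal patterns. The names `infer_schema` and `drift_scenario`, the sample rows, and the "linearly growing value shift" are all hypothetical choices made for illustration.

```python
import csv
import io

# Hypothetical stand-in for a small census-like CSV file.
SAMPLE_CSV = """age,hours,occupation
39,40.0,clerk
50,13.5,exec
38,40.0,cleaner
"""


def load_rows(text):
    """Parse CSV text into a list of dicts (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def infer_schema(rows):
    """Guess a type name per column: 'int', 'float', or 'str'."""
    schema = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        for name, cast in (("int", int), ("float", float)):
            try:
                for v in values:
                    cast(v)
            except ValueError:
                continue  # this type does not fit; try the next one
            schema[col] = name
            break
        else:
            schema[col] = "str"  # neither numeric cast fit every value
    return schema


def drift_scenario(rows, column, max_shift, steps):
    """Yield (timestamp, batch) pairs for an assumed drift pattern.

    The numeric `column` is shifted by an amount that grows linearly
    from 0 at the first step to `max_shift` at the last step. The
    input rows are copied, never modified.
    """
    for t in range(steps):
        shift = max_shift * t / (steps - 1) if steps > 1 else max_shift
        batch = []
        for r in rows:
            drifted = dict(r)
            drifted[column] = float(r[column]) + shift
            batch.append(drifted)
        yield t, batch


if __name__ == "__main__":
    rows = load_rows(SAMPLE_CSV)
    print(infer_schema(rows))
    for t, batch in drift_scenario(rows, "age", max_shift=10, steps=3):
        print(t, [r["age"] for r in batch])
```

Keeping each batch paired with its step index is what makes a scenario replayable: a benchmark harness could feed the batches to a component in timestamp order and measure how its accuracy or latency changes as the shift grows.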
Abstract
Data and workload drift are key to evaluating database components such as caching, cardinality estimation, indexing, and query optimization. Yet, existing benchmarks are static, offering little to no support for modeling drift. This limitation stems from the lack of clear definitions and tools for generating data and workload drift. Motivated by this gap, we propose a unified taxonomy for data and workload drift, grounded in observations from both academia and industry. Building on this foundation, we introduce DriftBench, a lightweight and extensible framework for generating data and workload drift in benchmark inputs. Together, the taxonomy and DriftBench provide a standardized vocabulary and mechanism for modeling and generating drift in benchmarking. We demonstrate their effectiveness through case studies involving data drift, workload drift, and drift-aware cardinality estimation.
