Low Rank Learning for Offline Query Optimization
Zixuan Yi, Yao Tian, Zachary G. Ives, Ryan Marcus
TL;DR
This work tackles offline query optimization by modeling a repetitive workload as a partially observed, low-rank matrix of plan latencies and steering the optimizer through hints chosen with low-overhead linear methods. It introduces LimeQO, an offline exploration framework built on ALS-based matrix completion, with an optional transductive TCNN neural variant; both are designed to minimize offline exploration cost while avoiding regressions. Key contributions include formalizing offline exploration as an active learning problem, handling censored observations from timeouts, and demonstrating substantial workload-latency reductions across multiple benchmarks with minimal offline overhead. The approach is robust to data shifts and DBMS-agnostic, offering a practical alternative to heavyweight neural learned optimizers.
Abstract
Recent deployments of learned query optimizers use expensive neural networks and ad-hoc search policies. To address these issues, we introduce \textsc{LimeQO}, a framework for offline query optimization leveraging low-rank learning to efficiently explore alternative query plans with minimal resource usage. By modeling the workload as a partially observed, low-rank matrix, we predict unobserved query plan latencies using purely linear methods, significantly reducing computational overhead compared to neural networks. We formalize offline exploration as an active learning problem, and present simple heuristics that reduce a 3-hour workload to 1.5 hours after just 1.5 hours of exploration. Additionally, we propose a transductive Tree Convolutional Neural Network (TCNN) that, despite higher computational costs, achieves the same workload reduction with only 0.5 hours of exploration. Unlike previous approaches that place expensive neural networks directly in the query processing ``hot'' path, our approach offers a low-overhead solution and a no-regressions guarantee, all without making assumptions about the underlying DBMS. The code is available in \href{https://github.com/zixy17/LimeQO}{https://github.com/zixy17/LimeQO}.
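
To make the low-rank idea concrete, the following is a minimal sketch of matrix completion by alternating least squares (ALS) on a queries-by-hints latency matrix, followed by a greedy choice of which unobserved plan to run next. It is an illustration under stated assumptions, not the LimeQO implementation: the function names `als_complete` and `pick_next`, the rank, the regularization strength, the toy data, and the greedy selection rule are all hypothetical, and timeouts (censored observations) are not handled.

```python
import numpy as np

def als_complete(latency, observed, rank=2, reg=0.1, iters=50, seed=0):
    """Estimate a (queries x hints) latency matrix from its observed entries.

    Hypothetical helper: fits latency ~ Q @ H.T by alternating ridge regressions.
    """
    rng = np.random.default_rng(seed)
    n_queries, n_hints = latency.shape
    Q = rng.random((n_queries, rank))  # one factor row per query
    H = rng.random((n_hints, rank))    # one factor row per hint
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        # Fix H and solve a small regularized least-squares problem per query.
        for i in range(n_queries):
            A = H[observed[i]]
            Q[i] = np.linalg.solve(A.T @ A + ridge, A.T @ latency[i, observed[i]])
        # Fix Q and do the same per hint.
        for j in range(n_hints):
            A = Q[observed[:, j]]
            H[j] = np.linalg.solve(A.T @ A + ridge, A.T @ latency[observed[:, j], j])
    return Q @ H.T

def pick_next(latency, observed, predicted):
    """Greedy rule (an assumption): pick the unobserved plan with the largest
    predicted improvement over the best latency observed for that query."""
    best = np.where(observed, latency, np.inf).min(axis=1)
    gain = best[:, None] - predicted
    gain[observed] = -np.inf  # never re-run a plan that was already measured
    i, j = np.unravel_index(np.argmax(gain), gain.shape)
    return int(i), int(j)

# Toy rank-1 workload: 4 queries, 3 hints; column 0 plays the default plan.
true = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.5, 2.0])
observed = np.array([[True, True, False],
                     [True, False, True],
                     [True, True, True],
                     [True, False, False]])
latency = np.where(observed, true, 0.0)  # unobserved cells are placeholders
predicted = als_complete(latency, observed)
query, hint = pick_next(latency, observed, predicted)
```

Each ALS step only solves `rank`-by-`rank` linear systems, which is why this style of prediction is cheap compared to training a neural network. In an offline loop, the chosen `(query, hint)` plan would be executed, its measured latency written into `latency`, and the factorization refit.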
