On the feasibility of semantic query metrics
George Fletcher, Peter Wood, Nikolay Yakovets
TL;DR
This work studies how to define semantic distances between relational queries, measuring distance as the length of a path of containment steps. It first proves that no meaningful semantic metric exists for the general class of conjunctive queries (CQs), due to the non-finiteness of maximal containment, and then identifies a nontrivial restriction, 2CQs, for which a containment-based distance is well defined, is a metric, and can be computed. The authors show that maximal containment for 2CQs is decidable in polynomial time and provide a graph-based framework (the MC-graph) for computing distances, although the computation takes exponential time in certain parameters. The result offers a principled, semantics-driven way to measure query similarity within a practically relevant fragment, with potential applications in workload analysis and query optimization, and it sets the stage for further algorithmic and empirical work, including extensions to constraints.
Abstract
We consider the problem of defining semantic metrics for relational database queries. Informally, a semantic query metric for a query language $L$ is a metric function $\delta: L \times L \to \mathbb{N}$ where $\delta(Q_1, Q_2)$ represents the length of a shortest path between queries $Q_1$ and $Q_2$ in a graph. In this graph, nodes are queries from $L$, and edges connect semantically distinct queries where one query is maximally semantically contained in the other. Since query containment is undecidable for first-order queries, we focus on the simpler language of conjunctive queries. We establish that defining a semantic query metric is impossible even for conjunctive queries. Given this impossibility result, we identify a significant subclass of conjunctive queries where such a metric is feasible, and we establish the computational complexity of calculating distances within this language.
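The shortest-path reading of the distance in the abstract can be illustrated with a minimal sketch. It assumes the maximal-containment edges are already known and given as an explicit, finite edge list; deciding which pairs of queries are connected is the hard part and is not attempted here. The node labels `Q1` to `Q5`, the edge list `TOY_EDGES`, and the function name `shortest_path_length` are hypothetical and chosen only for illustration. The search is a plain breadth-first search, not the authors' MC-graph algorithm.

```python
from collections import deque


def shortest_path_length(edges, source, target):
    """Return the number of edges on a shortest path between two nodes.

    `edges` is an iterable of unordered pairs; each pair stands for
    "one query is maximally contained in the other" (assumed given).
    Returns None when no path exists.
    """
    # Build an undirected adjacency map, since distance ignores edge direction.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    if source == target:
        return 0

    # Standard breadth-first search: the first time we reach the target,
    # we have used the fewest possible edges.
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour in seen:
                continue
            if neighbour == target:
                return dist + 1
            seen.add(neighbour)
            frontier.append((neighbour, dist + 1))
    return None


# Hypothetical toy graph: the labels are placeholders, not real queries.
TOY_EDGES = [
    ("Q1", "Q2"),
    ("Q2", "Q3"),
    ("Q1", "Q4"),
    ("Q4", "Q5"),
    ("Q5", "Q3"),
]

if __name__ == "__main__":
    print(shortest_path_length(TOY_EDGES, "Q1", "Q3"))
```

Because the graph is treated as undirected and every edge has length one, the resulting function is symmetric and satisfies the triangle inequality on any connected component. The sketch only works when the relevant part of the graph is finite and explicitly available, which is exactly the property the abstract says fails for general conjunctive queries.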
