Towards Tractability of the Diversity of Query Answers: Ultrametrics to the Rescue

Marcelo Arenas, Timo Camillo Merkl, Reinhard Pichler, Cristian Riveros

TL;DR

This work studies how to extract a small, representative, and diverse subset of query answers when the full result set is potentially enormous. It advocates ultrametrics as a principled foundation for diversity, showing that if a diversity function $\delta$ extends an ultrametric and is (weakly) subset-monotone, the $k$-diverse subset problem becomes tractable in many practically relevant cases, including explicit and implicit representations and for acyclic conjunctive queries. The paper provides NP-hardness results that delineate the boundaries of tractability and presents efficient algorithms, such as a bottom-up dynamic program on the ultrametric tree for explicit representations and a greedy top-down approach for implicit representations, plus quasilinear-time ACQ evaluation under the $u_{\text{rel}}$ metric. It also demonstrates how these techniques yield scalable methods for obtaining diverse answers to ACQs under various settings, including restrictions like the absence of disruptive trios. Overall, the results map the landscape of when diversity-based output is computationally feasible and offer practical algorithms for producing diverse query results in real systems.

Abstract

The set of answers to a query may be very large, potentially overwhelming users when presented with the entire set. In such cases, presenting only a small subset of the answers to the user may be preferable. A natural requirement for this subset is that it should be as diverse as possible to reflect the variety of the entire population. To achieve this, the diversity of a subset is measured using a metric that determines how different two solutions are and a diversity function that extends this metric from pairs to sets. In the past, several studies have shown that finding a diverse subset from an explicitly given set is intractable even for simple metrics (like Hamming distance) and simple diversity functions (like summing all pairwise distances). This complexity barrier becomes even more challenging when trying to output a diverse subset from a set that is only implicitly given such as the query answers of a query and a database. Until now, tractable cases have been found only for restricted problems and particular diversity functions. To overcome these limitations, we focus on the notion of ultrametrics, which have been widely studied and used in many applications. Starting from any ultrametric $d$ and a diversity function $\delta$ extending $d$, we provide sufficient conditions over $\delta$ for having polynomial-time algorithms to construct diverse answers. To the best of our knowledge, these conditions are satisfied by all diversity functions considered in the literature. Moreover, we complement these results with lower bounds that show specific cases when these conditions are not satisfied and finding diverse subsets becomes intractable. We conclude by applying these results to the evaluation of conjunctive queries, demonstrating efficient algorithms for finding a diverse subset of solutions for acyclic conjunctive queries when the attribute order is used to measure diversity.
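To make the vocabulary of the abstract concrete, the following is a minimal Python sketch, not code from the paper. It shows the Hamming distance, two diversity functions that extend a metric from pairs to sets (sum and minimum of pairwise distances), a check of the strong triangle inequality $d(x,z) \le \max(d(x,y), d(y,z))$ that characterizes ultrametrics, and a brute-force search for the most diverse $k$-subset. All function names are illustrative assumptions, and the brute-force search is exponential in general; it only illustrates the problem, not the paper's polynomial-time algorithms.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def is_ultrametric(points, d):
    """Check the strong triangle inequality on every triple of points."""
    return all(
        d(x, z) <= max(d(x, y), d(y, z))
        for x in points for y in points for z in points
    )

def delta_sum(subset, d):
    """Diversity as the sum of all pairwise distances."""
    return sum(d(a, b) for a, b in combinations(subset, 2))

def delta_min(subset, d):
    """Diversity as the smallest pairwise distance (needs >= 2 elements)."""
    return min(d(a, b) for a, b in combinations(subset, 2))

def most_diverse_bruteforce(points, k, d, delta):
    """Try every k-subset; feasible only for tiny inputs."""
    return max(combinations(points, k), key=lambda s: delta(s, d))

words = ["000", "001", "011", "111"]
# Hamming distance is a metric, but this triple violates the strong
# triangle inequality, so it is not an ultrametric in general.
print(is_ultrametric(["000", "001", "011"], hamming))
print(most_diverse_bruteforce(words, 2, hamming, delta_sum))
```

Both `delta_sum` and `delta_min` agree with the underlying distance on two-element sets, which is the sense in which a diversity function "extends" a metric.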

Paper Structure (40 sections, 13 theorems, 62 equations, 1 figure, 2 algorithms)

Key Result

Theorem 3.1

The $\mathtt{DiversityComputation}[\delta_{\operatorname{W}}]$ problem of the Weitzman diversity function $\delta_{\operatorname{W}}$ is $\mathsf{NP}$-hard.
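For orientation, here is a hedged Python sketch of the Weitzman diversity function, assuming the paper uses the standard recursive definition $\delta_{\operatorname{W}}(S) = \max_{a \in S} \bigl(\delta_{\operatorname{W}}(S \setminus \{a\}) + \min_{b \in S \setminus \{a\}} d(a, b)\bigr)$ with $\delta_{\operatorname{W}}(S) = 0$ when $|S| \le 1$. The function name `weitzman` is an illustrative choice. The memoized recursion may visit every subset of the input, so it takes exponential time in the worst case, which is in line with the hardness result above for general metrics.

```python
from functools import lru_cache

def weitzman(points, d):
    """Weitzman diversity of a set of hashable points under distance d.

    Assumes the standard recursive definition; runs in exponential time.
    """
    @lru_cache(maxsize=None)
    def go(s):
        # s is a frozenset; sets with at most one element have diversity 0
        if len(s) <= 1:
            return 0
        best = 0
        for a in s:
            rest = s - {a}
            # distance from a to the remaining set is the closest member
            link = min(d(a, b) for b in rest)
            best = max(best, go(rest) + link)
        return best

    return go(frozenset(points))

# Toy example on integers with absolute difference as the distance.
print(weitzman([0, 1, 3], lambda x, y: abs(x - y)))  # prints 4
```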

Figures (1)

  • Figure 1: On the left, a relation $\texttt{CARS}$ where each tuple is a car model. On the right, the ultrametric tree of the ultrametric $\textbf{u}_{\operatorname{rel}}$ over the tuples $S$ in $\texttt{CARS}$. On one side of each ball $B$ (in grey) we display its radius $\operatorname{r}_S(B)$.
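The structure described in the caption can be sketched in Python under stated assumptions. The sketch assumes that $\textbf{u}_{\operatorname{rel}}$ depends only on the first attribute (in the fixed attribute order) on which two tuples differ, here $2^{-i}$ for position $i$, and that the radius of a ball is the largest distance between two of its members; the exact definitions in the paper may differ. The tuples are made up for illustration and are not the contents of Figure 1.

```python
from fractions import Fraction
from itertools import combinations

def u_rel(a, b):
    """Assumed distance: 2^-i for the first differing position i, else 0."""
    for i, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return Fraction(1, 2 ** i)
    return Fraction(0)

def ultrametric_tree(points, d):
    """Return (radius, members, children) for a set of distinct points.

    Children are the classes of points at distance < radius from each
    other; under an ultrametric this relation is an equivalence, so
    comparing against one representative per class is enough.
    """
    points = list(points)
    if len(points) == 1:
        return (Fraction(0), points, [])
    radius = max(d(a, b) for a, b in combinations(points, 2))
    classes = []
    for p in points:
        for c in classes:
            if d(p, c[0]) < radius:
                c.append(p)
                break
        else:
            classes.append([p])
    return (radius, points, [ultrametric_tree(c, d) for c in classes])

# Hypothetical tuples (brand, model, colour), not taken from the paper.
cars = [
    ("Toyota", "Corolla", "red"),
    ("Toyota", "Corolla", "blue"),
    ("Toyota", "Yaris", "red"),
    ("Honda", "Civic", "red"),
]
root = ultrametric_tree(cars, u_rel)
print(root[0], [child[0] for child in root[2]])
```

Each node of the returned structure corresponds to one ball, and a subset that picks tuples from different children of a high node is spread across large distances.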

Theorems & Definitions (37)

  • Theorem 3.1
  • proof : Proof Sketch
  • Theorem 3.2
  • proof : Proof Sketch
  • Example 4.1
  • Example 4.2
  • Example 4.3
  • Proposition 5.1
  • Theorem 5.2
  • proof : Proof Sketch of Theorem 5.2
  • ...and 27 more