Query Refinement for Diverse Top-$k$ Selection

Felix S. Campbell, Alon Silberstein, Julia Stoyanovich, Yuval Moskovitch

TL;DR

The paper proposes a mixed-integer linear programming (MILP) based solution for modifying the selection conditions of an ORDER BY query so that the result of the modified query closely fits a user-defined notion of diversity while preserving the intent of the original query.

Abstract

Database queries are often used to select and rank items as decision support for many applications. As automated decision-making tools become more prevalent, there is a growing recognition of the need to diversify their outcomes. In this paper, we define and study the problem of modifying the selection conditions of an ORDER BY query so that the result of the modified query closely fits some user-defined notion of diversity while simultaneously maintaining the intent of the original query. We show the hardness of this problem and propose a Mixed Integer Linear Programming (MILP) based solution. We further present optimizations designed to enhance the scalability and applicability of the solution in real-life scenarios. We investigate the performance characteristics of our algorithm and show its efficiency and the usefulness of our optimizations.

Paper Structure (17 sections, 4 theorems, 20 equations, 9 figures, 6 tables)

This paper contains 17 sections, 4 theorems, 20 equations, 9 figures, 6 tables.

Key Result

Theorem 1

There exist a dataset $D$, a query $Q$ over $D$, and a constraint set $\mathcal{C}$ such that no refinement of $Q$ satisfies $\mathcal{C}$.
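
One way to see why such an instance can exist is a brute-force check on a toy dataset. The sketch below is illustrative only and is not the paper's proof: the tuples, the `top_k` and `refinements` helpers, and the assumption that a refinement changes the set of allowed `Activity` values and the `GPA` threshold are all hypothetical. The constraint asks for at least one `Female` tuple in the top-2, but the data contains none, so every refinement fails.

```python
from itertools import product

# Hypothetical toy dataset D: (id, gender, activity, gpa).
D = [
    ("t1", "Male", "RB", 3.9),
    ("t2", "Male", "SO", 3.8),
    ("t3", "Male", "RB", 3.6),
    ("t4", "Male", "SO", 3.5),
]

def top_k(activities, gpa_min, k):
    """Evaluate one refinement:
    SELECT * WHERE Activity IN activities AND GPA >= gpa_min
    ORDER BY GPA DESC LIMIT k
    """
    selected = [t for t in D if t[2] in activities and t[3] >= gpa_min]
    return sorted(selected, key=lambda t: t[3], reverse=True)[:k]

def satisfies(result, gender, lower_bound):
    """Cardinality constraint: at least lower_bound tuples of the group."""
    return sum(1 for t in result if t[1] == gender) >= lower_bound

def refinements():
    """Enumerate candidate refinements (assumed refinement space):
    every non-empty set of Activity values and every GPA value in D."""
    activity_values = sorted({t[2] for t in D})
    gpa_values = sorted({t[3] for t in D})
    for mask in product([False, True], repeat=len(activity_values)):
        acts = {a for a, keep in zip(activity_values, mask) if keep}
        if not acts:
            continue
        for gpa_min in gpa_values:
            yield acts, gpa_min

# Refinements whose top-2 contains at least one Female tuple.
feasible = [
    (acts, gpa_min)
    for acts, gpa_min in refinements()
    if satisfies(top_k(acts, gpa_min, 2), "Female", 1)
]
print(feasible)  # prints [] because D has no Female tuples
```

Exhaustive enumeration like this is only practical for tiny inputs, since the number of candidate refinements grows exponentially with the number of categorical values.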

Figures (9)

  • Figure 1: Summary of our MILP model
  • Figure 2: Diagram illustrating the expression generation for our running example. The predicate Activity = 'RB' AND GPA $\geq$ 3.7 generates the variables $Activity_{SO}$ and $GPA_{3.7, \geq}$ as 'SO' and $3.7$ are values that appear for those attributes respectively in the database $D$. $C_{GPA, \geq}$ is also generated by the predicate to hold the new constant of the predicate, and constrains the value of $GPA_{3.7, \geq}$ by (\ref{eq:value_bounds_inline}). The tuple $t_6\in \widetilde{Q}$ generates the variable $r_6$, whose value is constrained through (\ref{eq:tuple_in_ranking_inline}) by the values of $Activity_{SO}$ and $GPA_{3.7, \geq}$ due to its lineage. It also generates the variable $s_6$, which is then constrained by the value of the $r_t$ values for the tuples that rank better than it, i.e., $r_{t_{1..5}}$, through (\ref{eq:position_in_ranking_inline}). Finally, the constraint $\ell_{Gender='Female', k=6} = 3$ combines with $t_6$ to generate the variable $l_{t_{6},6}$ which is constrained by the value of $s_{t_6}$ by (\ref{eq:in_prefix_inline}). The constraint generates the variable $E_{Gender='Female',6}$ which is constrained through (\ref{eq:tuples_to_satisfy_inline}) by the values of all the $l_{t, 6}$ variables for which $t$ is a part of the group (listed in \ref{ex:deviation}).
  • Figure 3: Running time of compared algorithms, for cases where computation completed within a 1-hour timeout (method or distance omitted when timed out). MILP+opt consistently outperforms other methods.
  • Figure 4: Running time vs. $k^*$, showing $DIS_{pred}$ is often the fastest to compute, while $DIS_{Kendall}$ can be sensitive to increasing $k^*$.
  • Figure 5: Running time vs. maximum deviation ($\varepsilon$), showing that the effect of $\varepsilon$ is limited.
  • ...and 4 more figures

Theorems & Definitions (6)

  • Theorem 1
  • Definition 1: Deviation
  • Definition 2: Best Approximation Refinement
  • Theorem 2
  • Lemma 1
  • Theorem 3: Solution correctness