Faster Relational Algorithms Using Geometric Data Structures

Aryan Esmailpour; Stavros Sintos

Abstract

Optimization tasks over relational data, such as clustering, often suffer from the prohibitive cost of join operations, which are necessary to access the full dataset. While geometric data structures like BBD trees yield fast approximation algorithms in the standard computational setting, their application to relational data remains unclear due to the size of the join output. In this paper, we introduce a framework that leverages geometric insights to design faster algorithms when the data is stored as the results of a join query in a relational database. Our core contribution is the development of the RBBD tree, a randomized variant of the BBD tree tailored for relational settings. Instead of completely constructing the RBBD tree, by leveraging efficient sampling and counting techniques over relational joins, we enable on-the-fly efficient expansion of the RBBD tree, maintaining only the necessary parts. This allows us to simulate geometric query procedures without materializing the join result. As an application, we present algorithms that improve the state-of-the-art for relational $k$-center/means/median clustering by a factor of $k$ in running time while maintaining the same approximation guarantees. Our method is general and can be applied to various optimization problems in the relational setting.
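To make the idea of counting over a join without materializing it concrete, here is a minimal sketch for a toy two-relation join $R(A,B) \bowtie S(B,C)$ restricted to a box. It is not the paper's oracle: the schema, the function names `count_join_in_box` and `count_by_materializing`, and the sample data are all assumptions chosen for illustration. The count is obtained by aggregating per join key, so the join result is never enumerated.

```python
from collections import Counter


def in_range(value, lo, hi):
    """Closed-interval membership test."""
    return lo <= value <= hi


def count_join_in_box(R, S, box):
    """Count tuples (a, b, c) of R(A,B) JOIN S(B,C) that fall inside `box`.

    R is a list of (a, b) pairs, S is a list of (b, c) pairs, and box is
    ((a_lo, a_hi), (b_lo, b_hi), (c_lo, c_hi)). Hypothetical toy schema.
    """
    (a_lo, a_hi), (b_lo, b_hi), (c_lo, c_hi) = box
    # Per join key b: number of R-tuples meeting the A and B constraints.
    r_counts = Counter(
        b for a, b in R
        if in_range(a, a_lo, a_hi) and in_range(b, b_lo, b_hi)
    )
    # Per join key b: number of S-tuples meeting the C constraint.
    s_counts = Counter(b for b, c in S if in_range(c, c_lo, c_hi))
    # Each qualifying R-tuple pairs with every qualifying S-tuple on the same key.
    return sum(n * s_counts[b] for b, n in r_counts.items())


def count_by_materializing(R, S, box):
    """Reference implementation that enumerates the full join (for checking only)."""
    (a_lo, a_hi), (b_lo, b_hi), (c_lo, c_hi) = box
    return sum(
        1
        for a, b in R
        for b2, c in S
        if b == b2
        and in_range(a, a_lo, a_hi)
        and in_range(b, b_lo, b_hi)
        and in_range(c, c_lo, c_hi)
    )


R = [(1, 1), (2, 1), (5, 2), (3, 3)]
S = [(1, 10), (1, 20), (2, 10), (3, 30)]
box = ((0, 3), (1, 3), (10, 20))
print(count_join_in_box(R, S, box))  # 4
```

The aggregated version touches each input tuple once, whereas the reference version does work proportional to the product of the relation sizes.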

Paper Structure (80 sections, 28 theorems, 2 equations, 7 figures, 1 table, 2 algorithms)

Introduction
Notation and problem definition
Clustering
Relative approximation
Coreset
Remark
Related work
Our results
Preliminaries
Box
$\varepsilon$-sample
Oracles for aggregation queries
BBD tree
Overview
Previous methods
...and 65 more sections

Key Result

lemma 1

Let $\mathbf{D}$ be a database instance of $N$ tuples (with numerical attributes) and let $\boldsymbol{q}$ be an acyclic query over $\mathbf{D}$. Let $\rho$ be a box in $\mathbb{R}^d$. There exists an algorithm ...

Footnote: The running times of the oracles hold assuming perfect hashing. We assume perfect hashing for simplicity; a standard randomized hash table yields the same asymptotic bounds with high probability.

Figures (7)

Figure 1: Partition of a partially constructed BBD tree over the set of black points. Red circle represents the query ball $\mathcal{B}(x,r)$ while the larger dashed circle represents $\mathcal{B}(x,(1+\varepsilon)r)$.
Figure 2: Partially constructed BBD tree over the black points of Figure 1. A node $u$ of the form $[X,Y], c$ has a region $\square_u$ defined by the opposite corners $X, Y$, and $|P_u|=|\square_u\cap P|=c$. Nodes with $[X,Y]^*,c$ correspond to nodes whose region is a box with a hole. The triangles below certain nodes denote subtrees omitted for simplicity and clarity. Red nodes indicate the set of canonical nodes $\mathcal{U}(x,r)$ corresponding to the ball $\mathcal{B}(x,r)$ shown in Figure 1.
Figure 3: The algorithm first arbitrarily selects the red center (left). Using the BBD tree, it identifies the canonical nodes (three red dashed rectangles), which contain all points within distance $2r$ from the red center and three points within distance $(1+\varepsilon)2r$. These canonical nodes are then marked as inactive. Next, the algorithm selects the green point (right), since it lies in an active region, and computes the canonical node within distance $2r$ (green dashed rectangle), which is subsequently marked as inactive. In the end, all points lie in inactive regions, so the algorithm terminates, returning a $(2+\varepsilon)$-approximation.
Figure 4: All the cells are midpoint boxes.
Figure 5: Illustration of finding the children $v, w$ of a node $u$ by the centroid shrink process. The region $\square_v$ is a box with a hole while $\square_w$ is a box.
...and 2 more figures
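
The greedy procedure described in the caption of Figure 3 can be sketched as follows for a fixed radius guess $r$. This is a simplified illustration, not the paper's algorithm: it scans an explicit point list and uses exact distance tests, whereas the caption describes using canonical nodes of a BBD tree, which may also deactivate some points at distance up to $(1+\varepsilon)2r$. The function name `greedy_cover` and the sample points are assumptions.

```python
import math


def greedy_cover(points, r):
    """Pick centers greedily so that every point is within 2*r of some center.

    Repeatedly takes an arbitrary still-active point as a center and
    deactivates every point within distance 2*r of it.
    """
    active = list(points)
    centers = []
    while active:
        center = active[0]  # arbitrary active point
        centers.append(center)
        # Keep only points that are still farther than 2*r from the new center.
        active = [p for p in active if math.dist(p, center) > 2 * r]
    return centers


points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
centers = greedy_cover(points, r=1)
print(centers)  # [(0, 0), (10, 10)]
```

Any two chosen centers are more than $2r$ apart, so if more than $k$ centers are returned, no $k$-center solution of radius $r$ exists; otherwise the returned centers cover all points within radius $2r$.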

Theorems & Definitions (30)

definition 1
lemma 1: esmailpour2024improved
theorem 1
lemma 2
lemma 3
lemma 4
lemma 5
lemma 6
lemma 7
theorem 2
...and 20 more
