A Unified Approach for Multi-Granularity Search over Spatial Datasets

Wenzhe Yang, Sheng Wang, Shixun Huang, Hao Liu, Yuan Sun, Juliana Freire, Zhiyong Peng

TL;DR

The paper tackles the challenge of unified multi-granularity spatial search by introducing Spadas, a two-level unified index that simultaneously supports dataset-range, exemplar dataset, and data point searches with multiple distance metrics. It combines a bottom-level index per dataset with an upper-level data repository index, augmented by parameter-free outlier removal and bottom-up refinement to mitigate outlier impact. The authors propose fast bound estimation and error-bounded approximate Hausdorff computations to accelerate top-$k$ exemplar searches, along with batch pruning strategies, achieving orders-of-magnitude speedups over state-of-the-art baselines in six real-world repositories. An online Spadas system and a comprehensive experimental evaluation demonstrate strong performance, robustness to outliers, and scalability to large, high-dimensional spatial data collections.

Abstract

There has been increased interest in data search as a means to find relevant datasets or data points in data lakes and repositories. Although approaches have been proposed to support spatial dataset search and data point search, they consider the two types of search independently. To enable search operations ranging from the coarse-grained dataset level to the fine-grained data point level, we provide an integrated approach that supports diverse query types and distance metrics. In this paper, we focus on designing a multi-granularity spatial data search system, called Spadas, that supports both dataset and data point search operations. To address the high cost of indexing and the susceptibility to outliers, we propose a unified index that drastically improves query efficiency in various scenarios by organizing data sensibly and removing outliers from datasets. Moreover, to accelerate all data search operations, we propose a set of pruning mechanisms based on the unified index, including fast bound estimation, an approximation technique with an error bound, and batch pruning, to effectively filter out non-relevant datasets and points. Finally, we report the results of a detailed experimental evaluation on six spatial data repositories, in which Spadas runs orders of magnitude faster than state-of-the-art algorithms, and we demonstrate its effectiveness through a case study. An online spatial data search system based on Spadas is also implemented and made accessible to users.

Paper Structure

This paper contains 38 sections, 1 theorem, 5 equations, 26 figures, 3 tables, 3 algorithms.

Key Result

Lemma 1

Given a threshold $\epsilon$, the approximate algorithm guarantees an error of $2\epsilon$ for the Hausdorff distance.
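
One way to see why a $2\epsilon$ bound of this kind can hold is the following minimal sketch. It is not the paper's algorithm: it assumes 2-D points, uses the directed Hausdorff distance (the maximum nearest-neighbor distance from $Q$ to $D$, as in the caption of Figure 2), and replaces each point with the center of a grid cell chosen so that no point moves by more than $\epsilon$. The helper names `snap_to_grid` and `approx_directed_hausdorff` are hypothetical. Because every point of $Q$ and every point of $D$ moves by at most $\epsilon$, the triangle inequality shifts each nearest-neighbor distance by at most $2\epsilon$, and so the maximum shifts by at most $2\epsilon$ as well.

```python
import math
import random


def directed_hausdorff(Q, D):
    """Exact directed Hausdorff distance: the largest distance from a
    point of Q to its nearest neighbor in D (brute force, O(|Q||D|))."""
    return max(min(math.dist(q, d) for d in D) for q in Q)


def snap_to_grid(points, eps):
    """Hypothetical helper: replace each 2-D point with the center of its
    grid cell. With cell side eps * sqrt(2), the half-diagonal of a cell
    is eps, so every point is moved by at most eps."""
    side = eps * math.sqrt(2)
    centers = set()
    for x, y in points:
        cx = (math.floor(x / side) + 0.5) * side
        cy = (math.floor(y / side) + 0.5) * side
        centers.add((cx, cy))
    return list(centers)


def approx_directed_hausdorff(Q, D, eps):
    """Approximate distance computed on the (usually much smaller) sets
    of cell centers; differs from the exact value by at most 2 * eps."""
    return directed_hausdorff(snap_to_grid(Q, eps), snap_to_grid(D, eps))


# Synthetic data, for illustration only.
random.seed(0)
Q = [(random.random(), random.random()) for _ in range(200)]
D = [(random.random() + 0.3, random.random()) for _ in range(200)]
eps = 0.05

exact = directed_hausdorff(Q, D)
approx = approx_directed_hausdorff(Q, D, eps)
print(f"exact={exact:.4f} approx={approx:.4f} bound={2 * eps:.4f}")
```

The trade-off in this sketch is the usual one: a larger $\epsilon$ merges more points into each representative and reduces the number of pairwise distance computations, at the cost of a looser guarantee.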

Figures (26)

  • Figure 1: An illustration of our multi-granularity search over spatial datasets, which seamlessly integrates dataset search (including (1) and (2), based on three distance metrics) and data point search (including (3) and (4)).
  • Figure 2: The illustration of three distance metrics, where (a) shows that the IA of $Q$ and $D_1$ is the size of their minimum bounding rectangles' overlapping areas, i.e., the rectangle filled with red lines, (b) shows that the GBO of $Q$ and $D_1$ is 1 since there is one overlapping cell between $Q$ and $D_1$, and (c) shows that the Haus of $Q$ and $D_1$ is the maximum nearest neighbor distance, i.e., the distance from the point $p_1$ of $Q$ to the point $p_1'$ of $D_1$.
  • Figure 3: Architecture of Spadas.
  • Figure 4: An overview of our unified index.
  • Figure 5: The distributions of radii of leaf nodes before and after outlier removal on two data repositories.
  • ...and 21 more figures
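
The three measures described in the caption of Figure 2 can be sketched in a few lines. This is a brute-force illustration under stated assumptions, not the indexed computation used in Spadas: points are 2-D tuples, the grid is uniform with a caller-chosen cell size, and the function names (`intersecting_area`, `grid_based_overlap`, `hausdorff`) are hypothetical. The functions return the raw quantities named in the caption (overlap area, number of shared cells, maximum nearest-neighbor distance).

```python
import math


def mbr(points):
    """Minimum bounding rectangle as (x_min, y_min, x_max, y_max)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def intersecting_area(Q, D):
    """IA: area of the overlap between the MBRs of Q and D (0 if disjoint)."""
    qx0, qy0, qx1, qy1 = mbr(Q)
    dx0, dy0, dx1, dy1 = mbr(D)
    width = min(qx1, dx1) - max(qx0, dx0)
    height = min(qy1, dy1) - max(qy0, dy0)
    return max(width, 0.0) * max(height, 0.0)


def grid_cells(points, cell):
    """Set of grid cells (integer column/row pairs) occupied by the points."""
    return {(math.floor(x / cell), math.floor(y / cell)) for x, y in points}


def grid_based_overlap(Q, D, cell):
    """GBO: number of grid cells occupied by both Q and D."""
    return len(grid_cells(Q, cell) & grid_cells(D, cell))


def hausdorff(Q, D):
    """Haus: maximum, over points of Q, of the distance to the nearest
    point of D (the directed form described in the caption)."""
    return max(min(math.dist(q, d) for d in D) for q in Q)


# Toy data chosen by hand, for illustration only.
Q = [(0.5, 0.5), (2.5, 2.5)]
D = [(1.5, 1.5), (2.2, 2.8), (3.5, 3.5)]

ia = intersecting_area(Q, D)
gbo = grid_based_overlap(Q, D, cell=1.0)
haus = hausdorff(Q, D)
print(ia, gbo, haus)
```

On this toy input the two bounding rectangles share a unit square, the datasets share exactly one grid cell, and the Hausdorff value is set by the point of `Q` at (0.5, 0.5), whose nearest neighbor in `D` is (1.5, 1.5).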

Theorems & Definitions (14)

  • Example 1
  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Definition 8
  • Definition 9
  • ...and 4 more