A Unified Approach for Multi-Granularity Search over Spatial Datasets
Wenzhe Yang, Sheng Wang, Shixun Huang, Hao Liu, Yuan Sun, Juliana Freire, Zhiyong Peng
TL;DR
The paper tackles the challenge of unified multi-granularity spatial search by introducing Spadas, a two-level unified index that simultaneously supports dataset-range, exemplar dataset, and data point searches with multiple distance metrics. It combines a bottom-level index per dataset with an upper-level data repository index, augmented by parameter-free outlier removal and bottom-up refinement to mitigate the impact of outliers. The authors propose fast bound estimation and error-bounded approximate Hausdorff computations to accelerate top-$k$ exemplar searches, along with batch pruning strategies, achieving orders-of-magnitude speedups over state-of-the-art baselines on six real-world repositories. An online Spadas system and a comprehensive experimental evaluation demonstrate strong performance, robustness to outliers, and scalability to large, high-dimensional spatial data collections.
Abstract
There has been increased interest in data search as a means to find relevant datasets or data points in data lakes and repositories. Although approaches have been proposed to support spatial dataset search and data point search, they consider the two types of search independently. To enable search operations ranging from the coarse-grained dataset level to the fine-grained data point level, we provide an integrated approach that supports diverse query types and distance metrics. In this paper, we focus on designing a multi-granularity spatial data search system, called Spadas, that supports both dataset and data point search operations. To address the high cost of indexing and the susceptibility to outliers, we propose a unified index that drastically improves query efficiency in various scenarios by organizing data at both the dataset and repository levels and by removing outliers from datasets. Moreover, to accelerate all data search operations, we propose a set of pruning mechanisms based on the unified index, including fast bound estimation, an approximation technique with an error bound, and batch pruning, to effectively filter out non-relevant datasets and points. Finally, we report the results of a detailed experimental evaluation on six spatial data repositories, in which Spadas runs orders of magnitude faster than state-of-the-art algorithms, and we demonstrate its effectiveness through a case study. An online spatial data search system based on Spadas is also implemented and made accessible to users.
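To make the idea of bound-based pruning for top-$k$ exemplar search concrete, the following is a minimal sketch and not the Spadas implementation. It assumes Euclidean distance, computes the exact Hausdorff distance by brute force, and uses the minimum distance between minimum bounding rectangles (MBRs) as a simple lower bound; this bound is an illustrative stand-in for the paper's fast bound estimation. All function names (`hausdorff`, `mbr_min_dist`, `top_k_exemplar`) are hypothetical.

```python
import heapq
import math


def directed_hausdorff(a, b):
    # Largest distance from any point in a to its nearest neighbour in b.
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    # Symmetric Hausdorff distance between two point sets (brute force).
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def mbr(points):
    # Minimum bounding rectangle as (lower corner, upper corner).
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    return lo, hi


def mbr_min_dist(box_a, box_b):
    # Minimum distance between two boxes. Every point-to-point distance
    # is at least this value, so it lower-bounds the Hausdorff distance.
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    gaps = [max(lo_b[d] - hi_a[d], lo_a[d] - hi_b[d], 0.0)
            for d in range(len(lo_a))]
    return math.sqrt(sum(g * g for g in gaps))


def top_k_exemplar(query, datasets, k):
    # Assumption: datasets is a dict mapping a name to a list of points.
    q_box = mbr(query)
    # Visit candidates in ascending order of their lower bound.
    candidates = sorted(
        (mbr_min_dist(q_box, mbr(pts)), name) for name, pts in datasets.items()
    )
    heap = []       # max-heap of the best k so far, stored as (-distance, name)
    evaluated = 0   # number of exact Hausdorff computations performed
    for lower_bound, name in candidates:
        # Once the bound reaches the current k-th best distance, no
        # remaining candidate can improve the result.
        if len(heap) == k and lower_bound >= -heap[0][0]:
            break
        dist = hausdorff(query, datasets[name])
        evaluated += 1
        if len(heap) < k:
            heapq.heappush(heap, (-dist, name))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, name))
    return sorted((-neg, name) for neg, name in heap), evaluated


query = [(0, 0), (1, 0)]
datasets = {
    "near": [(0, 0), (1, 1)],
    "mid": [(3, 0), (4, 0)],
    "far": [(100, 100), (101, 100)],
}
results, evaluated = top_k_exemplar(query, datasets, k=1)
print(results, evaluated)
```

In this toy repository only one exact Hausdorff computation is needed for `k=1`, because the lower bounds of the remaining datasets already exceed the best distance found. A tighter bound or an index over the MBRs would prune more aggressively; the sketch only shows the pruning principle.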
