Joinable Search over Multi-source Spatial Datasets: Overlap, Coverage, and Efficiency
Wenzhe Yang, Sheng Wang, Zhiyu Chen, Yuan Sun, Zhiyong Peng
TL;DR
This work tackles joinable search over multi-source spatial datasets by defining two problems: overlap joinable search (OJSP) and coverage joinable search (CJSP). It introduces the DIstributed Tree-based Spatial index (DITS), which combines a global index with per-source local indices to accelerate both problems while reducing inter-source communication. CJSP is proven NP-hard, and the authors provide a greedy $(1-1/e)$-approximation algorithm (CoverageSearch) for it, along with an exact method for OJSP (OverlapSearch) that prunes candidates using upper and lower bounds. Empirical evaluation on five real-world data sources shows that the framework significantly reduces running time and communication costs compared with baselines, supporting both the indexing framework and the search algorithms. The framework enables scalable, cross-source spatial data discovery and integration, with potential extensions to data pricing and dynamic data updates.
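To make the bound-based pruning idea concrete, here is a minimal sketch of a top-k overlap search. It is not the paper's OverlapSearch and does not use DITS: it assumes each dataset is already rasterized into a set of grid-cell IDs, measures overlap as the number of shared cells, and uses the simple size-based upper bound min(|Q|, |D|). The function name `overlap_search` and the data layout are hypothetical.

```python
import heapq


def overlap_search(query, datasets, k):
    """Return up to k (overlap, name) pairs with the largest cell overlap.

    query    -- set of grid-cell IDs (assumed representation)
    datasets -- dict mapping dataset name -> set of grid-cell IDs
    The result is exact up to ties at the k-th position.
    """
    if k <= 0:
        return []
    heap = []  # min-heap holding the current top-k (overlap, name) pairs

    def upper_bound(cells):
        # Overlap can never exceed the size of either set.
        return min(len(query), len(cells))

    # Visit candidates by decreasing upper bound so the scan can stop early.
    order = sorted(datasets.items(), key=lambda kv: upper_bound(kv[1]), reverse=True)
    for name, cells in order:
        if len(heap) == k and upper_bound(cells) <= heap[0][0]:
            break  # no remaining candidate can beat the current k-th best
        overlap = len(query & cells)
        if len(heap) < k:
            heapq.heappush(heap, (overlap, name))
        elif overlap > heap[0][0]:
            heapq.heapreplace(heap, (overlap, name))
    return sorted(heap, reverse=True)
```

With the query `{1, 2, 3, 4}` and datasets `a = {1, 2, 3, 4, 5}`, `b = {3, 4}`, `c = {9}`, `d = {1, 2, 3, 10, 11}`, calling `overlap_search(query, datasets, 2)` returns `[(4, 'a'), (3, 'd')]`, and the scan stops before computing the intersections for `b` and `c`.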
Abstract
The search for joinable data is pivotal for numerous applications, such as data integration, data augmentation, and data analysis. Although there have been many successful studies of joinable search for table discovery, finding joinable spatial datasets for a given query across multiple spatial data sources has received little attention. This paper studies two joinable search problems over multiple spatial data sources. In addition to the overlap joinable search problem (OJSP), we propose a novel coverage joinable search problem (CJSP), which has not been considered before and is motivated by many real-world applications in spatial search. To support both problems seamlessly, we propose a multi-source spatial dataset search framework. First, we design a DIstributed Tree-based Spatial index structure called DITS, which serves both as the basis for acceleration strategies that speed up joinable searches and as support for efficient communication between data sources. Second, we prove that the CJSP is NP-hard and design a greedy approximation algorithm to solve it. We evaluate the efficiency of our search framework on five real-world data sources, and the experimental results show that our framework significantly reduces running time and communication costs compared with baselines.
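The greedy strategy mentioned for the CJSP can be illustrated with the classic greedy algorithm for maximum coverage. The sketch below makes several assumptions that are not stated here: datasets are sets of grid-cell IDs, the objective is the number of distinct cells covered by the query together with at most k selected datasets, and no further constraints apply (the paper's actual problem definition may add some). The name `greedy_coverage` is hypothetical. For this plain cardinality-constrained objective, greedy selection is known to achieve a $(1-1/e)$ approximation ratio.

```python
def greedy_coverage(query, datasets, k):
    """Greedily pick up to k datasets that add the most new cells.

    query    -- set of grid-cell IDs already covered (assumed representation)
    datasets -- dict mapping dataset name -> set of grid-cell IDs
    Returns (chosen names in selection order, set of covered cells).
    """
    covered = set(query)
    remaining = dict(datasets)  # shallow copy so the caller's dict is untouched
    chosen = []
    for _ in range(k):
        best_name, best_gain = None, 0
        for name, cells in remaining.items():
            gain = len(cells - covered)  # marginal gain: newly covered cells
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:
            break  # no remaining dataset adds any new cell
        chosen.append(best_name)
        covered |= remaining.pop(best_name)
    return chosen, covered
```

For example, with the query `{1, 2}` and datasets `a = {3, 4, 5}`, `b = {5, 6}`, `c = {6, 7, 8, 9}`, a budget of k = 2 selects `c` first (four new cells) and then `a` (three new cells), covering nine cells in total.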
