SOLAR: Scalable Distributed Spatial Joins through Learning-based Optimization

Yongyi Liu, Ahmed Mahmood, Amr Magdy, Minyao Zhu

TL;DR

SOLAR addresses redundant partitioning in distributed spatial joins by learning dataset embeddings in an offline phase and reusing partitions in an online phase. It employs a Siamese neural network to map dataset metadata embeddings to a similarity space and a random forest decision model to decide partitioner reuse, enabling efficient handling of unseen joins. Empirically, SOLAR achieves up to 3.6X overall join speedup and 2.71X partitioning speedup over state-of-the-art baselines across real-world datasets. This reuse-driven approach reduces partitioning overhead and demonstrates a scalable strategy for self-improving spatial data systems.

Abstract

The proliferation of location-based services has led to massive spatial data generation. Spatial join is a crucial database operation that identifies pairs of objects from two spatial datasets based on spatial relationships. Due to the intensive computational demands, spatial joins are often executed in a distributed manner across clusters. However, current systems fail to recognize similarities in the partitioning of spatial data, leading to redundant computations and increased overhead. Recently, incorporating machine learning optimizations into database operations has enhanced efficiency in traditional joins by predicting optimal strategies. However, applying these optimizations to spatial joins poses challenges due to the complex nature of spatial relationships and the variability of spatial data. This paper introduces SOLAR, scalable distributed spatial joins through learning-based optimization. SOLAR operates through offline and online phases. In the offline phase, it learns balanced spatial partitioning based on the similarities between datasets in query workloads seen so far. In the online phase, when a new join query is received, SOLAR evaluates the similarity between the datasets in the new query and the already-seen workloads using the trained learning model. Then, it decides to either reuse an existing partitioner, avoiding unnecessary computational overhead, or partition from scratch. Our extensive experimental evaluation on real-world datasets demonstrates that SOLAR achieves up to 3.6X speedup in overall join runtime and 2.71X speedup in partitioning time compared to state-of-the-art systems.
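The online reuse decision described in the abstract can be illustrated with a minimal sketch. It is not SOLAR's implementation: plain cosine similarity stands in for the learned Siamese similarity, and a fixed threshold stands in for the random forest decision model. The names `StoredPartitioner`, `choose_partitioner`, the `threshold` value, and the toy embeddings are all hypothetical.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Optional, Sequence


@dataclass(frozen=True)
class StoredPartitioner:
    """A partitioner kept from an earlier join (hypothetical structure)."""
    name: str                    # illustrative identifier, not from the paper
    embedding: Sequence[float]   # embedding of the dataset it was built for


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; an assumed stand-in for the learned similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def choose_partitioner(
    query_embedding: Sequence[float],
    repository: Sequence[StoredPartitioner],
    threshold: float = 0.9,      # assumed cutoff replacing the decision model
) -> Optional[StoredPartitioner]:
    """Return the most similar stored partitioner, or None to partition from scratch."""
    best = None
    best_score = threshold
    for candidate in repository:
        score = cosine_similarity(query_embedding, candidate.embedding)
        if score >= best_score:
            best, best_score = candidate, score
    return best


# Toy repository built during a hypothetical offline phase.
repository = [
    StoredPartitioner("parks", [1.0, 0.0, 0.0]),
    StoredPartitioner("roads", [0.0, 1.0, 0.0]),
]

reused = choose_partitioner([0.9, 0.1, 0.0], repository)     # close to "parks"
from_scratch = choose_partitioner([1.0, 1.0, 0.0], repository)  # close to neither
```

In this sketch a `None` result corresponds to the "partition from scratch" branch, and any other result corresponds to reusing an existing partitioner.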

Paper Structure

This paper contains 28 sections, 9 equations, 10 figures, 1 table, 2 algorithms.

Figures (10)

  • Figure 1: Example of SOLAR Reusing Existing Partitioners in Executing Spatial Join Queries
  • Figure 2: Overall Execution Flow of Distributed Spatial Join
  • Figure 3: Data Histogram Construction
  • Figure 4: Dataset Embedding
  • Figure 5: Siamese Neural Network
  • ...and 5 more figures