Mining Area Skyline Objects from Map-based Big Data using Apache Spark Framework
Chen Li, Ye Zhu, Yang Cao, Jinli Zhang, Annisa Annisa, Debo Cheng, Yasuhiko Morimoto
TL;DR
The paper tackles the computational intensity of area skyline queries on map-based big data. It introduces an Apache Spark-based distributed algorithm that employs three key techniques—local partial skyline extraction, driver-side filter creation, and executor-side filtering—to reduce intermediate data and accelerate skyline computations. Empirical results on eight synthetic datasets show substantial reductions in execution time and data volume, with gains increasing as grid sizes and facility counts grow, demonstrating the method's scalability and practical relevance for location-based decision support. The work highlights Spark's suitability for large-scale, multi-criteria spatial queries and points to real-world deployments in spatial decision-making and related domains.
Abstract
Skyline computation provides a mechanism for using multiple location-based criteria to identify optimal data points. However, these computations become less efficient and more challenging as the input data grows. This study presents a novel algorithm that mitigates this challenge by using Apache Spark, a distributed processing platform, for area skyline computation, improving processing speed and scalability. The algorithm comprises three key phases: computing distances between data points, generating distance tuples, and executing the skyline operators. Notably, the second phase employs a local partial skyline extraction technique to minimize the volume of data transmitted from each executor (a parallel processing procedure) to the driver (a central processing procedure). The driver then processes the received data to determine the final skyline and creates filters to exclude irrelevant points. Extensive experiments on eight datasets show that our algorithm significantly reduces both the data size and the computation time required for area skyline computation.
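The phases described in the abstract can be illustrated with a minimal single-process sketch in plain Python. It is not the authors' implementation and does not use Spark. Several details are assumptions made for illustration: each grid cell is represented by its centre point, a distance tuple holds the Euclidean distance from a cell to the nearest facility of each type, one tuple dominates another when it is no farther for every type and strictly closer for at least one, and the filter is simply the local skyline of the first partition. The names `distance_tuple`, `dominates`, `skyline`, and `area_skyline` are hypothetical.

```python
from math import hypot
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
DistTuple = Tuple[float, ...]
Item = Tuple[Point, DistTuple]


def distance_tuple(cell: Point, facilities: Sequence[Sequence[Point]]) -> DistTuple:
    """Distance from a cell centre to the nearest facility of each type."""
    return tuple(
        min(hypot(cell[0] - fx, cell[1] - fy) for fx, fy in group)
        for group in facilities
    )


def dominates(a: DistTuple, b: DistTuple) -> bool:
    """a dominates b: no farther in every dimension, strictly closer in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(items: Sequence[Item]) -> List[Item]:
    """Naive quadratic skyline: keep items that no other item dominates."""
    return [(c, t) for c, t in items if not any(dominates(u, t) for _, u in items)]


def area_skyline(
    cells: Sequence[Point],
    facilities: Sequence[Sequence[Point]],
    num_partitions: int = 4,
) -> List[Item]:
    # Split the cells into partitions; each stands in for one executor.
    partitions = [cells[i::num_partitions] for i in range(num_partitions)]

    # Assumed filter construction: the "driver" takes the local skyline of
    # the first partition and shares its tuples with the other partitions.
    first = skyline([(c, distance_tuple(c, facilities)) for c in partitions[0]])
    filter_tuples = [t for _, t in first]

    local_skylines = [first]
    for part in partitions[1:]:
        # Phases 1 and 2: compute distances and build the distance tuples.
        tuples = [(c, distance_tuple(c, facilities)) for c in part]
        # Executor-side filtering: drop tuples dominated by a filter tuple.
        survivors = [
            (c, t) for c, t in tuples
            if not any(dominates(f, t) for f in filter_tuples)
        ]
        # Local partial skyline: only these items are sent to the driver.
        local_skylines.append(skyline(survivors))

    # Driver: merge the partial skylines and compute the final skyline.
    merged = [item for part in local_skylines for item in part]
    return skyline(merged)
```

Because domination is transitive and every filter tuple is itself forwarded to the driver, discarding dominated tuples on the executors does not change the final result in this sketch. In a Spark program the per-partition work would plausibly map onto `RDD.mapPartitions` and the filter onto a broadcast variable, though the abstract does not specify this.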
