Evaluating the Impact Of Spatial Features Of Mobility Data and Index Choice On Database Performance

Tim C. Rese, Alexandra Kapp, David Bermbach

TL;DR

The paper examines how index choice, data format, and dataset characteristics jointly affect spatial database performance for moving-object data, using PostGIS as the system under test. It introduces novel metrics for overlap and skew, together with scalable Monte Carlo approximations of them, and designs an application-driven benchmark with synthetic and real-world datasets that compares GiST, SP-GiST, and BRIN across segmented and non-segmented trajectory representations under read and write workloads. Key findings: data format and index choice substantially influence performance; GiST typically delivers the best read performance, while BRIN excels at writes; and the relationship between dataset overlap and the benefit of segmentation is nuanced. The average nearest neighbor distribution showed only limited correlation with performance in this study. The results offer practical guidance to developers for optimizing spatial storage and querying in moving-object scenarios, and they motivate extending the benchmark to additional database systems and spatiotemporal workloads.

Abstract

The growing number of moving Internet-of-Things (IoT) devices has led to a surge in moving object data, powering applications such as traffic routing, hotspot detection, or weather forecasting. When managing such data, spatial database systems offer various index options and data formats, e.g., point-based or trajectory-based. Likewise, dataset characteristics such as geographic overlap and skew can vary significantly. All three significantly affect database performance. While this has been studied in existing papers, none of them explore the effects and trade-offs resulting from a combination of all three aspects. In this paper, we evaluate the performance impact of index choice, data format, and dataset characteristics on a popular spatial database system, PostGIS. We focus on two aspects of dataset characteristics, the degree of overlap and the degree of skew, and propose novel approximation methods to determine these features. We design a benchmark that compares a variety of spatial indexing strategies and data formats, while also considering the impact of dataset characteristics on database performance. We include a variety of real-world and synthetic datasets, write operations, and read queries to cover a broad range of scenarios that might occur during application runtime. Our results offer practical guidance for developers looking to optimize spatial storage and querying, while also providing insights into dataset characteristics and their impact on database performance.

Paper Structure

This paper contains 12 sections, 4 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Moving object data can be stored in various formats. The simplest is to store the raw point data. Alternatively, one can store segments of the trajectory separately (each color represents a separate entry in the database) or store the entire trajectory as one object.
  • Figure 2: Trajectories 1 and 2 have overlapping minimum bounding rectangles (MBRs), but the trajectories themselves do not intersect.
  • Figure 3: The MBR of Traj. 1 overlaps the MBRs of Traj. 2 and 3, while the MBRs of Traj. 2 and 3 do not overlap each other. In graph form, each trajectory is a node with an edge to every trajectory it overlaps. The GOC of this dataset would then be 2/3.
  • Figure 4: These simplified trajectories show how ANN can be applied to trajectories. With a small number of trajectories, calculating this value exactly is still realistic. A large number of trajectories necessitates an approximation to finish the calculation in a reasonable time. Our ANN approximation excludes points that belong to the same trajectory.
  • Figure 5: We include seven datasets in our evaluation: four synthetic and three real-world. The real-world datasets are from the SimRa project, the Deutsche Flugsicherung, and the Piraeus AIS dataset.
  • ...and 5 more figures
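
The overlap graph described for Figures 2 and 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the page does not define the GOC, so the sketch assumes it is the number of overlapping MBR pairs divided by the number of possible pairs, which reproduces the 2/3 quoted for Figure 3 but may differ from the paper's definition. The function names, the coordinates in `figure3_like`, and the pair-sampling estimator are all hypothetical; the sampling variant only hints at the kind of Monte Carlo approximation the TL;DR mentions.

```python
import random
from itertools import combinations


def mbr(points):
    """Minimum bounding rectangle (min_x, min_y, max_x, max_y) of a trajectory."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def mbrs_overlap(a, b):
    """True if two axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def overlap_fraction(trajectories):
    """Exact share of trajectory pairs with overlapping MBRs.

    Assumed reading of the GOC: edges in the overlap graph divided by
    the number of possible edges. The paper's definition may differ.
    """
    boxes = [mbr(t) for t in trajectories]
    pairs = list(combinations(range(len(boxes)), 2))
    if not pairs:
        return 0.0
    edges = sum(mbrs_overlap(boxes[i], boxes[j]) for i, j in pairs)
    return edges / len(pairs)


def overlap_fraction_sampled(trajectories, samples, seed=0):
    """Monte Carlo estimate of the same share from randomly drawn pairs.

    Hypothetical estimator: it avoids enumerating all O(n^2) pairs.
    """
    boxes = [mbr(t) for t in trajectories]
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        i, j = rng.sample(range(len(boxes)), 2)  # two distinct trajectories
        hits += mbrs_overlap(boxes[i], boxes[j])
    return hits / samples


# Made-up coordinates mirroring Figure 3: the MBR of trajectory 1 covers
# trajectories 2 and 3, whose own MBRs are disjoint.
figure3_like = [
    [(0, 0), (10, 10)],  # Traj. 1
    [(1, 1), (3, 2)],    # Traj. 2
    [(6, 6), (8, 9)],    # Traj. 3
]

exact = overlap_fraction(figure3_like)
estimate = overlap_fraction_sampled(figure3_like, samples=5000)
print(exact, estimate)
```

As in Figure 2, an MBR overlap is only a conservative signal: two trajectories can have overlapping rectangles without ever intersecting, so the fraction computed here describes bounding boxes, not the geometries themselves.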