Independent Range Sampling on Interval Data (Longer Version)

Daichi Amagata

TL;DR

This work tackles independent range sampling (IRS) on interval data: given a set $X$ of intervals and a query interval $q$, return $s$ random samples from $q \cap X$ that are independent of the samples of all previous queries. It introduces the augmented interval tree (AIT) and two variants, AIT-V and AWIT, which target the space-efficient and weighted settings. An AIT answers non-weighted IRS in $O(\log^{2} n + s)$ time with $O(n \log n)$ space. Bucketing intervals reduces the space to $O(n)$ while keeping $O(\log^{2} n + s)$ expected time, and weighted IRS takes $O(\log^{2} n + s \log n)$ time. Experiments on real datasets show substantial speedups over state-of-the-art range-search approaches, supporting the practicality of sampling-based analytics over large interval datasets.

Abstract

Many applications require efficient management of large sets of intervals because many objects are associated with intervals (e.g., time and price intervals). In such interval management systems, range search is a primitive operator for retrieval and analysis tasks. As datasets grow, range search results also become larger, which may overwhelm users and incur long computation time. Because applications are usually satisfied with a subset of the result set, it is desirable to efficiently obtain only small samples from the result set. We therefore address the problem of independent range sampling on interval data, which outputs $s$ random samples that overlap a given query interval and are independent of the samples of all previous queries. To solve this problem efficiently, both theoretically and practically, we propose a variant of an interval tree, namely the augmented interval tree (or AIT), and we show that there exists an exact algorithm that needs $O(n \log n)$ space and $O(\log^{2} n + s)$ time, where $n$ is the dataset size. The simple structure of an AIT provides flexible extensions: (i) its time and space complexities respectively become $O(\log^{2} n + s)$ expected and $O(n)$ by bucketing intervals, and (ii) it can deal with weighted intervals and output $s$ weighted random samples in $O(\log^{2} n + s \log n)$ time. We conduct extensive experiments on real datasets, and the results demonstrate that our algorithms significantly outperform competitors.
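To make the problem statement concrete, the following is a minimal baseline sketch, not the AIT algorithm: it runs a full range search and then samples from the result. It assumes closed intervals represented as `(left, right)` tuples and sampling with replacement; the names `overlaps` and `naive_irs` are hypothetical and chosen for illustration only.

```python
import random
from typing import List, Tuple

Interval = Tuple[float, float]  # (left, right), assumed closed


def overlaps(x: Interval, q: Interval) -> bool:
    """Two closed intervals overlap iff each starts before the other ends."""
    return x[0] <= q[1] and q[0] <= x[1]


def naive_irs(X: List[Interval], q: Interval, s: int,
              rng: random.Random) -> List[Interval]:
    """Baseline: materialize q ∩ X in O(n) time, then draw s uniform samples.

    Each sample is drawn independently (with replacement), so the output
    does not depend on the samples returned by earlier queries.
    """
    result = [x for x in X if overlaps(x, q)]
    if not result:
        return []
    return [rng.choice(result) for _ in range(s)]


if __name__ == "__main__":
    X = [(1, 3), (2, 5), (6, 8), (9, 10)]
    print(naive_irs(X, (4, 7), 5, random.Random(0)))
```

The cost of this baseline is dominated by the scan over all $n$ intervals, which is the overhead that a bound of the form $O(\log^{2} n + s)$ avoids when the result set is large.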

Paper Structure (25 sections, 8 theorems, 10 figures, 10 tables, 1 algorithm)

Introduction
Preliminary
Problem Definition
Interval tree
Weighted Sampling Methods
Non-weighted Interval Case
AIT: Augmented Interval Tree
Structure
Construction
IRS Algorithm for Non-weighted Intervals
Observation
Algorithm Description
Analysis
Reducing the Space Complexity
Updates
...and 10 more sections

Key Result

Theorem 1

The space complexity of an AIT is $O(n\log n)$.
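
A bound of this form is plausible under two assumptions that are not stated in this summary: the tree is balanced with height $h = O(\log n)$, and each node $u$ stores a constant number of lists whose total size is proportional to the number of intervals $|X(u)|$ kept in its subtree. Subtrees rooted at the same depth are disjoint, so each depth contributes at most $n$ list entries. This is a sketch under those assumptions, not the paper's proof:

$$\sum_{u} |X(u)| \;=\; \sum_{d=0}^{h} \sum_{u \,:\, \mathrm{depth}(u) = d} |X(u)| \;\le\; \sum_{d=0}^{h} n \;=\; O(n \log n).$$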

Figures (10)

Figure 1: Illustration of our main idea. Node $u_i$ of an interval tree maintains $\{x_1, x_2, x_3, x_4, x_5, x_6\}$, and $L^{l}_{i} = [x_1, x_2, x_3, x_4, x_5, x_6]$. A query $q$ overlaps $x_1$, $x_2$, $x_3$, and $x_4$.
Figure 2: Illustration of an AIT on a small dataset $X = \{x_1, x_2, ..., x_{11}\}$. The top part depicts the AIT on $X$, whereas the bottom part shows $x_{i}$ ($i \in [1,11]$).
Figure 3: Illustration of case 3: $q.l \leq c_{i} \leq q.r$ in Fig. 2 ($c_{i} = c_{root}$)
Figure 4: Distribution of intervals and rough z-curve (for Book dataset)
Figure 5: Pre-processing time [sec] and memory usage [GB] of AIT and AIT-V. $\diamond$, $\triangledown$, $+$, and $\triangle$ respectively show the result on Book, BTC, Renfe, and Taxi.
...and 5 more figures
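
The idea in Figure 1 can be sketched as follows. This is a minimal illustration rather than the paper's algorithm, and it assumes that every interval stored at the node contains the node's center $c_i$, that the list is sorted by left endpoint, and that the query ends before the center ($q.r < c_i$). Under those assumptions the intervals overlapping $q$ form a prefix of the list, so a binary search finds its length and each sample costs one random index. The names `count_prefix` and `sample_prefix` are hypothetical.

```python
import bisect
import random
from typing import List, Tuple

Interval = Tuple[float, float]  # (left, right), assumed closed


def count_prefix(lefts: List[float], q_right: float) -> int:
    """Number of intervals whose left endpoint is <= q_right.

    `lefts` must be sorted ascending. Because every interval at the node
    reaches the center, which lies to the right of the query, exactly these
    intervals overlap the query.
    """
    return bisect.bisect_right(lefts, q_right)


def sample_prefix(node_list: List[Interval], lefts: List[float],
                  q: Interval, s: int, rng: random.Random) -> List[Interval]:
    """Draw s uniform samples (with replacement) from the overlapping prefix."""
    k = count_prefix(lefts, q[1])
    if k == 0:
        return []
    # O(log n) for the binary search, then O(1) per sample.
    return [node_list[rng.randrange(k)] for _ in range(s)]


if __name__ == "__main__":
    # Six intervals that all contain the (hypothetical) center 8.5,
    # sorted by left endpoint, mirroring x_1..x_6 in Figure 1.
    node_list = [(1, 10), (2, 9), (3, 12), (4, 11), (7, 10), (8, 9)]
    lefts = [x[0] for x in node_list]  # would be precomputed once per node
    print(sample_prefix(node_list, lefts, (0, 5), 3, random.Random(1)))
```

With the query $(0, 5)$, the first four intervals overlap, matching the situation in Figure 1 where $q$ overlaps $x_1$ through $x_4$. The symmetric case, a query that starts after the center, would use a list sorted by right endpoint.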

Theorems & Definitions (13)

Remark 1
Example 1
Example 2
Theorem 1: Space complexity of AIT
Example 3
Theorem 2: Time complexity of Algorithm 1
Theorem 3: Correctness of Algorithm 1
Corollary 1: Time complexity of range counting on AIT
Definition 1: Virtual interval
Corollary 2
...and 3 more
