CLIMBER++: Pivot-Based Approximate Similarity Search over Big Data Series

Liang Zhang, Mohamed Y. Eltabakh, Elke A. Rundensteiner, Khalid Alnuaim

TL;DR

CLIMBER addresses the challenge of achieving high-accuracy approximate kNN search over terabyte-scale time-series data without sacrificing scalability. It introduces a loss-resistant dual pivot representation and a two-level index that jointly enable coarse grouping and fine-grained partitioning, along with two query strategies (CLIMBER-kNN and CLIMBER-kNN-Adaptive) that sustain recall above 0.8 in experiments. The approach relies on Piecewise Aggregate Approximation to reduce dimensionality, on Pivot Permutation Prefix-based signatures, and on two novel distance metrics (Overlap Distance and Weight Distance) to guide grouping and partitioning. Experimental results on real-world and benchmark datasets demonstrate superior recall and competitive query times compared with state-of-the-art disk-based and memory-based baselines, while maintaining scalability to multi-terabyte data. Reusability and practical deployment considerations are highlighted through a Spark-based prototype and detailed parameter studies. The work thus offers a practical framework for accurate, distributed similarity search in large-scale time-series analytics, with potential impact across sciences, IoT, finance, and web applications.
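As a rough illustration of the dimensionality-reduction step named above, the following is a minimal sketch of Piecewise Aggregate Approximation in its standard textbook form: the series is cut into equal-width segments and each segment is replaced by its mean. The function name `paa`, the parameter `num_segments`, and the requirement that the series length be an exact multiple of the segment count are assumptions made for this example; they are not taken from the paper or its Spark prototype.

```python
from typing import List, Sequence


def paa(series: Sequence[float], num_segments: int) -> List[float]:
    """Piecewise Aggregate Approximation (textbook form).

    Splits the series into num_segments equal-width segments and returns
    the mean of each segment. Assumption for this sketch: the series
    length is an exact multiple of num_segments.
    """
    n = len(series)
    if num_segments <= 0 or n == 0 or n % num_segments != 0:
        raise ValueError("series length must be a positive multiple of num_segments")
    width = n // num_segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]


# An 8-point series reduced to 4 segment means.
series = [1.0, 3.0, 2.0, 4.0, 10.0, 12.0, 7.0, 9.0]
reduced = paa(series, 4)
print(reduced)  # [2.0, 3.0, 11.0, 8.0]
```

The reduced vector keeps the coarse shape of the series while shrinking its length, which makes the later distance computations against pivots cheaper.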

Abstract

The generation and collection of big data series are becoming an integral part of many emerging applications in sciences, IoT, finance, and web applications, among several others. The terabyte scale of data series has motivated recent efforts to design fully distributed techniques for supporting operations such as approximate kNN similarity search, which is a building-block operation in most analytics services on data series. Unfortunately, these techniques are heavily geared towards achieving scalability at the cost of sacrificing the results' accuracy. State-of-the-art systems report accuracy below 10% and 40%, respectively, which is not practical for many real-world applications. In this paper, we investigate the root problems in these existing techniques that limit their ability to achieve a better trade-off between scalability and accuracy. Then, we propose a framework, called CLIMBER, that encompasses a novel feature extraction mechanism, indexing scheme, and query processing algorithms for supporting approximate similarity search in big data series. For CLIMBER, we propose a new loss-resistant dual representation composed of rank-sensitive and ranking-insensitive signatures capturing data series objects. Based on this representation, we devise a distributed two-level index structure supported by an efficient data partitioning scheme. Our similarity metrics, tailored for this dual representation, enable meaningful comparison and distance evaluation between the rank-sensitive and ranking-insensitive signatures. Finally, we propose two efficient query processing algorithms, CLIMBER-kNN and CLIMBER-kNN-Adaptive, for answering approximate kNN similarity queries. Our experimental study on real-world and benchmark datasets demonstrates that CLIMBER, unlike existing techniques, achieves result accuracy above 80% while retaining the desired scalability to terabytes of data.
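To make the idea of a dual representation more concrete, here is a minimal sketch of a generic pivot-permutation-prefix signature. It assumes that an object is described by its m nearest pivots, that the rank-sensitive signature is the ordered list of those pivot ids, and that the ranking-insensitive signature is the same ids with the order discarded. The function names, the use of Euclidean distance, the tie-breaking rule, and the toy pivots are all assumptions for illustration; the paper's actual pivot selection, signature construction, and its Overlap Distance and Weight Distance metrics are not reproduced here.

```python
import math
from typing import Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dual_signature(
    obj: Sequence[float],
    pivots: Sequence[Sequence[float]],
    m: int,
) -> Tuple[Tuple[int, ...], Tuple[int, ...]]:
    """Return (rank_sensitive, rank_insensitive) signatures (hypothetical helper).

    rank_sensitive:   ids of the m nearest pivots, nearest first.
    rank_insensitive: the same ids sorted by id, so the ranking is dropped.
    Ties in distance are broken by pivot id (an assumption of this sketch).
    """
    order = sorted(range(len(pivots)), key=lambda i: (euclidean(obj, pivots[i]), i))
    prefix = tuple(order[:m])
    return prefix, tuple(sorted(prefix))


# Four toy pivots at the corners of a square, prefix length m = 3.
pivots = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]]
sig_a = dual_signature([1.0, 2.0], pivots, 3)
sig_b = dual_signature([2.0, 1.0], pivots, 3)
print(sig_a)  # ((0, 2, 1), (0, 1, 2))
print(sig_b)  # ((0, 1, 2), (0, 1, 2))
```

In this toy case the two nearby objects disagree on the ordered prefix but share the same unordered pivot set, which is the kind of situation where having both views of a signature can be useful for grouping.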

Paper Structure

This paper contains 18 sections, 6 equations, 12 figures, 1 table, 3 algorithms.

Figures (12)

  • Figure 1: Examples of SAX and iSAX representations ($w=4, c=8$).
  • Figure 2: Recursive Voronoi partitioning. Fragments are labeled by the corresponding pivot or the sequences of pivots.
  • Figure 3: PAA segmentation of a data series.
  • Figure 4: $p^4$ signatures where the number of prefix pivots is $m=3$.
  • Figure 5: CLIMBER-INX index skeleton: groups (the $1^{st}$ level) and trie-based partitions (the $2^{nd}$ level).
  • ...and 7 more figures

Theorems & Definitions (15)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Definition 8
  • Definition 9
  • Definition 10
  • ...and 5 more