LeaFi: Data Series Indexes on Steroids with Learned Filters

Qitong Wang, Ioana Ileana, Themis Palpanas

TL;DR

LeaFi introduces learned filters to improve pruning in tree-based data series indexes, addressing the wasted search effort caused by loose summarization-based lower bounds. By selecting a subset of leaf nodes, generating global and local training data, and calibrating predictions with conformal auto-tuners, LeaFi improves the pruning ratio by up to 20x and search time by up to 32x while maintaining a target recall of 99%. The framework is instantiated on two backbones (DSTree and MESSI) and validated on five datasets, showing consistent gains over state-of-the-art early-stopping and reordering methods. This work provides a practical pathway to accelerate large-scale data-series analytics and lays groundwork for future extensions to updates, other index types, and more diverse learned-model choices.

Abstract

The ever-growing collections of data series create a pressing need for efficient similarity search, which serves as the backbone for various analytics pipelines. Recent studies have shown that tree-based series indexes excel in many scenarios. However, we observe a significant waste of effort during search, due to suboptimal pruning. To address this issue, we introduce LeaFi, a novel framework that uses machine learning models to boost pruning effectiveness of tree-based data series indexes. These models act as learned filters, which predict tight node-wise distance lower bounds that are used to make pruning decisions, thus, improving pruning effectiveness. We describe the LeaFi-enhanced index building algorithm, which selects leaf nodes and generates training data to insert and train machine learning models, as well as the LeaFi-enhanced search algorithm, which calibrates learned filters at query time to support the user-defined quality target of each query. Our experimental evaluation, using two different tree-based series indexes and five diverse datasets, demonstrates the advantages of the proposed approach. LeaFi-enhanced data-series indexes improve pruning ratio by up to 20x and search time by up to 32x, while maintaining a target recall of 99%.
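The pruning idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the names `conformal_offset` and `search`, the flat list of leaves, and the constant-valued filters are all hypothetical stand-ins, and the calibration step is a simple quantile of overestimation errors rather than the conformal auto-tuners used by LeaFi. The sketch only shows how a predicted node-wise distance, shifted down by a calibrated offset, can be compared against the best-so-far distance to skip a leaf.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def conformal_offset(predicted, actual, recall_target):
    """Assumed calibration rule (not from the paper): choose an offset so that
    predicted - offset <= actual for roughly recall_target of the calibration
    queries, using a quantile of the overestimation errors."""
    errors = sorted(p - a for p, a in zip(predicted, actual))
    k = min(len(errors) - 1, math.ceil(recall_target * len(errors)) - 1)
    return max(0.0, errors[max(k, 0)])

def search(query, leaves, offset):
    """Scan leaves in order; each leaf is (list_of_series, filter_fn or None).
    A leaf is skipped when its calibrated predicted distance exceeds the
    best-so-far distance. Returns (best distance, leaves visited)."""
    best = math.inf
    visited = 0
    for series, filter_fn in leaves:
        if filter_fn is not None and filter_fn(query) - offset > best:
            continue  # pruned by the (hypothetical) learned filter
        visited += 1
        for s in series:
            best = min(best, euclidean(query, s))
    return best, visited

# Hypothetical filters: constants standing in for trained models.
leaves = [
    ([[1.0, 0.0], [3.0, 0.0]], lambda q: 1.0),
    ([[5.0, 0.0], [6.0, 0.0]], lambda q: 5.0),
]
offset = conformal_offset([1.0, 2.0, 3.0, 4.0], [0.8, 2.1, 2.5, 4.0], 0.99)
with_filters = search([0.0, 0.0], leaves, offset)
without_filters = search([0.0, 0.0], [(s, None) for s, _ in leaves], offset)
```

In this toy setup the second leaf is skipped when filters are present, while the answer is unchanged. A higher recall target yields a larger offset, which makes pruning more conservative.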

Paper Structure

This paper contains 30 sections, 4 equations, 14 figures, 4 tables, 4 algorithms.

Figures (14)

  • Figure 1: A waste of data series search time, caused by insufficient pruning, is observed across various datasets. Employing the LeaFi predictions for the optimal lower bounds, instead of the current summarization-based lower bounds, improves the pruning ratios significantly.
  • Figure 1: Filter inference time ($\mu$s) and the node size threshold $th$ derived for Deep (length 96).
  • Figure 2: An illustration of a LeaFi-enhanced tree-based index structure, along with an example of its search procedure.
  • Figure 3: The optimal search time that can possibly be achieved by early-stopping approaches [DBLP:journals/pvldb/EchihabiZPB19, DBLP:conf/sigmod/GogolouTEBP20, DBLP:journals/vldb/EchihabiTGBP23], leaf node reordering approaches [SC:kang2021case], and LeaFi for the DSTree index on the Astro dataset. The axes are the same as in Figure \ref{fig:intro-motiv-node-nn} (the x-axis in Figure \ref{fig:liter-optimal-leafi} has a different scale).
  • Figure 3: Indexing time breakdown (minutes) for LeaFi-enhanced DSTree and MESSI on Seismic 100M.
  • ...and 9 more figures