LeaFi: Data Series Indexes on Steroids with Learned Filters
Qitong Wang, Ioana Ileana, Themis Palpanas
TL;DR
LeaFi introduces learned filters to improve pruning in tree-based data series indexes, addressing the inefficiency of traditional lower-bound pruning. By selecting a subset of leaf nodes to equip with filters, generating global and local training data, and calibrating predictions with conformal auto-tuners, LeaFi improves the pruning ratio by up to 20x and search time by up to 32x while maintaining a target recall of 99%. The framework is instantiated on two backbones (DSTree and MESSI) and validated on five datasets, showing consistent gains over state-of-the-art early-stopping and reordering methods. This work provides a practical pathway to accelerate large-scale data series analytics and lays the groundwork for future extensions to updates, other index types, and more diverse learned-model choices.
Abstract
The ever-growing collections of data series create a pressing need for efficient similarity search, which serves as the backbone for various analytics pipelines. Recent studies have shown that tree-based series indexes excel in many scenarios. However, we observe a significant waste of effort during search, due to suboptimal pruning. To address this issue, we introduce LeaFi, a novel framework that uses machine learning models to boost the pruning effectiveness of tree-based data series indexes. These models act as learned filters: they predict tight node-wise distance lower bounds, which are then used to make pruning decisions. We describe the LeaFi-enhanced index building algorithm, which selects leaf nodes, generates training data, and trains and inserts machine learning models into the selected nodes, as well as the LeaFi-enhanced search algorithm, which calibrates learned filters at query time to meet the user-defined quality target of each query. Our experimental evaluation, using two different tree-based series indexes and five diverse datasets, demonstrates the advantages of the proposed approach. LeaFi-enhanced data series indexes improve pruning ratio by up to 20x and search time by up to 32x, while maintaining a target recall of 99%.
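
To make the pruning idea in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a flat list of leaf nodes rather than a real DSTree or MESSI index, uses a simple min/max envelope as the summarization lower bound, and replaces the trained model with a stand-in that overestimates the true node-wise nearest-neighbor distance by a fixed bias. The calibration step is a plain one-sided split-conformal quantile, used here as a simplified substitute for LeaFi's conformal auto-tuners. All names (envelope_lower_bound, stand_in_filter, conformal_offset, search) are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def envelope_lower_bound(series):
    """Classic-style lower bound: distance from the query to the leaf's min/max envelope."""
    lo = [min(col) for col in zip(*series)]
    hi = [max(col) for col in zip(*series)]

    def lower_bound(query):
        return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                             for x, l, h in zip(query, lo, hi)))
    return lower_bound


def stand_in_filter(series, bias):
    """Stand-in for a trained model (assumption): true node-wise NN distance plus a bias.
    A real learned filter would predict this distance without scanning the leaf."""
    return lambda query: min(euclidean(query, s) for s in series) + bias


def conformal_offset(predicted, actual, coverage):
    """One-sided split-conformal offset computed on held-out calibration pairs, so that
    predicted - offset <= actual holds for roughly `coverage` of exchangeable cases."""
    errors = sorted(p - a for p, a in zip(predicted, actual))
    n = len(errors)
    rank = min(n, math.ceil((n + 1) * coverage))  # 1-based conformal rank
    return max(0.0, errors[rank - 1])


def search(query, leaves, offset):
    """1-NN search over leaves; returns (best distance, number of leaves scanned)."""
    best, scanned = math.inf, 0
    bounds = [(leaf["lower_bound"](query), leaf) for leaf in leaves]
    for lb, leaf in sorted(bounds, key=lambda pair: pair[0]):
        if lb > best:
            continue  # pruned by the summarization lower bound
        flt = leaf.get("filter")
        if flt is not None and flt(query) - offset > best:
            continue  # pruned by the calibrated learned filter
        scanned += 1
        for s in leaf["series"]:
            best = min(best, euclidean(query, s))
    return best, scanned


# Toy data: only the second leaf is selected to carry a filter.
near = [(0.0, 0.0), (2.0, 2.0)]
far = [(0.0, 10.0), (10.0, 0.0)]
leaves = [
    {"lower_bound": envelope_lower_bound(near), "series": near},
    {"lower_bound": envelope_lower_bound(far), "series": far,
     "filter": stand_in_filter(far, bias=0.5)},
]

# Calibration pairs (predicted, actual) are made up for illustration.
offset = conformal_offset([1.5, 2.0, 3.5, 5.0], [1.0, 2.0, 3.0, 4.0], coverage=0.75)
best, scanned = search((1.0, 1.0), leaves, offset)
```

In this toy setup the envelope of the second leaf contains the query, so its summarization lower bound is zero and cannot prune it, while the calibrated filter prediction still exceeds the best-so-far distance and lets the search skip that leaf. A larger offset makes the filter more conservative: fewer leaves are pruned, but a true nearest neighbor is less likely to be missed, which is the trade-off that a recall target controls.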
