Fast and Exact Similarity Search in less than a Blink of an Eye

Patrick Schäfer; Jakob Brand; Ulf Leser; Botao Peng; Themis Palpanas

TL;DR

The paper tackles the challenge of exact similarity search on large data-series collections, where SAX-based methods falter for high-frequency signals. It introduces SOFA, a fast, exact index that combines Symbolic Fourier Approximation (SFA), a learned symbolic representation, with a MESSI-inspired tree index and SIMD-accelerated GEMINI-based search. Through a large-scale benchmark of 17 diverse datasets totaling 1 billion series, SOFA demonstrates substantial speedups over state-of-the-art methods, including up to 38x faster queries on high-frequency data and consistent improvements across 1-NN and k-NN tasks. The work highlights the effectiveness of data-adaptive frequency-domain quantization and vectorized distance computations for scalable, exact similarity search in time-series and related data-series domains.

Abstract

Similarity search is a fundamental operation for analyzing data series (DS), which are ordered sequences of real values. To enhance efficiency, summarization techniques are employed that reduce the dimensionality of DS. SAX-based approaches are the state of the art for exact similarity queries, but their performance degrades for high-frequency signals, such as noisy data. In this work, we present the SymbOlic Fourier Approximation index (SOFA), which implements fast, exact similarity queries. SOFA is based on two building blocks: a tree index (inspired by MESSI) and a learned symbolic summarization called Symbolic Fourier Approximation (SFA), which is based on the Fourier transform and utilizes a data-adaptive quantization of the frequency domain. To better capture relevant information in high-frequency signals, SFA selects the Fourier coefficients with the highest variance, resulting in a larger value range and thus larger quantization bins. The tree index employed by SOFA uses the GEMINI approach to answer exact similarity search queries with lower-bounding distance measures, together with an efficient SIMD implementation. We further propose a novel benchmark comprising 17 diverse datasets, encompassing 1 billion DS. Our experimental results demonstrate that SOFA outperforms existing methods on exact similarity queries: it is up to 10 times faster than a parallel sequential scan, 3-4 times faster than FAISS, and 2 times faster on average than MESSI. For high-frequency datasets, we observe a remarkable 38-fold performance improvement.
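The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`dft_features`, `learn_sfa`, `to_word`, `lower_bound`, `exact_1nn`) are hypothetical, equal-width binning is an assumption made for simplicity, and the tree index and SIMD kernels are replaced by a plain sorted scan. The sketch only shows why the search stays exact: by Parseval's theorem, a distance computed on a subset of suitably scaled Fourier coefficients never exceeds the Euclidean distance on the raw series, and replacing each indexed value with its quantization bin can only shrink that distance further.

```python
import bisect
import cmath
import math
import random


def dft_features(series):
    """Real/imaginary DFT parts, scaled so that squared differences over any
    subset of features never exceed the squared Euclidean distance (Parseval)."""
    n = len(series)
    scale = math.sqrt(2.0 / n)  # factor 2: coefficients k and n-k are conjugates
    feats = []
    for k in range(1, (n + 1) // 2):  # skip the DC term and the mirrored half
        c = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series))
        feats += [scale * c.real, scale * c.imag]
    return feats


def learn_sfa(all_feats, word_length, alphabet_size):
    """Select the highest-variance features and learn per-feature breakpoints.
    Equal-width bins are an assumption of this sketch."""
    def column(d):
        return [f[d] for f in all_feats]

    def variance(d):
        col = column(d)
        mean = sum(col) / len(col)
        return sum((v - mean) ** 2 for v in col) / len(col)

    selected = sorted(range(len(all_feats[0])), key=variance, reverse=True)[:word_length]
    breakpoints = []
    for d in selected:
        col = column(d)
        lo, hi = min(col), max(col)
        width = (hi - lo) / alphabet_size
        breakpoints.append([lo + width * i for i in range(1, alphabet_size)])
    return selected, breakpoints


def to_word(feats, model):
    """Map each selected feature to the index of the bin it falls into."""
    selected, breakpoints = model
    return [bisect.bisect_right(bp, feats[d]) for d, bp in zip(selected, breakpoints)]


def lower_bound(query_feats, word, model):
    """Distance from the real-valued query features to the bins of a word.
    The outermost bins are treated as unbounded."""
    selected, breakpoints = model
    total = 0.0
    for d, bp, symbol in zip(selected, breakpoints, word):
        lo = bp[symbol - 1] if symbol > 0 else -math.inf
        hi = bp[symbol] if symbol < len(bp) else math.inf
        q = query_feats[d]
        gap = lo - q if q < lo else q - hi if q > hi else 0.0
        total += gap * gap
    return math.sqrt(total)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_1nn(query, collection, words, model):
    """GEMINI-style filter and refine: visit candidates by ascending lower
    bound and stop once the bound reaches the best true distance so far."""
    qf = dft_features(query)
    order = sorted((lower_bound(qf, w, model), i) for i, w in enumerate(words))
    best_dist, best_idx = math.inf, -1
    for lb, i in order:
        if lb >= best_dist:
            break  # no remaining candidate can be closer
        dist = euclidean(query, collection[i])
        if dist < best_dist:
            best_dist, best_idx = dist, i
    return best_idx, best_dist


# Toy usage on synthetic data (not one of the paper's datasets).
random.seed(0)
collection = [[random.gauss(0, 1) for _ in range(32)] for _ in range(200)]
feats = [dft_features(s) for s in collection]
model = learn_sfa(feats, word_length=8, alphabet_size=8)
words = [to_word(f, model) for f in feats]
query = [random.gauss(0, 1) for _ in range(32)]
idx, dist = exact_1nn(query, collection, words, model)
```

In this sketch the query keeps its real-valued coefficients and only the indexed series are quantized, so the bound compares a point with an interval. Selecting coefficients by variance spreads the indexed values over a wider range, which tends to make the bound tighter and lets the refinement loop stop earlier.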
