Fast and Exact Similarity Search in less than a Blink of an Eye

Patrick Schäfer, Jakob Brand, Ulf Leser, Botao Peng, Themis Palpanas

TL;DR

The paper tackles the challenge of exact similarity search on large data-series collections, where SAX-based methods falter for high-frequency signals. It introduces SOFA, a fast, exact index that combines Symbolic Fourier Approximation (SFA), a learned symbolic representation, with a MESSI-inspired tree index and SIMD-accelerated GEMINI-based search. Through a large-scale benchmark of 17 diverse datasets totaling 1 billion series, SOFA demonstrates substantial speedups over state-of-the-art methods, including up to 38x faster queries on high-frequency data and consistent improvements across 1-NN and k-NN tasks. The work highlights the effectiveness of data-adaptive frequency-domain quantization and vectorized distance computations for scalable, exact similarity search in time-series and related data-series domains.

Abstract

Similarity search is a fundamental operation for analyzing data series (DS), which are ordered sequences of real values. To enhance efficiency, summarization techniques are employed that reduce the dimensionality of DS. SAX-based approaches are the state of the art for exact similarity queries, but their performance degrades for high-frequency signals, such as noisy data. In this work, we present the SymbOlic Fourier Approximation index (SOFA), which implements fast, exact similarity queries. SOFA is based on two building blocks: a tree index (inspired by MESSI) and a learned symbolic summarization called Symbolic Fourier Approximation (SFA), which is based on the Fourier transform and uses a data-adaptive quantization of the frequency domain. To better capture relevant information in high-frequency signals, SFA selects the Fourier coefficients with the highest variance, resulting in a larger value range and thus larger quantization bins. The tree index employed by SOFA uses the GEMINI approach to answer exact similarity search queries with lower-bounding distance measures and an efficient SIMD implementation. We further propose a novel benchmark comprising 17 diverse datasets, encompassing 1 billion DS. Our experimental results demonstrate that SOFA outperforms existing methods on exact similarity queries: it is up to 10 times faster than a parallel sequential scan, 3-4 times faster than FAISS, and 2 times faster on average than MESSI. For high-frequency datasets, we observe a remarkable 38-fold performance improvement.
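
The SFA summarization described in the abstract (Fourier transform, selection by highest variance, data-adaptive quantization) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`fourier_values`, `learn_sfa`, `sfa_word`), the naive O(n^2) DFT, the choice to rank real and imaginary parts individually, and the use of equi-depth bins are all assumptions made for illustration.

```python
import bisect
import cmath
import math
import random
import statistics


def fourier_values(series, n_coeffs):
    """Real and imaginary parts of DFT coefficients k = 1..n_coeffs (DC term skipped)."""
    n = len(series)
    values = []
    for k in range(1, n_coeffs + 1):
        c = sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series))
        values += [c.real, c.imag]
    return values


def learn_sfa(sample, n_coeffs, word_len, alphabet_size):
    """Pick the word_len highest-variance Fourier values and learn their bins."""
    columns = list(zip(*(fourier_values(s, n_coeffs) for s in sample)))
    by_variance = sorted(range(len(columns)),
                         key=lambda j: statistics.pvariance(columns[j]),
                         reverse=True)
    keep = sorted(by_variance[:word_len])
    bins = []
    for j in keep:
        col = sorted(columns[j])
        # Equi-depth (assumed here): each bin gets roughly the same number of sample values.
        bins.append([col[i * len(col) // alphabet_size] for i in range(1, alphabet_size)])
    return keep, bins


def sfa_word(series, keep, bins, n_coeffs):
    """Quantize the kept Fourier values into symbols 0..alphabet_size-1."""
    values = fourier_values(series, n_coeffs)
    return [bisect.bisect_right(b, values[j]) for j, b in zip(keep, bins)]


# Toy sample (hypothetical data): sinusoids at frequency 5 with random amplitude and phase.
random.seed(0)
n = 32
sample = []
for _ in range(50):
    amp, phase = random.uniform(0.5, 2.0), random.uniform(0.0, 2.0 * math.pi)
    sample.append([amp * math.sin(2.0 * math.pi * 5 * t / n + phase) for t in range(n)])

keep, bins = learn_sfa(sample, n_coeffs=8, word_len=2, alphabet_size=4)
word = sfa_word(sample[0], keep, bins, n_coeffs=8)
```

In this toy sample all the energy sits at frequency 5, so the variance ranking keeps the real and imaginary parts of that coefficient. A mean-based summary such as PAA would average most of that signal away, which is the failure mode the abstract attributes to SAX on high-frequency data.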

Paper Structure

This paper contains 35 sections, 8 equations, 15 figures, 5 tables, 3 algorithms.

Figures (15)

  • Figure 1: TOP: PAA (orange) fails to approximate a high-frequency data series (gray), resulting in a flat line, whereas FFT (blue) closely mimics the data; both use $8$ values. BOTTOM: The distribution of values for each dataset. SAX is built upon the assumption that the data follows a standard normal distribution $N(0,1)$ (dotted green). This is neither the case for the PAA approximations nor for the raw data.
  • Figure 2: The figure illustrates the summarization of a DS using SAX (top) and SFA (bottom), both employing an 8-symbol alphabet ('a' to 'h') with $4$ to $12$ values. SAX generates a staircase-like envelope around the raw signal (shown in orange). In contrast, SFA constructs an envelope around the Fourier transform, closely approximating the original signal.
  • Figure 3: The figure illustrates the two summarization techniques iSAX (left) and SFA (right). SAX aggregates the DS over intervals using PAA and quantizes the mean values into the symbols BCED, using bins derived from equi-depth binning of the Gaussian distribution. SFA transforms the DS into the frequency domain and separately quantizes the real and imaginary values into symbols using learned bins.
  • Figure 4: A comparison of the iSAX (left) and SFA (right) Euclidean lower-bounding distance (LBD). iSAX uses the same fixed breakpoints for each PAA value. SFA uses learned breakpoints for each Fourier value (real or imaginary).
  • Figure 5: Workflow of SOFA for exact similarity search. First, a fraction of the DS is sampled and Fourier transformed. Bins are learned, and the best Fourier coefficients selected. Using the learned transformation, all DS are transformed to create the index. To answer a query, the query DS is SFA transformed, and the MESSI-based index is used to retrieve the exact 1-NN using the SFA lower bound.
  • ...and 10 more figures
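
The workflow in Figure 5 (sample, learn bins, transform all DS, answer the query with a lower bound) follows the GEMINI pattern, which can be sketched as below. This is a sketch under stated assumptions, not SOFA itself: a flat, sorted scan stands in for the MESSI-based tree index and the SIMD kernels, the first `n_coeffs` Fourier coefficients are kept instead of the highest-variance ones, and all function names are hypothetical. The lower bound relies on Parseval's theorem for the unnormalized DFT of real-valued series, which is why `n_coeffs` must stay below half the series length.

```python
import bisect
import cmath
import math
import random


def fourier_values(series, n_coeffs):
    """Real/imag parts of DFT coefficients k = 1..n_coeffs (needs n_coeffs < len(series) / 2)."""
    n = len(series)
    out = []
    for k in range(1, n_coeffs + 1):
        c = sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series))
        out += [c.real, c.imag]
    return out


def learn_breakpoints(values, alphabet_size):
    """Equi-depth breakpoints per Fourier value, learned from a sample (assumed binning)."""
    bins = []
    for col in zip(*values):
        col = sorted(col)
        bins.append([col[i * len(col) // alphabet_size] for i in range(1, alphabet_size)])
    return bins


def to_word(values, bins):
    """One symbol per Fourier value: the index of the bin the value falls into."""
    return [bisect.bisect_right(b, v) for v, b in zip(values, bins)]


def lower_bound_sq(query_values, word, bins, n):
    """Squared lower bound on the Euclidean distance to any series with this word."""
    total = 0.0
    for q, s, b in zip(query_values, word, bins):
        lo = b[s - 1] if s > 0 else -math.inf
        hi = b[s] if s < len(b) else math.inf
        gap = lo - q if q < lo else q - hi if q > hi else 0.0
        total += gap * gap
    # Factor 2: coefficients k and n-k are conjugates; 1/n: Parseval for the unnormalized DFT.
    return 2.0 * total / n


def exact_1nn(query, data, words, bins, n_coeffs):
    """Visit candidates by increasing lower bound; stop once it cannot beat the best distance."""
    n = len(query)
    qv = fourier_values(query, n_coeffs)
    order = sorted((lower_bound_sq(qv, w, bins, n), i) for i, w in enumerate(words))
    best, best_i, checked = math.inf, -1, 0
    for lb, i in order:
        if lb >= best:
            break  # every remaining candidate has an even larger lower bound
        d = sum((a - b) ** 2 for a, b in zip(query, data[i]))
        checked += 1
        if d < best:
            best, best_i = d, i
    return best_i, math.sqrt(best), checked


def random_walk(length):
    x, walk = 0.0, []
    for _ in range(length):
        x += random.gauss(0.0, 1.0)
        walk.append(x)
    return walk


# Hypothetical toy collection: 200 random walks of length 32.
random.seed(7)
n, n_coeffs = 32, 4
data = [random_walk(n) for _ in range(200)]
bins = learn_breakpoints([fourier_values(s, n_coeffs) for s in data[::4]], alphabet_size=8)
words = [to_word(fourier_values(s, n_coeffs), bins) for s in data]
query = random_walk(n)
nn_index, nn_dist, checked = exact_1nn(query, data, words, bins, n_coeffs)
```

Because the bound never exceeds the true distance, stopping at the first candidate whose bound reaches the best distance found so far cannot discard the true nearest neighbor. The `checked` counter reports how many full distance computations were needed, which is the quantity a tighter summarization is meant to reduce.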

Theorems & Definitions (4)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4