MS-Index: Fast Top-k Subsequence Search for Multivariate Time Series under Euclidean Distance

Jens E. d'Hondt, Teun Kortekaas, Odysseas Papapetrou, Themis Palpanas

TL;DR

This work tackles exact k-nearest neighbor subsequence search on multivariate time series with ad-hoc channel selection. It introduces MS-Index, which combines per-channel DFT-based summarization with an R-tree index and the MASS convolution-based exact distance computation to prune the search space aggressively while guaranteeing correctness. The authors propose optimizations for tighter bounds and more efficient indexing, and demonstrate up to two orders of magnitude speedups over state-of-the-art baselines across 34 datasets, including long, high-channel-count series. The approach supports fixed-length subsequences and adversarial channel choices at query time, making it robust and practical for real-world multivariate sensor data analysis.
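The pruning idea behind per-channel DFT summarization can be illustrated with a minimal sketch. It assumes raw (unnormalized) subsequences and that the multi-channel distance is the square root of the summed per-channel squared Euclidean distances; the function names are hypothetical and not taken from the paper. By Parseval's theorem, keeping only the first few DFT coefficients of each channel can only underestimate the true distance, so the summaries of whichever channels a query selects combine into a valid lower bound.

```python
import numpy as np

def dft_summary(x, n_coeffs):
    """Return the first n_coeffs DFT coefficients of one channel (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # Restricting to n_coeffs <= len(x) // 2 keeps the Nyquist term out of the summary,
    # so every kept coefficient except index 0 has a distinct conjugate twin.
    if not 1 <= n_coeffs <= len(x) // 2:
        raise ValueError("n_coeffs must be between 1 and len(x) // 2")
    return np.fft.rfft(x)[:n_coeffs]

def channel_lower_bound_sq(summary_q, summary_t, length):
    """Squared lower bound on the Euclidean distance of one channel."""
    diff = summary_q - summary_t
    # Parseval for NumPy's unnormalized FFT: sum |x|^2 = (1 / n) * sum |X|^2.
    # Coefficients 1.. appear twice in the full spectrum of a real signal.
    energy = np.abs(diff[0]) ** 2 + 2.0 * np.sum(np.abs(diff[1:]) ** 2)
    return energy / length

def mts_lower_bound(Q, T, channels, n_coeffs):
    """Lower bound on the distance restricted to the channels chosen at query time."""
    length = Q.shape[1]
    total = sum(
        channel_lower_bound_sq(
            dft_summary(Q[c], n_coeffs), dft_summary(T[c], n_coeffs), length
        )
        for c in channels
    )
    return float(np.sqrt(total))

def mts_distance(Q, T, channels):
    """Exact distance over the selected channels (assumed definition, see lead-in)."""
    return float(np.sqrt(sum(np.sum((Q[c] - T[c]) ** 2) for c in channels)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((4, 64))   # 4 channels, subsequence length 64
    T = rng.standard_normal((4, 64))
    chans = [0, 2]                     # ad-hoc channel selection
    print(mts_lower_bound(Q, T, chans, n_coeffs=4), "<=", mts_distance(Q, T, chans))
```

Because each per-channel bound is computed independently, summaries can be built once at index time and combined for any channel subset at query time.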

Abstract

Modern applications frequently collect and analyze temporal data in the form of multivariate time series (MTS) -- time series that contain multiple channels. A common task in this context is subsequence search, which involves identifying all MTS that contain subsequences highly similar to a query time series. In practical scenarios, not all channels of an MTS are relevant to every query. For instance, airplane sensors may gather data on a plethora of components and subsystems, but only a few of these are relevant to a specific query, such as identifying the cause of a malfunctioning landing gear, or a specific flight maneuver. Consequently, the relevant query channels are often specified at query time. In this work, we introduce the Multivariate Subsequence Index (MS-Index), a novel algorithm for nearest neighbor MTS subsequence search under Euclidean distance that supports ad-hoc selection of query channels. The algorithm is exact and demonstrates query performance that scales sublinearly with the number of query channels. We examine the properties of MS-Index with a thorough experimental evaluation over 34 datasets, and show that it outperforms the state of the art by one to two orders of magnitude for both raw and normalized subsequences.

Paper Structure

This paper contains 40 sections, 1 theorem, 6 equations, 7 figures, 6 tables, 1 algorithm.

Key Result

Lemma 1

Any indexed subsequence $\bm{T}$ of length $|\bm{Q}|$ that is part of the $k$-NN of $\bm{Q}$ is guaranteed to be in the set of subsequences returned by MS-Index.
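A guarantee of this kind typically rests on a filter-and-refine argument: if every candidate is assigned a lower bound that never exceeds its true distance, discarding candidates whose bound is above the current k-th best exact distance cannot remove a true neighbor. The sketch below illustrates that argument on a single univariate series. It is an assumption-laden stand-in, not the paper's algorithm: a sorted linear scan replaces the R-tree, a direct norm replaces MASS, and `topk_filter_refine` is a hypothetical name.

```python
import heapq
import numpy as np

def topk_filter_refine(series, query, k, n_coeffs=4):
    """Exact top-k subsequence search using truncated-DFT lower bounds for pruning."""
    series = np.asarray(series, dtype=float)
    query = np.asarray(query, dtype=float)
    m = len(query)
    windows = np.lib.stride_tricks.sliding_window_view(series, m)

    # Summaries: first n_coeffs DFT coefficients of every window and of the query.
    W = np.fft.fft(windows, axis=1)[:, :n_coeffs]
    q = np.fft.fft(query)[:n_coeffs]
    # Parseval: dropping coefficients can only shrink the sum, so lb <= true distance.
    lb = np.sqrt(np.sum(np.abs(W - q) ** 2, axis=1) / m)

    best = []      # max-heap of the k best so far, stored as (-distance, start)
    n_exact = 0    # number of exact distance computations performed
    for i in np.argsort(lb):
        # All remaining candidates have bounds >= lb[i]; once lb[i] exceeds the
        # k-th best exact distance, none of them can enter the result.
        if len(best) == k and lb[i] > -best[0][0]:
            break
        d = float(np.linalg.norm(windows[i] - query))
        n_exact += 1
        if len(best) < k:
            heapq.heappush(best, (-d, int(i)))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, int(i)))

    return sorted((-nd, i) for nd, i in best), n_exact

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s = rng.standard_normal(2000).cumsum()
    q = s[300:364] + 0.1 * rng.standard_normal(64)
    result, n_exact = topk_filter_refine(s, q, k=3)
    print(result, "exact computations:", n_exact)
```

The returned distances match a brute-force scan, while the number of exact computations depends on how tight the bounds are for the data at hand.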

Figures (7)

  • Figure 1: Example query and 1NN for MTS of synthetic airplane data. Altitude and landing gear are the query channels. The highlighted boxes (red) are the considered subsequences.
  • Figure 2: The price of a stock over time (blue), and its reconstruction through its first three DFT coefficients (orange).
  • Figure 3: Summarization and indexing of MTS subsequences. We represent the indexed subsequences of three MTS in the feature space with blue, green, and orange nodes. Time-neighbouring subsequences are connected with lines.
  • Figure 4: (a) Cumulative and absolute % of total distance and energy across DFT coefficients on the temperature channel of a weather dataset; (b) Query execution in MS-Index. Numbers in tree nodes indicate the lower bound distance of the respective MBR to the query.
  • Figure 5: Different partitioning strategies for a 2-dimensional feature space; the STR algorithm (left) and the proposed weighted partitioning (right), that leads to smaller MBRs and to tighter bounds.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Lemma 1