Distribution Agnostic Symbolic Representations for Time Series Dimensionality Reduction and Online Anomaly Detection

Konstantinos Bountrogiannis, George Tzagkarakis, Panagiotis Tsakalides

TL;DR

Time-series data are often analyzed in a lower-dimensional symbolic space, where distance-preserving properties are crucial. This paper presents two non-parametric, data-driven SAX variants, pSAX and cSAX, to overcome the Gaussian and equiprobable-symbol assumptions of conventional SAX, while preserving lower-bounding distances. It provides information-theoretic insights into SAX, identifies a variance-reduction phenomenon in piecewise aggregation, and offers practical fixes, alongside new distance notions for mean-squared-error optimization. Empirically, pSAX and cSAX outperform SAX and aSAX on real datasets in tasks like anomaly detection and discord discovery, with cSAX enabling online dynamic clustering and automatic alphabet-size selection. Overall, the work enables distribution-agnostic, efficient time-series mining and online anomaly detection with tighter distance bounds and data-driven discretization.

Abstract

Due to the importance of lower-bounding distances and the attractiveness of symbolic representations, the family of symbolic aggregate approximations (SAX) has been used extensively for encoding time series data. However, typical SAX-based methods rely on two restrictive assumptions: a Gaussian data distribution and equiprobable symbols. This paper proposes two novel data-driven SAX-based symbolic representations, distinguished by their discretization steps. The first representation, intended for general data compaction and indexing scenarios, combines kernel density estimation with Lloyd-Max quantization to minimize the information loss and mean squared error in the discretization step. The second representation, intended for high-level mining tasks, employs the Mean-Shift clustering method and is shown to enhance anomaly detection in the lower-dimensional space. In addition, we verify on a theoretical basis a previously observed phenomenon, intrinsic to the process, whereby the intermediate piecewise aggregate approximation has a lower-than-expected variance. This phenomenon causes additional information loss but can be avoided with a simple modification. The proposed representations possess all the attractive properties of the conventional SAX method. Furthermore, experimental evaluation on real-world datasets demonstrates their superiority over traditional SAX and an alternative data-driven SAX variant.
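
To make the Lloyd-Max step mentioned in the abstract more concrete, the following is a minimal sketch of the Lloyd-Max iteration, not the authors' implementation. It makes a simplifying assumption: the iteration runs directly on the empirical samples rather than on a kernel density estimate, so the KDE stage is skipped entirely. The function names `lloyd_max` and `mse`, the quantile-based initialization, and the iteration cap are all choices made for this illustration.

```python
from bisect import bisect_right
from statistics import fmean


def lloyd_max(samples, num_levels, iterations=50):
    """Lloyd-Max iteration on the empirical distribution of `samples`.

    Assumption of this sketch: no density estimate is used; the centroid
    step averages the raw samples that fall in each cell.
    """
    data = sorted(samples)
    n = len(data)
    # Initialize levels at evenly spaced sample quantiles (illustrative choice).
    levels = [data[(2 * k + 1) * n // (2 * num_levels)] for k in range(num_levels)]
    for _ in range(iterations):
        # Nearest-neighbor condition: cutlines are midpoints of adjacent levels.
        cutlines = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        # Centroid condition: each level moves to the mean of its cell.
        cells = [[] for _ in levels]
        for x in data:
            cells[bisect_right(cutlines, x)].append(x)
        # An empty cell keeps its previous level.
        new_levels = [fmean(c) if c else lv for c, lv in zip(cells, levels)]
        if new_levels == levels:
            break
        levels = new_levels
    cutlines = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
    return levels, cutlines


def mse(samples, levels, cutlines):
    """Mean squared error between samples and their reconstruction levels."""
    return fmean((x - levels[bisect_right(cutlines, x)]) ** 2 for x in samples)


# Toy bimodal data: two tight groups, quantized with two levels.
samples = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
levels, cutlines = lloyd_max(samples, num_levels=2)
print(levels, cutlines, mse(samples, levels, cutlines))
```

On data like this, the cutline settles between the two groups instead of at a Gaussian quantile, which is the kind of data-driven behavior the abstract contrasts with the equiprobable Gaussian assumption.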

Paper Structure

This paper contains 25 sections, 2 theorems, 39 equations, 7 figures, 5 tables.

Key Result

Theorem 3.1

The expected description length, $\mathbb{E}[l(X^{\Delta})]$, of the random variable $X^{\Delta}$, assuming optimal coding under the probability density function $f_G(x)$, is bounded as follows,

Figures (7)

  • Figure 1: SAX representation of a time series. In this example, a time series of length $N=120$ is first transformed into its PAA representation by dividing the series into $M=12$ segments and averaging each one. Then, each segment is assigned a binary codeword according to which of the $\kappa=8$ equiprobable intervals of the standard Gaussian pdf its mean falls in. Each quantization interval is bounded by two cutlines and is assigned a codeword from the alphabet $A=\{0_2,1_2,\dots,7_2\}$.
  • Figure 2: The kernel functions employed by our proposed method for density estimation.
  • Figure 3: Various density-based discretization schemes. The density is estimated via KDE. Conventional SAX employs equiprobable quantization assuming a Gaussian distribution (see Fig. 1). Our pSAX employs the Lloyd-Max quantizer and cSAX employs the mean-shift clustering. Notice the decreasing amount of energy in the dominant mode.
  • Figure 4: ROC curves for different training set sizes, with no dimensionality reduction. The training set sizes are expressed in percentages of the original dataset. Note that for the uniform quantizer, the optimal average alphabet size is set empirically ($\kappa = 10$, fixed for all datasets), whilst the mean-shift automatically detects the number of clusters.
  • Figure 5: ROC curves (top) and AUC values (bottom table) of the SAX-based representations (without dimensionality reduction) on NAB's datasets for different training set sizes: a) 20% of total samples, b) 33% of total samples, c) 66% of total samples, d) 100% of total samples.
  • ...and 2 more figures
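
The conventional SAX pipeline described in the Figure 1 caption (PAA followed by equiprobable Gaussian quantization) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the function names `paa`, `gaussian_cutlines`, and `sax` are hypothetical, the series is z-normalized with the population standard deviation, $N$ is assumed to be divisible by $M$, and symbols are returned as integer indices rather than binary codewords.

```python
import math
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev


def paa(series, num_segments):
    """Piecewise aggregate approximation: mean of each equal-length segment."""
    n = len(series)
    if n % num_segments != 0:
        raise ValueError("this sketch assumes N is divisible by M")
    width = n // num_segments
    return [fmean(series[i:i + width]) for i in range(0, n, width)]


def gaussian_cutlines(alphabet_size):
    """Cutlines that split the standard Gaussian into equiprobable intervals."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]


def sax(series, num_segments, alphabet_size):
    """Z-normalize, reduce with PAA, then map each segment mean to a symbol index."""
    mu, sigma = fmean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be z-normalized")
    normalized = [(x - mu) / sigma for x in series]
    cutlines = gaussian_cutlines(alphabet_size)
    # The symbol index is the number of cutlines at or below the segment mean.
    return [bisect_right(cutlines, v) for v in paa(normalized, num_segments)]


# Same sizes as in Figure 1: N = 120 samples, M = 12 segments, kappa = 8 symbols.
series = [math.sin(2 * math.pi * t / 60) for t in range(120)]
symbols = sax(series, num_segments=12, alphabet_size=8)
print(symbols)
```

The cutlines here depend only on the alphabet size, not on the data, which is the Gaussian, equiprobable-symbol assumption that the data-driven variants replace.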

Theorems & Definitions (5)

  • Theorem 3.1
  • Corollary
  • proof
  • Definition : Dynamic clustering criterion
  • proof