Scalable Order-Preserving Pattern Mining

Ling Li, Wiktor Zuba, Grigorios Loukides, Solon P. Pissis, Maria Matsangidou

TL;DR

This work tackles scalable, exact pattern mining for time series under the order-preserving (OP) relation, using an OP suffix tree (OPST) as an efficient index. The authors provide a practical $O(n\sigma\log\sigma)$-time, $O(n)$-space OPST construction and a faster $O(n\log\sigma)$-time variant that is mainly of theoretical interest, then develop linear-time, linear-space mining algorithms for both maximal and closed frequent OP patterns using phase-based traversals and LCA-based checks. Empirical results on multi-million-letter datasets show near-linear practical performance for OPST construction and speedups of up to orders of magnitude over state-of-the-art methods and natural Apriori-like baselines, and they indicate that clustering based on OP patterns is effective. The work demonstrates that OP-based pattern discovery can scale to real-world, large-scale time series and provide compact pattern-based representations for downstream tasks.

Abstract

Time series are ubiquitous in domains ranging from medicine to marketing and finance. Frequent Pattern Mining (FPM) from a time series has thus received much attention. Recently, it has been studied under the order-preserving (OP) matching relation stating that a match occurs when two time series have the same relative order on their elements. Here, we propose exact, highly scalable algorithms for FPM in the OP setting. Our algorithms employ an OP suffix tree (OPST) as an index to store and query time series efficiently. Unfortunately, there are no practical algorithms for OPST construction. Thus, we first propose a novel and practical $\mathcal{O}(n\sigma\log\sigma)$-time and $\mathcal{O}(n)$-space algorithm for constructing the OPST of a length-$n$ time series over an alphabet of size $\sigma$. We also propose an alternative faster OPST construction algorithm running in $\mathcal{O}(n\log\sigma)$ time using $\mathcal{O}(n)$ space; this algorithm is mainly of theoretical interest. Then, we propose an exact $\mathcal{O}(n)$-time and $\mathcal{O}(n)$-space algorithm for mining all maximal frequent OP patterns, given an OPST. This significantly improves on the state of the art, which takes $\Omega(n^3)$ time in the worst case. We also formalize the notion of closed frequent OP patterns and propose an exact $\mathcal{O}(n)$-time and $\mathcal{O}(n)$-space algorithm for mining all closed frequent OP patterns, given an OPST. We conducted experiments using real-world, multi-million-letter time series showing that our $\mathcal{O}(n\sigma\log\sigma)$-time OPST construction algorithm runs in $\mathcal{O}(n)$ time on these datasets despite the $\mathcal{O}(n\sigma\log\sigma)$ bound; that our frequent pattern mining algorithms are up to orders of magnitude faster than the state of the art and natural Apriori-like baselines; and that OP pattern-based clustering is effective.
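To make the OP matching relation concrete, the following is a minimal Python sketch, not the paper's OPST-based method. It assumes that two equal-length windows OP-match exactly when their dense rank sequences coincide, with equal values sharing a rank; this treatment of ties is an assumption, since the excerpt does not spell it out. The names `dense_ranks`, `op_match`, and `op_pattern_counts` are hypothetical, and the counting routine is a naive fixed-length baseline meant only to illustrate what "frequent OP pattern" means.

```python
from collections import defaultdict


def dense_ranks(values):
    """Map each element to the rank of its value among the distinct values.

    Assumption: equal values share a rank, so ties are preserved.
    """
    rank_of = {v: r for r, v in enumerate(sorted(set(values)))}
    return tuple(rank_of[v] for v in values)


def op_match(x, y):
    """Return True if x and y have the same relative order on their elements."""
    return len(x) == len(y) and dense_ranks(x) == dense_ranks(y)


def op_pattern_counts(series, m):
    """Naively count length-m windows of `series` per OP pattern.

    Each window is re-ranked from scratch, so this is far slower than an
    index-based approach and is only meant as an illustration.
    """
    counts = defaultdict(int)
    for i in range(len(series) - m + 1):
        counts[dense_ranks(series[i:i + m])] += 1
    return dict(counts)


if __name__ == "__main__":
    w = [1, 2, 4, 4, 2, 5, 5, 1]  # the string used in Figure 1
    print(op_match([1, 2, 4], [10, 20, 50]))  # same relative order
    print(op_match([1, 2, 4], [3, 2, 5]))     # different relative order
    print(op_pattern_counts(w, 2))
```

A frequency threshold $\tau$ would then keep only the patterns whose count is at least $\tau$. Re-ranking every window independently is what makes this sketch impractical on long series, which is the kind of cost an index such as the OPST is meant to avoid.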

Paper Structure

This paper contains 24 sections, 11 theorems, 2 equations, 7 figures, 3 tables, 2 algorithms.

Key Result

Lemma 1

Let $x$ and $y$ be two strings. Then

Figures (7)

  • Figure 1: Suffixes of $w$, $\textsf{SufCode}(w)$, and $\textsf{OPST}(w)$ for string $w=1~2~4~4~2~5~5~1$ over alphabet $\Sigma=[1,5]$ of size $\sigma=5$.
  • Figure 2: OPST (construction algorithm): (a) Runtime and (b) peak memory consumption for varying $n$. (c) Runtime and (d) peak memory consumption for varying $\sigma$.
  • Figure 3: OPST-MP, BA-MP, and MOPP: (a) Runtime and (b) peak memory consumption for varying $n$. (c) Runtime and (d) peak memory consumption for varying $\tau$. Missing bars for BA-MP and MOPP indicate that they did not finish within 24 hours. The value above each pair of bars in (a) and (c) represents the maximum length $k$ of all $\tau$-maximal $\tau$-frequent OP patterns.
  • Figure 5: OPST-MP: (a) Runtime and (b) peak memory consumption on all datasets for varying $\tau$. The value above each bar in (a) is the number of $\tau$-maximal $\tau$-frequent OP patterns. BA-MP and MOPP are omitted; they were slower than OPST-MP by more than one order of magnitude on average.
  • Figure 6: OPST-CP: (a) Runtime and (b) peak memory consumption on all datasets for varying $\tau$. The value above each bar in (a) is the number of closed $\tau$-frequent OP patterns. BA-CP is omitted, as it was slower than OPST-CP by more than one order of magnitude on average across all datasets.
  • ...and 2 more figures

Theorems & Definitions (24)

  • Example 1
  • Example 2: cont'd from Example 1
  • Example 3
  • Example 4
  • Example 5
  • Lemma 1: DBLP:journals/tcs/CrochemoreIKKLP16
  • Example 6: cont'd from Example \ref{bg:ex1}
  • Lemma 2: DBLP:journals/tcs/CrochemoreIKKLP16
  • Lemma 3: DBLP:journals/tcs/CrochemoreIKKLP16
  • Example 7
  • ...and 14 more