Efficient Mining of Low-Utility Sequential Patterns

Jian Zhu, Zhidong Lin, Wensheng Gan, Ruichu Cai, Zhifeng Hao, Philip S. Yu

TL;DR

This work formalizes the LUSPM problem, redefines sequence utility, introduces a compact data structure called the sequence-utility chain to efficiently record utility information, and proposes three novel algorithms--LUSPM_b, LUSPM_s, and LUSPM_e--to discover the complete set of low-utility sequential patterns.

Abstract

Discovering valuable insights from rich data is a crucial task for exploratory data analysis. Sequential pattern mining (SPM) has found widespread applications across various domains. In recent years, low-utility sequential pattern mining (LUSPM) has shown strong potential in applications such as intrusion detection and genomic sequence analysis. However, existing research in utility-based SPM focuses on high-utility sequential patterns, and the definitions and strategies used in high-utility SPM cannot be directly applied to LUSPM. Moreover, no algorithms have yet been developed specifically for mining low-utility sequential patterns. To address these problems, we formalize the LUSPM problem, redefine sequence utility, and introduce a compact data structure called the sequence-utility chain to efficiently record utility information. Furthermore, we propose three novel algorithms--LUSPM_b, LUSPM_s, and LUSPM_e--to discover the complete set of low-utility sequential patterns. LUSPM_b serves as an exhaustive baseline, while LUSPM_s and LUSPM_e build upon it, generating subsequences through shrinkage and extension operations, respectively. In addition, we introduce the maximal non-mutually contained sequence set and incorporate multiple pruning strategies, which significantly reduce redundant operations in both LUSPM_s and LUSPM_e. Finally, extensive experimental results demonstrate that both LUSPM_s and LUSPM_e substantially outperform LUSPM_b and exhibit excellent scalability. Notably, LUSPM_e achieves superior efficiency, requiring less runtime and memory than LUSPM_s. Our code is available at https://github.com/Zhidong-Lin/LUSPM.

Paper Structure

This paper contains 28 sections, 5 theorems, 5 equations, 9 figures, 4 tables, 8 algorithms.

Key Result

Theorem 1

For a sequence $F$ and an item $a$ at position $j$ in $F$, $u(a, j - 1, F) \leq u(F)$.

Figures (9)

  • Figure 1: Sequence-utility chain of sequences $\langle a, b, c \rangle$, $\langle a, b \rangle$, $\langle a \rangle$, and $\langle g, a \rangle$.
  • Figure 2: Shrinkage search tree of sequence $\langle a, b, c, a, d \rangle$ when minUtil = 4.
  • Figure 3: Extension search tree of sequence $\langle a, b, c, a, d \rangle$ when minUtil = 4.
  • Figure 4: Time consumption analysis of LUSPM$_s$ and LUSPM$_e$.
  • Figure 5: Memory consumption analysis of LUSPM$_s$ and LUSPM$_e$.
  • ...and 4 more figures

Theorems & Definitions (28)

  • Definition 1: Matching (yin2012uspan)
  • Definition 2: Q-sequence Containing (yin2012uspan)
  • Definition 3: Length of Sequence (yin2012uspan)
  • Definition 4: Support of Sequence (agrawal1995mining)
  • Definition 5: Utility of Q-item
  • Definition 6: Utility of Sequence (yin2012uspan)
  • Definition 7: Utility of Database (yin2012uspan)
  • Definition 8: Low-utility Sequential Pattern
  • Definition 9: Sequence Shrinkage and Removed-index
  • Definition 10: Sequence Extension (ayres2002sequential)
  • ...and 18 more