NSPG-Miner: Mining Repetitive Negative Sequential Patterns

Yan Li, Zhulin Wang, Jing Liu, Lei Guo, Philippe Fournier-Viger, Youxi Wu, Xindong Wu

TL;DR

NSPG-Miner tackles the problem of mining repetitive negative sequential patterns under gap constraints by jointly discovering PSPGs and NSPGs. It introduces two key innovations: a pattern join strategy with negative patterns to prune candidates, and the NegPair support calculator, which uses prefix/suffix key-value arrays to compute pattern supports without rescanning the data. The method demonstrates improved efficiency and discovers more informative negative patterns than state-of-the-art PSPG and NSP mining algorithms across diverse datasets, including a SARS comparison case study. Although it is slower than some negative-SPM baselines on certain datasets, NSPG-Miner provides a complete, gap-constrained NSPG mining solution that applies to domains such as bioinformatics and fraud or behavior analysis.

Abstract

Sequential pattern mining (SPM) with gap constraints (or repetitive SPM or tandem repeat discovery in bioinformatics) can find frequent repetitive subsequences satisfying gap constraints, which are called positive sequential patterns with gap constraints (PSPGs). However, classical SPM with gap constraints cannot find the frequent missing items in the PSPGs. To tackle this issue, this paper explores negative sequential patterns with gap constraints (NSPGs). We propose an efficient NSPG-Miner algorithm that can mine both frequent PSPGs and NSPGs simultaneously. To effectively reduce candidate patterns, we propose a pattern join strategy with negative patterns which can generate both positive and negative candidate patterns at the same time. To calculate the support (frequency of occurrence) of a pattern in each sequence, we explore a NegPair algorithm that employs a key-value pair array structure to deal with the gap constraints and the negative items simultaneously and can avoid redundant rescanning of the original sequence, thus improving the efficiency of the algorithm. To report the performance of NSPG-Miner, 11 competitive algorithms and 11 datasets are employed. The experimental results not only validate the effectiveness of the strategies adopted by NSPG-Miner, but also verify that NSPG-Miner can discover more valuable information than the state-of-the-art algorithms. Algorithms and datasets can be downloaded from https://github.com/wuc567/Pattern-Mining/tree/master/NSPG-Miner.
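To make the notion of support under gap constraints with negative items concrete, the following is a minimal brute-force sketch in Python. It is not the NegPair algorithm: it rescans the sequence directly, which is exactly what NegPair is designed to avoid. It assumes that a pattern such as a[0,1](¬b)c matches when the gap between a and c holds 0 to 1 items and none of the skipped items is b, and that every occurrence is counted. The function name `count_occurrences`, its parameters, and the test sequences are hypothetical and chosen for illustration only.

```python
from typing import List, Optional, Set, Tuple


def count_occurrences(
    seq: str,
    items: List[str],
    gap: Tuple[int, int],
    negs: List[Optional[Set[str]]],
) -> int:
    """Count all occurrences of a gap-constrained pattern by brute force.

    items: the positive items of the pattern, in order.
    gap:   (min_gap, max_gap), applied between every pair of adjacent items.
    negs:  negs[k] is the set of items forbidden in the gap between
           items[k] and items[k + 1], or None for no negative constraint
           (assumed semantics of a negative item).
    """
    min_gap, max_gap = gap

    def extend(pos: int, k: int) -> int:
        # pos is the index matched to items[k - 1]; try to match items[k].
        if k == len(items):
            return 1
        total = 0
        for g in range(min_gap, max_gap + 1):
            nxt = pos + g + 1
            if nxt >= len(seq):
                break
            if seq[nxt] != items[k]:
                continue
            forbidden = negs[k - 1]
            # Reject the match if a forbidden item appears among the skipped items.
            if forbidden and any(c in forbidden for c in seq[pos + 1:nxt]):
                continue
            total += extend(nxt, k + 1)
        return total

    return sum(extend(i, 1) for i, c in enumerate(seq) if c == items[0])


# Hypothetical sequences, not taken from the paper.
positive = count_occurrences("abc", ["a", "c"], (0, 1), [None])    # a[0,1]c
negative = count_occurrences("abc", ["a", "c"], (0, 1), [{"b"}])   # a[0,1](¬b)c
print(positive, negative)
```

In this sketch the positive pattern a[0,1]c occurs once in "abc", while the negative pattern a[0,1](¬b)c does not occur, because the only skipped item is b.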

Paper Structure

This paper contains 20 sections, 5 theorems, 21 figures, 3 tables, and 4 algorithms.

Key Result

Lemma 1

The total number of offset sequences of $\mathbf p$ in $SDB$ can be calculated according to $ofs(\mathbf p, SDB) = L\times W^{m-1}$, where $L$ is the length of $SDB$, $m$ is the length of $\mathbf p$, and $W = M-N+1$ [Min2020].
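As a quick arithmetic check of Lemma 1, the sketch below evaluates $L\times W^{m-1}$ in Python. It assumes that $N$ and $M$ are the minimum and maximum gap of a constraint written $[N, M]$, so that $W$ is the number of admissible gap sizes. The function name and the sample values are hypothetical.

```python
def offset_sequence_count(L: int, m: int, n_min: int, n_max: int) -> int:
    """Evaluate ofs(p, SDB) = L * W**(m - 1) with W = M - N + 1.

    L:     length of the sequence database.
    m:     length of the pattern.
    n_min: minimum gap N (assumed meaning of N).
    n_max: maximum gap M (assumed meaning of M).
    """
    W = n_max - n_min + 1  # number of admissible gap sizes
    return L * W ** (m - 1)


# Hypothetical values: L = 10, pattern length 3, gap [0,1], so W = 2.
print(offset_sequence_count(10, 3, 0, 1))  # 10 * 2**2 = 40
```

Each of the $L$ start positions combines with $W$ choices for each of the $m-1$ gaps, which is where the exponent comes from.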

Figures (21)

  • Figure 1: All occurrences of pattern $\mathbf p$ in sequence $\mathbf s_{1}$
  • Figure 2: Framework of NSPG-Miner. NSPG-Miner has two essential parts: candidate pattern generation and support calculation. To effectively generate candidate patterns with length $m$ ($m > 2$), we propose a pattern join strategy with negative patterns which can generate positive and negative candidate patterns simultaneously. To improve the efficiency of support calculation, we employ a key-value array structure to avoid redundant scanning of the whole sequence.
  • Figure 3: An illustrative example of pattern join strategy with negative patterns
  • Figure 4: Key-value pair arrays $A_1$, $A_2$, and $A$ of patterns $\mathbf p$ = a[0,1]a, $\mathbf q$ = a[0,1](¬b)c, and $\mathbf t$ = a[0,1]a[0,1](¬b)c in sequence $\mathbf s_{1}$, respectively.
  • Figure 5: Comparison of running time
  • ...and 16 more figures

Theorems & Definitions (31)

  • Example 1
  • Definition 1
  • Example 2
  • Definition 2
  • Example 3
  • Definition 3
  • Example 4
  • Definition 4
  • Lemma 1
  • Example 5
  • ...and 21 more