Scalable Sampling for High Utility Patterns

Lamine Diop, Marc Plantevit

TL;DR

This work proposes a novel high utility pattern sampling algorithm and its on-disk version, both designed for large quantitative databases and based on two original theorems, and presents a compelling use case involving the discovery of sub-profiles in archaeological knowledge graphs.

Abstract

Discovering valuable insights from data through meaningful associations is a crucial task. However, it becomes challenging when trying to identify representative patterns in quantitative databases, especially large ones, because enumeration-based strategies struggle with the vast search space involved. To tackle this challenge, output space sampling methods have emerged as a promising solution thanks to their ability to discover valuable patterns with reduced computational overhead. However, existing sampling methods often encounter limitations when dealing with large quantitative databases, resulting in scalability-related challenges. In this work, we propose a novel high utility pattern sampling algorithm and its on-disk version, both designed for large quantitative databases and based on two original theorems. Our approach ensures both the interactivity required for user-centered methods and strong statistical guarantees through random sampling. Thanks to our method, users can discover relevant and representative utility patterns within seconds, facilitating efficient exploration of the database. To demonstrate the interest of our approach, we present a compelling use case involving the discovery of sub-profiles in archaeological knowledge graphs. Experiments on semantic and non-semantic quantitative databases show that our approach outperforms state-of-the-art methods.
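To make the idea of output space sampling concrete, the following is a minimal sketch of a naive, enumeration-based baseline that draws patterns with probability proportional to their utility. It is not the algorithm proposed in the paper. It assumes the common convention that the utility of a pattern in a transaction is the sum of the utilities of its items there, and that its utility in the database is the sum over the transactions containing it. The toy database and the names `TOY_DB`, `pattern_utilities`, and `sample_patterns` are hypothetical.

```python
import random
from itertools import combinations

# Hypothetical toy quantitative database: each transaction maps an item to
# its utility in that transaction. The values are illustrative only.
TOY_DB = [
    {"A": 2, "B": 1, "C": 3},
    {"A": 1, "C": 4},
    {"B": 5, "D": 2},
]


def pattern_utilities(db, max_len=None):
    """Enumerate every non-empty itemset occurring in db and total its utility.

    Assumed convention: utility in a transaction is the sum of the item
    utilities; utility in db is the sum over transactions containing the pattern.
    """
    totals = {}
    for transaction in db:
        items = sorted(transaction)
        limit = len(items) if max_len is None else min(max_len, len(items))
        for size in range(1, limit + 1):
            for pattern in combinations(items, size):
                utility = sum(transaction[item] for item in pattern)
                totals[pattern] = totals.get(pattern, 0) + utility
    return totals


def sample_patterns(db, k, max_len=None, seed=None):
    """Draw k patterns (with replacement), each proportionally to its utility."""
    rng = random.Random(seed)
    totals = pattern_utilities(db, max_len)
    patterns = list(totals)
    weights = [totals[p] for p in patterns]
    return rng.choices(patterns, weights=weights, k=k)


if __name__ == "__main__":
    for pattern in sample_patterns(TOY_DB, k=5, seed=0):
        print(pattern)
```

A transaction with n items contributes 2^n - 1 candidate patterns, so this baseline becomes impractical as transactions grow. That exponential cost is the scalability problem the abstract attributes to enumeration-based strategies.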

Paper Structure

This paper contains 29 sections, 2 theorems, 3 equations, 7 figures, 7 tables, 2 algorithms.

Key Result

Theorem 1

Let $\mathcal{V}_t$ be the upper triangle utility of any given quantitative transaction $t$. For any positive integers $\ell$ and $i$ such that $\ell \leq i \leq |t|$, the following statement holds: $\mathcal{V}_t(\ell, i) = \binom{i-1}{\

Figures (7)

  • Figure 1: Discovery of sub-profiles in knowledge graphs
  • Figure 2: Toy knowledge graph profile
  • Figure 3: Sub-profile from the pattern $X_2X_4$
  • Figure 4: Comparing the evolution of execution time based on maximum length constraint (in gray = out of memory)
  • Figure 5: Representativeness of 1,000 drawn patterns by QPlus and Bootstrap
  • ...and 2 more figures

Theorems & Definitions (18)

  • Example 1
  • Definition 1: Semantic quantitative transaction and qDB
  • Example 2
  • Definition 2: Pattern utility in a transaction
  • Definition 3: Utility of a pattern
  • Example 3
  • Definition 4: Upper Triangle Utility
  • Example 4
  • Theorem 1
  • proof
  • ...and 8 more