
Detection of Common Subtrees with Identical Label Distribution

Romain Azaïs, Florian Ingels

TL;DR

The paper tackles frequent pattern mining on tree-structured data by introducing a novel pattern class: common subtrees with identical label distribution. It develops DAG-RW, a lossless tree compression based on tree ciphering under a ciphering relation $\sim$, and an algorithm that jointly performs topology- and label-based deductions with backtracking to decide ciphering between trees. The authors provide a rigorous analysis of the algorithm's time complexity, demonstrate its scalability on synthetic data, and validate its practical value through real-data experiments on INEX datasets, showing DAG-RW captures patterns missed by unlabelled or labelled subtrees while preserving label information. Overall, DAG-RW enables parsimonious, label-aware pattern mining in large tree datasets, offering improved compression and richer pattern discovery for non-Euclidean data domains.

Abstract

Frequent pattern mining is a relevant method to analyse structured data, like sequences, trees or graphs. It consists of identifying characteristic substructures of a dataset. This paper deals with a new type of pattern for tree data: common subtrees with identical label distribution. Their detection is far from obvious since the underlying isomorphism problem is graph-isomorphism complete. An elaborate search algorithm is developed and analysed from both theoretical and numerical perspectives. Based on this, the enumeration of patterns is performed through a new lossless compression scheme for trees, called DAG-RW, whose complexity is investigated as well. The method shows very good properties, both in terms of computation times and analysis of real datasets from the literature. Compared to other substructures like topological subtrees and labelled subtrees, for which the isomorphism problem is linear, the patterns found provide a more parsimonious representation of the data.
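To make the notion of a tree ciphering concrete, here is a minimal Python sketch that decides by exhaustive backtracking whether two small rooted, unordered, labelled trees are isomorphic up to a bijective relabelling (a cipher). It is not the paper's algorithm: the tuple representation `(label, [children])` and the helper names `extend`, `match_lists` and `find_cipher` are assumptions made for illustration, and the search is exponential in the worst case.

```python
from itertools import permutations

def extend(t1, t2, f, g):
    """Yield every extension (f, g) of a partial cipher under which t1 maps onto t2.

    f maps labels of the first tree to labels of the second; g is its inverse.
    Trees are (label, [children]) tuples; children are treated as unordered.
    """
    (a, kids1), (b, kids2) = t1, t2
    if len(kids1) != len(kids2):
        return
    # The label pair (a, b) must agree with the cipher built so far, in both directions.
    if f.get(a, b) != b or g.get(b, a) != a:
        return
    f, g = {**f, a: b}, {**g, b: a}
    # Try every way of pairing the children (brute force).
    for perm in permutations(kids2):
        yield from match_lists(kids1, list(perm), f, g)

def match_lists(kids1, kids2, f, g):
    """Match children pairwise, threading the partial cipher through."""
    if not kids1:
        yield f, g
        return
    for f2, g2 in extend(kids1[0], kids2[0], f, g):
        yield from match_lists(kids1[1:], kids2[1:], f2, g2)

def find_cipher(t1, t2):
    """Return one cipher (dict) if the trees are isomorphic up to relabelling, else None."""
    return next((f for f, _ in extend(t1, t2, {}, {})), None)

# Hypothetical toy trees, not taken from the paper.
T1 = ("A", [("B", []), ("A", [("B", [])])])
T2 = ("x", [("x", [("y", [])]), ("y", [])])
T3 = ("x", [("y", []), ("x", [("x", [])])])

print(find_cipher(T1, T2))  # a cipher exists
print(find_cipher(T1, T3))  # same topology, but no bijective relabelling works
```

The pair `T1`, `T3` shows why topology alone is not enough: the trees have the same shape, yet no bijection between labels is consistent with any node mapping.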

Paper Structure (62 sections, 15 theorems, 36 equations, 17 figures, 1 table, 7 algorithms)

Key Result

proposition 1

Starting from the initial system given by $\mathbb{B}$ and $\mathbb{C}$, the number of states of any backtracking tree aiming to complete $\phi$ into a tree ciphering isomorphism is upper-bounded by $2(e-1)N(\dots$

Figures (17)

  • Figure 1: From left to right: a labelled tree $T$, its unlabelled DAG compression $\mathfrak{R}_\simeq(T)$ and its labelled DAG compression $\mathfrak{R}_{\simeq_l}(T)$. Nodes with the same equivalence class (with respect to $\simeq$) are colored accordingly.
  • Figure 2: Two topologically isomorphic labelled trees $T_1$ and $T_2$, as well as an example of tree isomorphism $\phi$ between them (left). $\phi$ is also a tree ciphering, as the binary relation $R_{\phi}$ is bijective and induces a cipher $f_{\phi}$ (right). Nodes and mapping $\phi$ are colored according to the corresponding relations between labels in $f_{\phi}$. Note that $\phi$ is not the only tree isomorphism that yields a tree ciphering for these particular trees.
  • Figure 3: MapNodes
  • Figure 4: Running example: histogram of labels (step 1/6). Two topologically isomorphic labelled trees $T_1$ (left) and $T_2$ (right). The color on nodes indicates the equivalence class under $\simeq$. The nodes have been numbered from $u_1$ to $u_{16}$ in $T_1$ (resp. from $v_1$ to $v_{16}$ in $T_2$) in breadth-first search order. The histograms of labels give $H_1(2)=\lbrace D,E\rbrace$, $H_1(4) = \lbrace A,B,C\rbrace$ and $H_2(2) =\lbrace \delta,\eta\rbrace$, $H_2(4) = \lbrace \alpha,\beta,\gamma\rbrace$. Since the histograms coincide, we initialise the bags (gray boxes) by grouping together the nodes whose labels appear with the same frequency. The initial size of the search space is $N(\mathbb{B})=12! \times 4! = 11{,}496{,}038{,}400$.
  • Figure 5: Running example: depth (step 2/6). Since $u_1$ and $v_1$ are the only nodes with depth zero, via Deduction Rule \ref{ded:bags}, we map $\phi(u_1)=v_1$ and $f(B)=\beta$. The children of $u_1$ and $v_1$ should be set aside from the other nodes according to the SplitChildren procedure, but they are already alone in their bag. The size of the search space is now $N(\mathbb{B})=5! \times 6! \times 4! = 2{,}073{,}600$.
  • ...and 12 more figures
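
The arithmetic in the captions of Figures 4 and 5 can be reproduced with a short Python sketch. It assumes that $N(\mathbb{B})$ is the product of the factorials of the bag sizes, which is consistent with both values quoted above but is not stated explicitly in this summary. The function names are hypothetical, and the label multiset is reconstructed from the histogram $H_1$ only, since the actual placement of labels on the nodes is not given here.

```python
from collections import Counter
from math import factorial, prod

def label_histogram(labels):
    """Map each frequency k to the set of labels that appear exactly k times."""
    hist = {}
    for label, k in Counter(labels).items():
        hist.setdefault(k, set()).add(label)
    return hist

def initial_bag_sizes(labels):
    """Group nodes by the frequency of their label; return frequency -> number of nodes."""
    counts = Counter(labels)
    return dict(Counter(counts[label] for label in labels))

def histograms_coincide(labels1, labels2):
    """Necessary condition for a cipher: same number of labels at each frequency."""
    h1, h2 = label_histogram(labels1), label_histogram(labels2)
    return {k: len(v) for k, v in h1.items()} == {k: len(v) for k, v in h2.items()}

def search_space_size(bag_sizes):
    """Assumed definition: product of the factorials of the bag sizes."""
    return prod(factorial(s) for s in bag_sizes)

# Label multiset consistent with H_1(4) = {A, B, C} and H_1(2) = {D, E} (16 nodes).
labels_t1 = list("AAAABBBBCCCCDDEE")
# Stand-in symbols for the Greek labels of T_2 (alpha, beta, gamma, delta, eta).
labels_t2 = list("aaaabbbbccccddee")

print(label_histogram(labels_t1))
print(histograms_coincide(labels_t1, labels_t2))
sizes = initial_bag_sizes(labels_t1)           # 12 nodes in one bag, 4 in the other
print(search_space_size(sizes.values()))       # step 1/6
print(search_space_size([5, 6, 4]))            # step 2/6, bag sizes read off the caption
```

Under this assumption, fixing the root and splitting the large bag shrinks the search space from $12! \times 4!$ to $5! \times 6! \times 4!$, a reduction by a factor of 5544.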

Theorems & Definitions (24)

  • definition 1
  • definition 2
  • definition 3
  • definition 4
  • proposition 1
  • proposition 2
  • theorem 1
  • lemma 1
  • proposition 3
  • lemma 2
  • ...and 14 more