Detection of Common Subtrees with Identical Label Distribution
Romain Azaïs, Florian Ingels
TL;DR
The paper tackles frequent pattern mining on tree-structured data by introducing a new pattern class: common subtrees with identical label distribution. It develops DAG-RW, a lossless tree compression scheme based on tree ciphering under a ciphering relation $\sim$, and an algorithm that combines topology-based and label-based deductions with backtracking to decide whether two trees are equivalent under ciphering. The authors analyse the algorithm's time complexity, demonstrate its scalability on synthetic data, and validate its practical value through experiments on the INEX datasets, showing that DAG-RW captures patterns missed by unlabelled or labelled subtrees while preserving label information. Overall, DAG-RW enables parsimonious, label-aware pattern mining in large tree datasets, offering improved compression and richer pattern discovery for non-Euclidean data domains.
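To make the idea of lossless DAG compression concrete, here is a minimal Python sketch of the classic DAG reduction for unlabelled, unordered trees, in which identical subtrees are stored only once. It is not DAG-RW itself: labels and the ciphering relation $\sim$ are ignored, and the nested-list tree encoding and the function names `dag_compress` and `dag_decompress` are assumptions made for illustration.

```python
def dag_compress(tree, table=None):
    """Compress an unordered, unlabelled tree given as nested lists.

    A leaf is []; an internal node is the list of its children.
    Returns (root_id, table), where table maps the sorted tuple of
    children ids to the id of the corresponding distinct subtree.
    """
    if table is None:
        table = {}
    # Sorting the children ids makes the key independent of child order.
    child_ids = tuple(sorted(dag_compress(child, table)[0] for child in tree))
    if child_ids not in table:
        table[child_ids] = len(table)  # first time this subtree shape is seen
    return table[child_ids], table


def dag_decompress(root_id, table):
    """Rebuild a tree (up to child order) from its DAG representation."""
    inverse = {node_id: child_ids for child_ids, node_id in table.items()}

    def build(node_id):
        return [build(child_id) for child_id in inverse[node_id]]

    return build(root_id)


# Root with three children: two identical subtrees (two leaves each) and a leaf.
example_tree = [[[], []], [[], []], []]
example_root, example_table = dag_compress(example_tree)
# The 8-node tree is represented by 3 DAG nodes: leaf, two-leaf node, root.
```

Decompressing and compressing again yields the same table, which is the sense in which this toy scheme is lossless up to the order of children.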
Abstract
Frequent pattern mining is a relevant method for analysing structured data, such as sequences, trees or graphs. It consists of identifying characteristic substructures of a dataset. This paper deals with a new type of pattern for tree data: common subtrees with identical label distribution. Their detection is far from obvious, since the underlying isomorphism problem is graph-isomorphism complete. An elaborate search algorithm is developed and analysed from both theoretical and numerical perspectives. Based on this algorithm, the patterns are enumerated through a new lossless compression scheme for trees, called DAG-RW, whose complexity is investigated as well. The method shows very good properties, both in terms of computation times and in the analysis of real datasets from the literature. Compared to other substructures, such as topological subtrees and labelled subtrees, for which the isomorphism problem is linear, the patterns found provide a more parsimonious representation of the data.
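The notion of two subtrees having an identical label distribution can be illustrated with a small brute-force check. The sketch below assumes that this means the trees coincide, as unordered trees, after a one-to-one renaming of labels; that is our reading of the abstract, not a definition quoted from it. It enumerates every label bijection, so it is only usable on tiny examples and is not the search algorithm developed in the paper. The `(label, children)` encoding and all function names are hypothetical.

```python
from itertools import permutations


def canon(tree):
    """Canonical form of an unordered labelled tree (label, [children])."""
    label, children = tree
    return (label, tuple(sorted(canon(child) for child in children)))


def labels(tree):
    """Set of labels appearing in the tree."""
    label, children = tree
    found = {label}
    for child in children:
        found |= labels(child)
    return found


def relabel(tree, mapping):
    """Apply a label renaming to every node of the tree."""
    label, children = tree
    return (mapping[label], [relabel(child, mapping) for child in children])


def find_cipher(t1, t2):
    """Return a label bijection turning t1 into t2 (as unordered trees), or None.

    Brute force over all bijections: factorial in the number of labels.
    """
    l1, l2 = sorted(labels(t1)), sorted(labels(t2))
    if len(l1) != len(l2):
        return None
    target = canon(t2)
    for perm in permutations(l2):
        mapping = dict(zip(l1, perm))
        if canon(relabel(t1, mapping)) == target:
            return mapping
    return None


t1 = ("a", [("b", []), ("b", []), ("c", [])])
t2 = ("x", [("z", []), ("y", []), ("y", [])])  # same shape, labels renamed
t3 = ("x", [("y", []), ("z", []), ("x", [])])  # same shape, no valid renaming
```

Here `find_cipher(t1, t2)` finds the renaming a→x, b→y, c→z, whereas `find_cipher(t1, t3)` returns `None`: the two trees share their topology and their number of distinct labels, yet no bijection maps the two `b` leaves of `t1` onto two equally labelled leaves of `t3`.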
