Learning Tree Pattern Transformations
Daniel Neider, Leif Sabellek, Johannes Schmidt, Fabian Vehlken, Thomas Zeume
TL;DR
The paper studies how to learn concise explanations for structural differences between labelled, ordered trees, where an explanation is a small set of tree pattern transformations. It introduces a pattern-based language, with injective matching and rules consisting of a body pattern and a head pattern, that captures local rearrangements and substitutions, and it formalizes the resulting learning problem as LearningTreeTransformations. It shows that the problem is NP-hard, and NP-complete in restricted cases, via reductions from VertexCover and 3-SAT, and it describes a SAT-based encoding for solving instances drawn from CS education data. It also discusses extending the language with interval variables to model tree edits, and the trade-off between expressivity and tractability that this entails. Overall, the work provides a computational framework for extracting high-level structural explanations from tree-structured data, with possible applications in education.
Abstract
Explaining why and how a tree $t$ structurally differs from another tree $t^\star$ is a question that is encountered throughout computer science, including in understanding tree-structured data such as XML or JSON. In this article, we explore how to learn explanations for structural differences between pairs of trees from sample data: suppose we are given a set $\{(t_1, t_1^\star),\dots, (t_n, t_n^\star)\}$ of pairs of labelled, ordered trees; is there a small set of rules that explains the structural differences between all pairs $(t_i, t_i^\star)$? This raises two research questions: (i) what is a good notion of "rule" in this context; and (ii) how can sets of rules explaining a data set be learned algorithmically? We explore these questions from the perspective of database theory by (1) introducing a pattern-based specification language for tree transformations; (2) exploring the computational complexity of variants of the above algorithmic problem, e.g., showing NP-hardness for very restricted variants; and (3) discussing how to solve the problem for data from CS education research using SAT solvers.
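
To make the idea of a rule "explaining" a pair $(t, t^\star)$ concrete, the following is a minimal sketch under a deliberately simplified toy semantics; it is not the paper's definition. It assumes that a rule is a body pattern and a head pattern, that variables (strings such as `"?x"`) bind whole subtrees, that a pattern node matches only a tree node with the same label and the same number of children, and that a rule explains a pair if a single application at some node of $t$ yields $t^\star$. The names `node`, `match`, `instantiate`, `rewrites`, and `explains` are hypothetical helpers introduced for illustration.

```python
# Toy model (an assumption, not the paper's semantics):
# a tree is (label, children); a pattern may also contain variables "?x".

def node(label, *children):
    """Build a labelled, ordered tree or pattern node."""
    return (label, tuple(children))

def match(pattern, tree, env):
    """Match pattern against tree at its root; return bindings or None."""
    if isinstance(pattern, str):  # a variable binds a whole subtree
        if pattern in env:
            return env if env[pattern] == tree else None
        return {**env, pattern: tree}
    plabel, pchildren = pattern
    tlabel, tchildren = tree
    if plabel != tlabel or len(pchildren) != len(tchildren):
        return None
    for p, t in zip(pchildren, tchildren):
        env = match(p, t, env)
        if env is None:
            return None
    return env

def instantiate(pattern, env):
    """Replace every variable in pattern by the subtree bound to it."""
    if isinstance(pattern, str):
        return env[pattern]
    label, children = pattern
    return (label, tuple(instantiate(c, env) for c in children))

def rewrites(body, head, tree):
    """Yield each tree obtained by applying body -> head at exactly one node."""
    env = match(body, tree, {})
    if env is not None:
        yield instantiate(head, env)
    label, children = tree
    for i, child in enumerate(children):
        for new_child in rewrites(body, head, child):
            yield (label, children[:i] + (new_child,) + children[i + 1:])

def explains(body, head, t, t_star):
    """True if one application of the rule turns t into t_star."""
    return any(r == t_star for r in rewrites(body, head, t))

# Hypothetical rule: swap the two arguments of an "and" node.
body = node("and", "?x", "?y")
head = node("and", "?y", "?x")

t = node("or", node("and", node("a"), node("b")), node("c"))
t_star = node("or", node("and", node("b"), node("a")), node("c"))
print(explains(body, head, t, t_star))
```

In this toy setting, the learning problem from the abstract would ask for a small set of such rules so that every pair in the sample is explained by some rule; the sketch only checks a single given rule against a single pair and does not attempt the search itself.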
