Approximate Optimal Active Learning of Decision Trees
Zunchen Huang, Chenglu Jin
TL;DR
<3-5 sentence high-level summary> This work addresses the problem of reconstructing an unknown Boolean function implemented as a depth-bounded decision tree using membership queries. It introduces a symbolic framework that encodes the entire hypothesis space as CNF and uses approximate model counting (ApproxMC) to guide near-optimal query selection, updating constraints as responses are observed. The SAMCAL approach provides formal guarantees (via SAT-based reasoning) while avoiding explicit enumeration, demonstrating scalable version-space reduction even for extremely large initial hypothesis spaces and employing a functional equivalence check to terminate when all remaining hypotheses are functionally identical. The method has practical impact for automated program analysis, testing, and verification scenarios where query efficiency and formal analysis are critical, and it contributes a general CNF encoding for depth-d trees alongside a counting-guided active-learning loop.
Abstract
We consider the problem of actively learning an unknown binary decision tree using only membership queries, a setting in which the learner must reason about a large hypothesis space while maintaining formal guarantees. Rather than enumerating candidate trees or relying on heuristic impurity or entropy measures, we encode the entire space of bounded-depth decision trees symbolically in SAT formulas. We propose a symbolic method for active learning of decision trees, in which approximate model counting is used to estimate the reduction of the hypothesis space caused by each potential query, enabling near-optimal query selection without full model enumeration. The resulting learner incrementally strengthens a CNF representation based on observed query outcomes, and approximate model counter ApproxMC is invoked to quantify the remaining version space in a sound and scalable manner. Additionally, when ApproxMC stagnates, a functional equivalence check is performed to verify that all remaining hypotheses are functionally identical. Experiments on decision trees show that the method reliably converges to the correct model using only a handful of queries, while retaining a rigorous SAT-based foundation suitable for formal analysis and verification.
