Towards Scalable Meta-Learning of near-optimal Interpretable Models via Synthetic Model Generations

Kyaw Hpone Myint, Zhe Wu, Alexandre G. R. Day, Giri Iyengar

TL;DR

The paper tackles the challenge of meta-learning interpretable, near-optimal decision trees without reliance on costly real-world data or exhaustive optimal-tree solvers. It introduces an SCM-based synthetic data workflow that, via a four-step pipeline and targeted quality filters, enables scalable pre-training of the MetaTree transformer on large, diverse datasets with near-optimal targets. Empirical results show that MetaTree trained on synthetic data achieves accuracy close to a hand-curated baseline and competitive performance against CART and GOSDT as the number of trees grows, while offering constant-time data generation and scalable training. This approach promises practical gains for high-stakes domains by delivering efficient, interpretable decision-tree models with reduced data and compute requirements.

Abstract

Decision trees are widely used in high-stakes fields like finance and healthcare due to their interpretability. This work introduces an efficient, scalable method for generating synthetic pre-training data to enable meta-learning of decision trees. Our approach samples near-optimal decision trees synthetically, creating large-scale, realistic datasets. Using the MetaTree transformer architecture, we demonstrate that this method achieves performance comparable to pre-training on real-world data or with computationally expensive optimal decision trees. This strategy significantly reduces computational costs, enhances data generation flexibility, and paves the way for scalable and efficient meta-learning of interpretable decision tree models.
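The idea of sampling a tree first and deriving the dataset from it can be illustrated with a minimal sketch. This is not the authors' pipeline: features here are drawn uniformly at random rather than from a structural causal model, and every name (`sample_tree`, `predict`, `make_dataset`) and parameter value is a hypothetical choice made for illustration only.

```python
import random


def sample_tree(depth, n_features, n_classes, rng):
    """Recursively sample a random binary decision tree over binary features."""
    if depth == 0:
        return {"label": rng.randrange(n_classes)}
    return {
        "feature": rng.randrange(n_features),
        "left": sample_tree(depth - 1, n_features, n_classes, rng),   # taken when feature == 0
        "right": sample_tree(depth - 1, n_features, n_classes, rng),  # taken when feature == 1
    }


def predict(tree, x):
    """Follow the splits until a leaf is reached and return its label."""
    while "label" not in tree:
        tree = tree["right"] if x[tree["feature"]] else tree["left"]
    return tree["label"]


def make_dataset(tree, n_rows, n_features, n_classes, label_noise, rng):
    """Draw random binary rows (an assumption, not an SCM) and label them with the tree."""
    X = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_rows)]
    y = []
    for x in X:
        label = predict(tree, x)
        if rng.random() < label_noise:
            # Replace the label with a uniformly random class.
            label = rng.randrange(n_classes)
        y.append(label)
    return X, y


rng = random.Random(0)
tree = sample_tree(depth=2, n_features=5, n_classes=3, rng=rng)
X, y = make_dataset(tree, n_rows=200, n_features=5, n_classes=3,
                    label_noise=0.0, rng=rng)
```

With `label_noise=0.0` the sampled tree reproduces every label exactly, so it can serve as a training target without running an optimal-tree solver on the generated data. Raising `label_noise` makes the tree only approximately consistent with the labels.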

Paper Structure

This paper contains 22 sections, 2 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Meta-learning workflow for generating look-ahead trees using synthetic data. The workflow can be divided into two parts: (1) a meta-learning step, in which labeled synthetic datasets are fed into MetaTree along with the optimal decision tree for each dataset as the training target, and (2) an inference step, in which the pre-trained model is used to predict look-ahead trees on unseen, real-world datasets.
  • Figure 2: Workflow for pre-training with synthetic data
  • Figure 3: Time complexity of generating pre-training targets as a function of: (a) the depth of the tree, and (b) the number of binary features
  • Figure 4: MetaTree scaling laws as a function of: (a) dataset size, (b) label noise, and (c) model complexity
  • Figure 5: Class imbalance distribution comparison between MetaTree benchmarks and the synthetic datasets generated using the proposed workflow. The synthetic datasets display a more uniform class distribution within the selected range [0, 0.3] than the hand-curated MetaTree benchmarks. Class count distribution comparison between MetaTree benchmarks and the synthetic datasets generated using the proposed workflow. The MetaTree benchmarks were hand-picked and hence display an irregular distribution, whereas the class count distribution in the synthetic datasets is well explained by the quality filters we enforced: both the accuracy filter and the class imbalance filter favor datasets with a smaller number of classes.
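
The quality filters mentioned in the Figure 5 caption could be applied along the following lines. This is only a sketch under stated assumptions: the page does not define the imbalance metric or the accuracy threshold, so the metric used here (gap between the largest and smallest class proportion), the `min_accuracy` value, and all function names are hypothetical. Only the [0, 0.3] imbalance range comes from the caption.

```python
from collections import Counter


def class_imbalance(labels):
    """Assumed metric: largest class proportion minus smallest class proportion."""
    counts = Counter(labels)
    if len(counts) < 2:
        # Assumption: a single-class dataset is treated as maximally imbalanced.
        return 1.0
    n = len(labels)
    proportions = [c / n for c in counts.values()]
    return max(proportions) - min(proportions)


def accuracy(predictions, labels):
    """Fraction of rows where the target tree's prediction matches the label."""
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


def passes_filters(predictions, labels, max_imbalance=0.3, min_accuracy=0.8):
    """Keep a dataset only if it is balanced enough and its tree fits well enough.

    max_imbalance=0.3 mirrors the [0, 0.3] range in the caption;
    min_accuracy=0.8 is a hypothetical placeholder.
    """
    return (class_imbalance(labels) <= max_imbalance
            and accuracy(predictions, labels) >= min_accuracy)
```

For example, `passes_filters([0, 0, 1, 1], [0, 0, 1, 1])` keeps a perfectly balanced, perfectly fitted dataset, while a dataset split 9 to 1 between two classes is rejected by the imbalance check regardless of accuracy.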