Learning Ensembles of Interpretable Simple Structure

Gaurav Arwade, Sigurdur Olafsson

TL;DR

The paper tackles the challenge of achieving both interpretability and accuracy in predictive modeling. It introduces simple structures (localized data subsets in which feature interactions are reduced) and trains a simple model within each subset. It proposes a bottom-up ensemble approach that identifies simple structures using a KNN-based growth process, enhanced with heuristics to handle realistic violations of its assumptions and a centroid-based refinement to improve recall. Empirical results on synthetic and open-source datasets show that ensembles of simple, interpretable models can match or exceed the performance of complex black-box models while offering transparent decision boundaries. By balancing explainability with predictive power without relying on strong distributional assumptions, the approach provides a principled framework for decision support in domains where transparency is essential, such as operations research and precision medicine.
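
The centroid-based refinement and its effect on recall can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy points, and the rule "assign each leftover point to the structure with the nearest centroid" are assumptions made for illustration, since the summary does not spell out the procedure. Precision and recall are computed against a known ground-truth structure, as is only possible on synthetic data.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def allocate_to_nearest_centroid(structures, unassigned):
    """Assumed refinement rule: each unassigned point joins the structure
    whose centroid (computed once, before allocation) is closest."""
    centroids = {name: centroid(pts) for name, pts in structures.items()}
    refined = {name: list(pts) for name, pts in structures.items()}
    for p in unassigned:
        nearest = min(centroids, key=lambda name: math.dist(p, centroids[name]))
        refined[nearest].append(p)
    return refined

def precision_recall(identified, truth):
    """Precision and recall of an identified subset against the true subset."""
    identified, truth = set(identified), set(truth)
    overlap = len(identified & truth)
    return overlap / len(identified), overlap / len(truth)

# Hypothetical toy data: the growth step recovered only part of S1.
true_s1 = [(0, 0), (0, 1), (1, 0), (1, 1)]
grown = {"S1": [(0, 0), (0, 1)], "S2": [(5, 5), (5, 6), (6, 5)]}
unassigned = [(1, 0), (1, 1), (6, 6)]

before = precision_recall(grown["S1"], true_s1)
refined = allocate_to_nearest_centroid(grown, unassigned)
after = precision_recall(refined["S1"], true_s1)
print("before:", before, "after:", after)
```

In this toy case the growth step is precise but incomplete, so allocation raises recall for S1 without hurting precision. With overlapping structures the same rule could also pull in wrong points, which is why both metrics need to be tracked.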

Abstract

Decision-making in complex systems often relies on machine learning models, yet highly accurate models such as XGBoost and neural networks can obscure the reasoning behind their predictions. In operations research applications, understanding how a decision is made is often as crucial as the decision itself. Traditional interpretable models, such as decision trees and logistic regression, provide transparency but may struggle with datasets containing intricate feature interactions. However, complexity in decision-making often stems from interactions that are relevant only within certain subsets of the data. Within these subsets, feature interactions may be simplified, forming simple structures where simple interpretable models can perform effectively. We propose a bottom-up algorithm that identifies simple structures: it partitions the data into interpretable subgroups in which feature interactions are minimized, so that a simple model can be trained within each subgroup. We demonstrate the robustness of the algorithm on synthetic data and show that the decision boundaries derived from simple structures are more interpretable and better aligned with domain intuition than those learned from a global model. By improving both explainability and predictive accuracy, our approach provides a principled framework for decision support in applications where model transparency is essential.
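
The core claim, that simple models trained inside simple structures can succeed where one global simple model fails, can be sketched on XOR-like toy data. Everything below is an illustrative assumption rather than the paper's method: the two subsets are given by hand instead of being found by the structure-identifying algorithm, a one-split decision stump stands in for the logistic regression and decision tree models the paper uses, and new points are routed to the subset with the nearest centroid.

```python
import math

def fit_stump(X, y):
    """Exhaustively pick (feature, threshold, label_above, label_below)
    with the highest training accuracy for binary labels 0/1."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for hi, lo in ((1, 0), (0, 1)):
                correct = sum((hi if x[f] > t else lo) == label
                              for x, label in zip(X, y))
                if best is None or correct > best[0]:
                    best = (correct, f, t, hi, lo)
    return best[1:]

def predict_stump(stump, x):
    f, t, hi, lo = stump
    return hi if x[f] > t else lo

def centroid(X):
    return tuple(sum(x[d] for x in X) / len(X) for d in range(len(X[0])))

def fit_local_ensemble(subsets):
    """subsets: list of (X, y) pairs, one per assumed simple structure."""
    return [(centroid(X), fit_stump(X, y)) for X, y in subsets]

def predict_ensemble(ensemble, x):
    # Route x to the local model whose subset centroid is nearest.
    _, stump = min(ensemble, key=lambda member: math.dist(x, member[0]))
    return predict_stump(stump, x)

def accuracy(predict, X, y):
    return sum(predict(x) == label for x, label in zip(X, y)) / len(X)

# Hand-made subsets: the class depends on x[1], with opposite sign in each.
left = ([(-2, -1), (-2, 1), (-1, -1), (-1, 1)], [0, 1, 0, 1])
right = ([(1, -1), (1, 1), (2, -1), (2, 1)], [1, 0, 1, 0])
X_all, y_all = left[0] + right[0], left[1] + right[1]

global_stump = fit_stump(X_all, y_all)
ensemble = fit_local_ensemble([left, right])

global_acc = accuracy(lambda x: predict_stump(global_stump, x), X_all, y_all)
local_acc = accuracy(lambda x: predict_ensemble(ensemble, x), X_all, y_all)
print("global stump:", global_acc, "local ensemble:", local_acc)
```

On this data no single split does better than chance, while each local stump is a one-line rule on `x[1]` that classifies its own subset perfectly. The interaction between the two features only exists across subsets, not within them.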

Paper Structure

This paper contains 17 sections, 3 theorems, 20 equations, 6 figures, 2 tables, and 2 algorithms.

Key Result

Theorem 1

Let $S_1$ be a simple structure composed of two linearly separable subsets $S_{1}^1$ and $S_{1}^2$. Let $R_{1}(x)$ denote the subset constructed from a starting point $x\in S_1$ using the simple structure identifying algorithm. If Assumption 1 and Assumption 2 hold, then vanilla simple structure ide

Figures (6)

  • Figure 1: Comparison of decision boundaries for data with underlying simple structures: (a) the Logistic Regression decision boundary underfits the data, failing to capture the underlying patterns; (b) the Neural Network decision boundaries fit the data perfectly but are overly complex; (c) an ensemble of Logistic Regression models on the identified simple structures fits the data perfectly while maintaining simple and interpretable decision boundaries.
  • Figure 2: (a) Parent synthetic data with two Gaussian distributions containing multiple classes: $S_1$ comprises subsets $S_{1}^{1}$ (majority class 1) and $S_{1}^{2}$ (majority class 2) linearly separated, while $S_2$ consists of subsets $S_{2}^{1}$ (class 2), $S_{2}^{2}$ (class 3), and $S_{2}^{3}$ (class 1) linearly separated. (b) Identified simple structures using the simple structure identifying algorithm with a violation of Assumption 1, based on a bootstrap sample from (a).
  • Figure 3: Comparison of bootstrap accuracy estimates for a single decision tree on the entire dataset versus an ensemble of decision trees on identified simple structures in synthetic data. Also shown is a comparison of various strategies for handling unassigned instances in the simple structure identifying algorithm. The x-axis denotes the violation of Assumption 1, represented as the percentage overlap of $S_1$.
  • Figure 4: Assessment of the robustness of the simple structure identifying algorithm to violation of Assumption 1. The x-axis shows the degree of violation, expressed as the percentage overlap of $S_1$. Also shown is a comparison of bootstrap estimates of precision and recall for simple structure $S_1$ from synthetic data before and after centroid allocation in the simple structure identifying algorithm.
  • Figure 5: Testing the robustness of the Simple Structure Algorithm and Gaussian Mixture models in identifying underlying simple structures across various data distributions. (a) Scenario with a Single Gaussian distribution containing a single class. (b) Scenario with a Single Gaussian distribution containing multiple classes. (c) Scenario with a Single Exponential distribution containing a single class. (d) Scenario with a Single Exponential distribution containing multiple classes.
  • ...and 1 more figure

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Lemma 1
  • proof
  • Theorem 2
  • proof