Learning Ensembles of Interpretable Simple Structure
Gaurav Arwade, Sigurdur Olafsson
TL;DR
The paper tackles the challenge of achieving both interpretability and accuracy in predictive modeling. It introduces simple structures (localized data subsets in which feature interactions are reduced) and trains a simple model within each subset. It proposes a bottom-up ensemble approach that identifies simple structures using a KNN-based growth process, enhanced with heuristics to handle realistic data violations and a centroid-based refinement to improve recall. Empirical results on synthetic and open-source datasets show that ensembles of simple, interpretable models can match or exceed the performance of complex black-box models while offering transparent decision boundaries. The approach provides a principled framework for decision support in domains where transparency is essential, such as operations research and precision medicine, balancing explainability with predictive power without relying on strong distributional assumptions.
Abstract
Decision-making in complex systems often relies on machine learning models, yet highly accurate models such as XGBoost and neural networks can obscure the reasoning behind their predictions. In operations research applications, understanding how a decision is made is often as crucial as the decision itself. Traditional interpretable models, such as decision trees and logistic regression, provide transparency but may struggle with datasets containing intricate feature interactions. However, complexity in decision-making often stems from interactions that are relevant only within certain subsets of the data. Within these subsets, feature interactions may be simplified, forming simple structures where simple interpretable models can perform effectively. We propose a bottom-up algorithm that partitions data into interpretable subgroups, known as simple structures, in which feature interactions are minimized, allowing a simple model to be trained within each subgroup. We demonstrate the robustness of the algorithm on synthetic data and show that the decision boundaries derived from simple structures are more interpretable and better aligned with domain intuition than those learned from a global model. By improving both explainability and predictive accuracy, our approach provides a principled framework for decision support in applications where model transparency is essential.
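The core idea, that an interaction which defeats a global simple model can vanish inside well-chosen subsets, can be illustrated with a minimal sketch. This is not the paper's algorithm: the subsets below are fixed by hand rather than discovered by the KNN-based growth process, and the XOR-style data, the single-threshold "stump" model, and the helper names `make_xor` and `fit_stump` are all assumptions introduced for illustration.

```python
import random


def make_xor(n=400, seed=0):
    """Synthetic XOR-style data: the label depends on an interaction of x0 and x1."""
    rng = random.Random(seed)
    X = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    y = [int((a > 0) != (b > 0)) for a, b in X]
    return X, y


def fit_stump(X, y):
    """Exhaustively fit a one-feature threshold rule (a hypothetical 'simple model').

    Returns (training accuracy, feature index, threshold, label predicted above threshold).
    """
    best = (0.0, 0, 0.0, 1)
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for above in (0, 1):
                correct = sum(
                    (above if x[f] > t else 1 - above) == yi
                    for x, yi in zip(X, y)
                )
                acc = correct / len(y)
                if acc > best[0]:
                    best = (acc, f, t, above)
    return best


X, y = make_xor()

# A single global stump cannot represent the x0-x1 interaction.
global_acc = fit_stump(X, y)[0]

# Hand-chosen subsets standing in for discovered simple structures (an assumption).
subsets = {
    "x0 <= 0": [i for i, x in enumerate(X) if x[0] <= 0],
    "x0 > 0": [i for i, x in enumerate(X) if x[0] > 0],
}

# Within each subset the label depends on x1 alone, so one stump suffices.
local_accs = {}
for name, idx in subsets.items():
    Xs = [X[i] for i in idx]
    ys = [y[i] for i in idx]
    local_accs[name] = fit_stump(Xs, ys)[0]

print("global stump accuracy:", round(global_acc, 3))
for name, acc in local_accs.items():
    print("local stump accuracy on", name, ":", round(acc, 3))
```

In this toy setting the global stump stays near chance, while each local stump reduces to one readable rule on x1. Accuracies here are measured on the training data only, so the sketch shows representational capacity rather than generalization.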
