Integrating White and Black Box Techniques for Interpretable Machine Learning
Eric M. Vernon, Naoki Masuyama, Yusuke Nojima
TL;DR
The paper tackles the interpretability-accuracy trade-off in machine learning with a three-component ensemble: a white-box base classifier for easy inputs, a black-box deferral classifier for hard inputs, and a white-box grader that routes each input to one of the two. During training, the data are relabeled as 'easy' or 'hard' based on the base classifier’s performance, and SMOTE balances the relabeled dataset before the grader is trained; at inference time, the grader routes new inputs accordingly. Empirical results on multiple OpenML datasets show that the approach can maintain high final accuracy while providing interpretable reasoning for easy cases and a transparent justification for when a more complex model is required. The method offers a practical pathway to deploying high-performing yet interpretable systems in real-world settings, with potential extensions to other white-box/gradient-boosted configurations and user-facing visualization tools.
Abstract
In machine learning algorithm design, there is a trade-off between the interpretability and the performance of an algorithm. In general, algorithms that are simpler and easier for humans to comprehend tend to perform worse than more complex, less transparent algorithms. For example, a random forest classifier is likely to be more accurate than a simple decision tree, but at the expense of interpretability. In this paper, we present an ensemble classifier design that classifies easier inputs using a highly interpretable classifier (i.e., a white-box model) and more difficult inputs using a more powerful but less interpretable classifier (i.e., a black-box model).
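The pipeline summarized above (train the base classifier, relabel the training data as easy or hard, balance the relabeled data, train the grader, then route new inputs) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration rather than the authors' implementation: the base classifier and grader are single-threshold rules, the deferral classifier is a small k-nearest-neighbor model trained on the full training set, the balancing step is a simplified SMOTE-style interpolation rather than a reference SMOTE implementation, and the toy dataset and all function names are hypothetical.

```python
import random

def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def fit_stump(X, y):
    """White-box stand-in: a single-threshold rule found by exhaustive search."""
    labels, best = sorted(set(y)), None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            l_lab = max(labels, key=left.count)
            r_lab = max(labels, key=right.count)
            correct = left.count(l_lab) + right.count(r_lab)
            if best is None or correct > best[0]:
                best = (correct, f, t, l_lab, r_lab)
    _, f, t, l_lab, r_lab = best
    return lambda row: l_lab if row[f] <= t else r_lab

def fit_knn(X, y, k=3):
    """Black-box stand-in: majority vote among the k nearest training points."""
    def predict(row):
        nearest = sorted(zip(X, y), key=lambda p: dist2(p[0], row))[:k]
        votes = [lab for _, lab in nearest]
        return max(sorted(set(votes)), key=votes.count)
    return predict

def smote_like(X, y, minority, rng, k=3):
    """Simplified SMOTE-style oversampling: add points interpolated between
    a minority sample and one of its k nearest minority neighbours."""
    pts = [row for row, lab in zip(X, y) if lab == minority]
    X_new, y_new = list(X), list(y)
    if len(pts) < 2:
        return X_new, y_new
    for _ in range(max(len(y) - 2 * len(pts), 0)):  # majority minus minority
        a = rng.choice(pts)
        neighbours = sorted((p for p in pts if p is not a),
                            key=lambda p: dist2(p, a))[:k]
        b, gap = rng.choice(neighbours), rng.random()
        X_new.append([ai + gap * (bi - ai) for ai, bi in zip(a, b)])
        y_new.append(minority)
    return X_new, y_new

def train_ensemble(X, y, rng):
    base = fit_stump(X, y)
    # Relabel: "easy" if the base classifier is correct, otherwise "hard".
    difficulty = ["easy" if base(row) == yi else "hard" for row, yi in zip(X, y)]
    minority = min(("easy", "hard"), key=difficulty.count)
    Xg, yg = smote_like(X, difficulty, minority, rng)
    grader = fit_stump(Xg, yg)
    deferral = fit_knn(X, y)  # assumption: trained on the full training set
    return base, grader, deferral

def predict(model, row):
    base, grader, deferral = model
    if grader(row) == "easy":
        return base(row), "base"
    return deferral(row), "deferral"

# Hypothetical toy data: the x0 rule is flipped wherever x1 > 0.7,
# so a single threshold cannot fit every point.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [int((x0 > 0.5) != (x1 > 0.7)) for x0, x1 in X]

model = train_ensemble(X, y, rng)
results = [predict(model, row) for row in X]
# Training-set accuracy only, to show the routing effect (not a benchmark).
base_acc = sum(model[0](row) == yi for row, yi in zip(X, y)) / len(y)
ensemble_acc = sum(pred == yi for (pred, _), yi in zip(results, y)) / len(y)
print("base:", base_acc, "ensemble:", ensemble_acc)
print("deferred:", sum(route == "deferral" for _, route in results))
```

Because the grader in this sketch is itself a one-threshold rule, the decision to defer an input can be read off directly as a condition on a single feature, which is the kind of transparent justification the design aims for. In practice one would likely substitute a decision tree for the white-box components and a stronger model for the deferral classifier.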
