Learning Interpretable Models Using Uncertainty Oracles

Abhishek Ghose, Balaraman Ravindran

TL;DR

The paper addresses the loss of accuracy that comes with keeping interpretable models small, by learning a training distribution that favors compact models. It encodes this distribution as an Infinite Beta Mixture Model, built on a Dirichlet Process (DP), over a 1D projection of the data produced by an uncertainty oracle, and tunes the DP parameters with Bayesian Optimization to maximize held-out accuracy. The approach is model-agnostic, accommodates non-differentiable losses, supports multi-component definitions of model size, and allows the oracle and the interpretable model to use different feature spaces. Results are reported for Decision Trees, Linear Probability models and Gradient Boosted Models, with improvements over baselines of $\sim 100\%$ in some cases.

Abstract

A desirable property of interpretable models is small size, so that they are easily understandable by humans. This leads to the following challenges: (a) small sizes typically imply diminished accuracy, and (b) bespoke levers provided by model families to restrict size, e.g., L1 regularization, might be insufficient to reach the desired size-accuracy trade-off. We address these challenges here. Earlier work has shown that learning the training distribution creates accurate small models. Our contribution is a new technique that exploits this idea. The training distribution is encoded as a Dirichlet Process to allow for a flexible number of modes that is learnable from the data. Its parameters are learned using Bayesian Optimization, a design choice that makes the technique applicable to non-differentiable loss functions. To avoid the challenges of high dimensionality, the data is first projected down to one dimension using the uncertainty scores of a separate probabilistic model, which we refer to as the uncertainty oracle. We show that this technique addresses the above challenges: (a) it arrests the reduction in accuracy that comes from shrinking a model (in some cases we observe $\sim 100\%$ improvement over baselines), and (b) it may be applied with no change across model families with different notions of size; results are shown for Decision Trees, Linear Probability models and Gradient Boosted Models. Additionally, we show that it (1) is more accurate than its predecessor, (2) requires only one hyperparameter to be set in practice, (3) accommodates a multi-variate notion of model size, e.g., both maximum depth of a tree and number of trees in Gradient Boosted Models, and (4) works across different feature spaces between the uncertainty oracle and the interpretable model, e.g., a GRU might act as an oracle for a decision tree that ingests n-grams.
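
As a rough illustration of how a Dirichlet Process can give a flexible number of modes over the 1D uncertainty scores, the sketch below builds a truncated mixture of Beta densities with stick-breaking weights. The stick-breaking construction, the truncation level `n_components`, the concentration `alpha`, and the randomly drawn Beta shapes are all illustrative assumptions, not details taken from the paper, which learns its parameters with Bayesian Optimization.

```python
import math
import random


def stick_breaking_weights(alpha, n_components, rng):
    """Truncated stick-breaking: pi_k = v_k * prod_{j<k}(1 - v_j), v_k ~ Beta(1, alpha)."""
    weights, remaining = [], 1.0
    for _ in range(n_components - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # the last component absorbs the leftover stick
    return weights


def beta_pdf(u, a, b):
    """Density of Beta(a, b) at u, for 0 < u < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(u) + (b - 1) * math.log(1 - u))


def mixture_pdf(u, weights, shapes):
    """Density of the Beta mixture at an uncertainty score u."""
    return sum(w * beta_pdf(u, a, b) for w, (a, b) in zip(weights, shapes))


rng = random.Random(0)
# alpha and the shape ranges are arbitrary choices for this illustration.
weights = stick_breaking_weights(alpha=1.0, n_components=5, rng=rng)
shapes = [(rng.uniform(1.0, 10.0), rng.uniform(1.0, 10.0)) for _ in weights]

# Evaluate the mixture on a coarse grid of uncertainty scores in (0, 1).
density = [mixture_pdf(u / 10, weights, shapes) for u in range(1, 10)]
```

A smaller `alpha` tends to concentrate the weight on the first few components, while a larger `alpha` spreads it over more of them, which is how the number of effective modes can vary in this kind of construction.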

Paper Structure

This paper contains 23 sections, 5 equations, 11 figures, 5 tables, 4 algorithms.

Figures (11)

  • Figure 1: Application of our technique is shown on the toy dataset in (a). Learning a DT constrained to a depth of $5$ using the CART algorithm produces the regions shown in (b). Additionally learning the training distribution using our technique produces the regions in (c). For both (b) and (c) the F1-macro scores on a held-out set are reported.
  • Figure 2: Overview of our technique. Left: Training instances are characterized by their proximity to class boundaries. As a proxy for this quantity, we use the prediction uncertainty scores of a probabilistic oracle (these may also be seen as a 1D projection): higher uncertainty indicates proximity to a boundary. These scores are calculated once. Right: The size-constrained model is learned iteratively. A sampling distribution over the uncertainty values, parameterized by $\Phi$ (shown in Step 1), is used to sample training instances (as in Step 2), which are used to train a size-constrained model (shown in Step 3). Its accuracy on a held-out set (Step 4) is used to modify $\Phi$. This loop, Steps 1-4, is executed by a BayesOpt algorithm.
  • Figure 3: Improvements in test F1-macro for the dataset senseit-aco for different sizes of GBM models. Here, model size is determined by both max_depth per tree and number of trees. Greater improvements are seen at lower sizes.
  • Figure 4: Improvements $\delta F1$ are shown for different depths of the DT.
  • Figure 5: Visualizations of different uncertainty metrics. (a) shows a 4-label dataset on which linear SVM is learned. (b), (c), (d) visualize uncertainty scores based on different metrics, as per the linear SVM, where darker shades imply higher scores.
  • ...and 6 more figures
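
The loop described in the Figure 2 caption (Steps 1-4) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: a single Beta density stands in for the Dirichlet Process mixture, plain random search stands in for BayesOpt, a hypothetical k-NN vote plays the role of the probabilistic oracle, and a depth-1 decision stump is the size-constrained model. All function names and the toy dataset are invented for this example.

```python
import math
import random


def beta_pdf(u, a, b):
    """Density of Beta(a, b) at u, with u clipped away from 0 and 1."""
    u = min(max(u, 1e-6), 1 - 1e-6)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(u) + (b - 1) * math.log(1 - u))


def knn_oracle_uncertainty(X, y, k=7):
    """Hypothetical oracle: k-NN estimate p of class 1, mapped to u = 1 - |2p - 1|."""
    scores = []
    for xi in X:
        nearest = sorted(range(len(X)), key=lambda j: math.dist(xi, X[j]))[:k]
        p = sum(y[j] for j in nearest) / k
        scores.append(1 - abs(2 * p - 1))
    return scores


def predict(stump, x):
    f, t, left = stump
    return left if x[f] <= t else 1 - left


def accuracy(stump, X, y):
    return sum(predict(stump, x) == yi for x, yi in zip(X, y)) / len(y)


def fit_stump(X, y):
    """Size-constrained model: exhaustive search over depth-1 splits."""
    best_acc, best = -1.0, None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for left in (0, 1):
                acc = accuracy((f, t, left), X, y)
                if acc > best_acc:
                    best_acc, best = acc, (f, t, left)
    return best


def learn_sampling_params(X_tr, y_tr, X_val, y_val, n_trials=20, n_samples=150, seed=0):
    rng = random.Random(seed)
    u = knn_oracle_uncertainty(X_tr, y_tr)  # uncertainty scores, computed once
    best_score, best_params = -1.0, None
    for _ in range(n_trials):
        # Step 1: propose sampling parameters (random search, not BayesOpt).
        a, b = rng.uniform(0.5, 5.0), rng.uniform(0.5, 5.0)
        weights = [beta_pdf(ui, a, b) for ui in u]
        # Step 2: sample training instances according to their uncertainty.
        idx = rng.choices(range(len(X_tr)), weights=weights, k=n_samples)
        # Step 3: train the size-constrained model on the sample.
        stump = fit_stump([X_tr[i] for i in idx], [y_tr[i] for i in idx])
        # Step 4: score on held-out data and keep the best parameters.
        score = accuracy(stump, X_val, y_val)
        if score > best_score:
            best_score, best_params = score, (a, b)
    return best_params, best_score


def make_toy(n, rng):
    """Toy 2D data with a linear class boundary (illustrative only)."""
    X = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    y = [int(x0 + 0.3 * x1 > 0) for x0, x1 in X]
    return X, y


data_rng = random.Random(42)
X_tr, y_tr = make_toy(200, data_rng)
X_val, y_val = make_toy(100, data_rng)
params, score = learn_sampling_params(X_tr, y_tr, X_val, y_val)
print("best (a, b):", params, "held-out accuracy:", score)
```

Because only the held-out score is fed back to the search, nothing in the loop requires the model or the metric to be differentiable, which is the property the caption's use of BayesOpt relies on.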