Table of Contents
Fetching ...

Unique Rashomon Sets for Robust Active Learning

Simon Nguyen, Kentaro Hoffman, Tyler McCormick

TL;DR

This work tackles the data-labeling bottleneck in active learning by restricting ensemble diversity to the Rashomon set of near-optimal models, thereby distinguishing genuine uncertainty from noise. It introduces UNREAL, which constructs a unique-pattern Rashomon ensemble from decision trees via TreeFarms and grouping, improving uncertainty signals for query selection. Empirical results across five datasets show up to 20% accuracy gains and improved robustness to label noise, with a concise, interpretable ensemble. The study also discusses selecting the Rashomon threshold $\epsilon$ and outlines avenues for dynamic adjustment and broader Rashomon-based frameworks in future work.

Abstract

Collecting labeled data for machine learning models is often expensive and time-consuming. Active learning addresses this challenge by selectively labeling the most informative observations, but when initial labeled data is limited, it becomes difficult to distinguish genuinely informative points from those appearing uncertain primarily due to noise. Ensemble methods like random forests are a powerful approach to quantifying this uncertainty but do so by aggregating all models indiscriminately. This includes poor performing models and redundant models, a problem that worsens in the presence of noisy data. We introduce UNique Rashomon Ensembled Active Learning (UNREAL), which selectively ensembles only distinct models from the Rashomon set, which is the set of nearly optimal models. Restricting ensemble membership to high-performing models with different explanations helps distinguish genuine uncertainty from noise-induced variation. We show that UNREAL achieves faster theoretical convergence rates than traditional active learning approaches and demonstrates empirical improvements of up to 20% in predictive accuracy across five benchmark datasets, while simultaneously enhancing model interpretability.

Unique Rashomon Sets for Robust Active Learning

TL;DR

This work tackles the data-labeling bottleneck in active learning by restricting ensemble diversity to the Rashomon set of near-optimal models, thereby distinguishing genuine uncertainty from noise. It introduces UNREAL, which constructs a unique-pattern Rashomon ensemble from decision trees via TreeFarms and grouping, improving uncertainty signals for query selection. Empirical results across five datasets show up to 20% accuracy gains and improved robustness to label noise, with a concise, interpretable ensemble. The study also discusses selecting the Rashomon threshold and outlines avenues for dynamic adjustment and broader Rashomon-based frameworks in future work.

Abstract

Collecting labeled data for machine learning models is often expensive and time-consuming. Active learning addresses this challenge by selectively labeling the most informative observations, but when initial labeled data is limited, it becomes difficult to distinguish genuinely informative points from those appearing uncertain primarily due to noise. Ensemble methods like random forests are a powerful approach to quantifying this uncertainty but do so by aggregating all models indiscriminately. This includes poor performing models and redundant models, a problem that worsens in the presence of noisy data. We introduce UNique Rashomon Ensembled Active Learning (UNREAL), which selectively ensembles only distinct models from the Rashomon set, which is the set of nearly optimal models. Restricting ensemble membership to high-performing models with different explanations helps distinguish genuine uncertainty from noise-induced variation. We show that UNREAL achieves faster theoretical convergence rates than traditional active learning approaches and demonstrates empirical improvements of up to 20% in predictive accuracy across five benchmark datasets, while simultaneously enhancing model interpretability.

Paper Structure

This paper contains 22 sections, 2 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: A depiction of the redundant and unique ensembling of Rashomon trees from the COMPAS dataset. To visualize our methods, the figure plots the misclassification errors by the ordered indices of the tree (though the unique selection of trees was grouped by classification pattern in our simulations). As shown, many trees have the same misclassification rate, indicating these trees share the same classification pattern. The geometry of the trees can be seen in Figure \ref{['fig:TreeGeometry']} of the appendix.
  • Figure 2: This plot graphs the ensemble test classification error against the Rashomon threshold. In this case, the optimal threshold is $0.016$.
  • Figure 3: Performance of the four active learning procedures (left) on our five benchmark datasets. The plot presents the errors relative to random forests.
  • Figure 4: The number of total vs. unique trees on a logarithmic scale.
  • Figure 5: Geometry of the decision trees from Figure \ref{['fig:ErrorbyTreeIndex_Grouped']}. Note that these near-optimal trees, although sharing similar similar accuracy, differ in specific regions of the feature space, namely how to split the feature prior. In the UNREAL algorithm, these trees would lead to a total of 4 members in the committee consisting of the four unique classification patterns. In DUREAL, the committee would be composed of all nine trees.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Definition 2.1: Rashomon set