"A 6 or a 9?": Ensemble Learning Through the Multiplicity of Performant Models and Explanations

Gianlucca Zuin, Adriano Veloso

TL;DR

This work tackles model selection under distribution shift by leveraging the Rashomon Effect to form Rashomon Ensembles that sample diverse, high-performing explanations from the Rashomon Set $R(H,\epsilon)$. By clustering models on SHAP-based explanations, perturbing test data, and selecting representative constituents, the approach builds robust ensembles whose production risk is quantifiable via the Jensen–Shannon distance. Across open benchmarks and four real-world collaborations, the method yields improved AUROC under drift and provides actionable insights into feature-subspace structure, with notable business impact in steel manufacturing, healthcare, and energy domains. The study also discusses data drift detection via ensemble disagreement, the importance of diverse explanations for trust, and practical deployment considerations, including expert involvement and patentable aspects. The Rashomon Ratio and intra-cluster stability emerge as central factors governing ensemble gains and reliability in production environments.
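For readers unfamiliar with the Jensen–Shannon distance mentioned above, the following is a minimal sketch of the quantity itself: the square root of the Jensen–Shannon divergence between two discrete distributions. The summary does not state which distributions the authors compare, so the function name `js_distance` and the histogram inputs are illustrative assumptions, not the paper's implementation.

```python
import math


def js_distance(p, q, base=2.0):
    """Jensen-Shannon distance between two discrete distributions.

    p and q are sequences of non-negative weights over the same bins
    (hypothetical inputs, e.g. histograms of model scores). They are
    normalised to sum to 1 before comparison.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("each distribution needs positive total mass")

    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; bins where a is 0 contribute nothing.
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))
```

With base 2 the distance lies between 0 (identical distributions) and 1 (disjoint support), which makes it convenient as a bounded score. SciPy's `scipy.spatial.distance.jensenshannon` computes the same quantity and is likely the better choice in practice.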

Abstract

Creating models from past observations and ensuring their effectiveness on new data is the essence of machine learning. However, selecting models that generalize well remains a challenging task. Related to this topic, the Rashomon Effect refers to cases where multiple models perform similarly well for a given learning problem. This often occurs in real-world scenarios, like the manufacturing process or medical diagnosis, where diverse patterns in data lead to multiple high-performing solutions. We propose the Rashomon Ensemble, a method that strategically selects models from these diverse high-performing solutions to improve generalization. By grouping models based on both their performance and explanations, we construct ensembles that maximize diversity while maintaining predictive accuracy. This selection ensures that each model covers a distinct region of the solution space, making the ensemble more robust to distribution shifts and variations in unseen data. We validate our approach on both open and proprietary collaborative real-world datasets, demonstrating up to 0.20+ AUROC improvements in scenarios where the Rashomon ratio is large. Additionally, we demonstrate tangible benefits for businesses in various real-world applications, highlighting the robustness, practicality, and effectiveness of our approach.
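The selection idea described in the abstract can be sketched roughly as follows, under several assumptions that are not taken from the paper: the Rashomon set is approximated as all candidates whose loss is within `epsilon` of the best one, each model's explanation is a plain feature-attribution vector (standing in for SHAP values), and grouping uses a simple greedy distance threshold rather than the authors' clustering procedure. All names (`rashomon_set`, `group_by_explanation`, `radius`, the candidate dictionaries) are hypothetical.

```python
import math


def rashomon_set(candidates, epsilon):
    """Keep candidates whose loss is within epsilon of the best loss."""
    best = min(c["loss"] for c in candidates)
    return [c for c in candidates if c["loss"] <= best + epsilon]


def group_by_explanation(models, radius):
    """Greedy grouping: a model joins the first cluster whose leader has an
    explanation vector within `radius` (Euclidean); otherwise it starts a
    new cluster. Models are visited from lowest to highest loss, so each
    cluster's first element is its best performer."""
    clusters = []
    for model in sorted(models, key=lambda c: c["loss"]):
        for cluster in clusters:
            leader = cluster[0]
            if math.dist(model["explanation"], leader["explanation"]) <= radius:
                cluster.append(model)
                break
        else:
            clusters.append([model])
    return clusters


def select_representatives(clusters):
    """Pick one constituent per cluster (here: its lowest-loss member)."""
    return [cluster[0] for cluster in clusters]


# Toy candidates: validation loss plus a made-up attribution vector.
candidates = [
    {"name": "a", "loss": 0.10, "explanation": [0.90, 0.10, 0.00]},
    {"name": "b", "loss": 0.11, "explanation": [0.85, 0.15, 0.00]},
    {"name": "c", "loss": 0.12, "explanation": [0.10, 0.10, 0.80]},
    {"name": "d", "loss": 0.30, "explanation": [0.00, 1.00, 0.00]},
]

performant = rashomon_set(candidates, epsilon=0.05)
clusters = group_by_explanation(performant, radius=0.3)
representatives = select_representatives(clusters)
print([m["name"] for m in representatives])
```

In this toy run, model `d` is excluded for poor performance, `a` and `b` fall into one group because they rely on the same feature, and `c` forms its own group. The resulting ensemble therefore contains one model per distinct explanation, which is the diversity-while-performant property the abstract describes.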

Paper Structure

This paper contains 18 sections, 12 equations, 21 figures, 2 tables, 1 algorithm.

Figures (21)

  • Figure 1: t-SNE reduction of the Rashomon space and optimal silhouette for each subgroup.
  • Figure 2: Similarity to a reference model found from running the Rashomon and baseline pipelines, filtering models with statistically worse performance than the proposed threshold. MAGIC dataset.
  • Figure 3: Dependence between relevant base models in the MAGIC Rashomon ensemble.
  • Figure 4: t-SNE visualization of the sampled Rashomon space for models trained on the steel plate defects problem. Each point represents a model. Models are positioned according to the defect explanations they assign to each steel plate, so that models with similar SHAP values appear close together. The color indicates the cluster to which the model was assigned.
  • Figure 5: Comparison of different algorithms to our approach in the steel manufacturing defects problem. Even when employing the clusteroid ensemble, in which most constituents are underperforming, our approach exceeds other state-of-the-art results.
  • ...and 16 more figures