Minimal Data, Maximum Clarity: A Heuristic for Explaining Optimization

Amirali Rayegan, Tim Menzies

TL;DR

This paper introduces EZR, a modular framework for multi-objective optimization that unifies active sampling, learning, and explanation in a single lightweight pipeline. Its clear, cohort-based rationales surpass standard attribution-based explainable AI methods in clarity and utility.

Abstract

Efficient, interpretable optimization is a critical but underexplored challenge in software engineering, where practitioners routinely face vast configuration spaces and costly, error-prone labeling processes. This paper introduces EZR, a novel and modular framework for multi-objective optimization that unifies active sampling, learning, and explanation within a single, lightweight pipeline. Departing from conventional wisdom, our Maximum Clarity Heuristic demonstrates that using less (but more informative) data can yield optimization models that are both effective and deeply understandable. EZR employs an active learning strategy based on Naive Bayes sampling to efficiently identify high-quality configurations with a fraction of the labels required by fully supervised approaches. It then distills optimization logic into concise decision trees, offering transparent, actionable explanations for both global and local decision-making. Extensive experiments across 60 real-world datasets establish that EZR reliably achieves over 90% of the best-known optimization performance in most cases, while providing clear, cohort-based rationales that surpass standard attribution-based explainable AI (XAI) methods (LIME, SHAP, BreakDown) in clarity and utility. These results endorse "less but better"; it is both possible and often preferable to use fewer (but more informative) examples to generate label-efficient optimization and explanations in software systems. To support transparency and reproducibility, all code and experimental materials are publicly available at https://github.com/amiiralii/Minimal-Data-Maximum-Clarity.
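
The pipeline summarized above (label a few rows, let a Naive Bayes model choose the next row to label, then summarize the labeled rows with a small tree) can be illustrated with a minimal sketch. This is not the authors' implementation; see the repository linked above for that. The Gaussian likelihoods, the square-root best/rest split, the variance-reducing tree, the toy objective, and all function names here are assumptions made for illustration only.

```python
import math
import random

def fit(group):
    """Per-feature (mean, variance) of a list of rows; the Naive Bayes 'model'."""
    stats = []
    for j in range(len(group[0])):
        vals = [row[j] for row in group]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals) + 1e-6  # guard against zero variance
        stats.append((mu, var))
    return stats

def loglike(row, stats):
    """Log-likelihood of a row, treating features as independent Gaussians."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x, (mu, var) in zip(row, stats))

def active_sample(pool, label, budget=30, warm_start=4, seed=1):
    """Label `budget` rows; each step picks the row that looks most like 'best', least like 'rest'.
    Assumes warm_start >= 4 so that both groups are non-empty."""
    rng = random.Random(seed)
    todo = pool[:]
    rng.shuffle(todo)
    done = [(row, label(row)) for row in todo[:warm_start]]
    todo = todo[warm_start:]
    while len(done) < budget and todo:
        done.sort(key=lambda pair: pair[1])            # smaller score = better
        cut = max(1, int(math.sqrt(len(done))))        # assumed best/rest split
        best = fit([row for row, _ in done[:cut]])
        rest = fit([row for row, _ in done[cut:]])
        pick = max(todo, key=lambda r: loglike(r, best) - loglike(r, rest))
        todo.remove(pick)
        done.append((pick, label(pick)))               # the only place labels are spent
    return sorted(done, key=lambda pair: pair[1])

def sse(ys):
    """Sum of squared errors around the mean."""
    mu = sum(ys) / len(ys)
    return sum((y - mu) ** 2 for y in ys)

def grow(labeled, depth=2, min_leaf=3):
    """Tiny regression tree over the labeled rows; each leaf is one cohort."""
    ys = [y for _, y in labeled]
    leaf = {"n": len(ys), "mean_y": sum(ys) / len(ys)}
    if depth == 0:
        return leaf
    best = None
    for j in range(len(labeled[0][0])):
        for t in sorted({row[j] for row, _ in labeled}):
            lo = [(r, y) for r, y in labeled if r[j] <= t]
            hi = [(r, y) for r, y in labeled if r[j] > t]
            if len(lo) >= min_leaf and len(hi) >= min_leaf:
                cost = sse([y for _, y in lo]) + sse([y for _, y in hi])
                if best is None or cost < best[0]:
                    best = (cost, j, t, lo, hi)
    if best is None:
        return leaf
    _, j, t, lo, hi = best
    return {"feature": j, "threshold": t,
            "lo": grow(lo, depth - 1, min_leaf), "hi": grow(hi, depth - 1, min_leaf)}

# Toy data (hypothetical): 500 configurations with 3 features; feature 2 is irrelevant.
rng = random.Random(0)
pool = [tuple(rng.random() for _ in range(3)) for _ in range(500)]
objective = lambda r: (r[0] - 0.2) ** 2 + (r[1] - 0.8) ** 2   # to be minimized

labeled = active_sample(pool, objective, budget=30)
model = grow(labeled)
print(model)
```

In this sketch only 30 of the 500 rows are ever labeled, and the printed tree describes cohorts of those rows by feature thresholds rather than assigning per-feature attribution scores.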

Paper Structure

This paper contains 30 sections, 7 equations, 12 figures, 10 tables, 2 algorithms.

Figures (12)

  • Figure 1: EZR generates a very small tree summarizing the major points of the data. Leaf labels denote cohorts. For a larger example, see Figure \ref{fig:ezr-tree-output}.
  • Figure 2: EZR pipeline
  • Figure 3: Across 60 datasets, the number of times each method achieves statistically better performance using 50 labels.
  • Figure 4: Across 60 datasets, the number of times each method achieves statistically better performance using 200 labels.
  • Figure 5: COC1000 Feature importance via permutation
  • ...and 7 more figures