Backward Compatibility in Attributive Explanation and Enhanced Model Training Method

Ryuta Matsuno

TL;DR

This work introduces BCX, a quantitative metric for the backward compatibility of attribution explanations across pre- and post-update models, and BCXR, a differentiable retraining framework that aims to maximize BCX while preserving predictive performance. BCX is computed on samples that both models predict correctly and uses top-k explanation agreement metrics, with SHAP as the primary explanation method. To enable optimization, the authors derive differentiable surrogate losses that bound the non-differentiable agreement measures, and they present a universal BCXR variant based on the Euclidean (L2) distance between explanations. Empirical results across eight LIBSVM-derived datasets show that BCXR, especially the Norm-based variant, achieves favorable trade-offs between BCX scores and predictive loss, often outperforming existing BTC-aware retraining methods and showing improved explanation stability in high-dimensional settings.
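The Norm-based idea can be sketched for a toy linear regressor. This is an illustration under stated assumptions, not the paper's objective: the additive form `loss + lam * penalty`, the function names, and the omission of a bias term are all hypothetical, and the penalty here is averaged over all samples for simplicity. The attribution `w_j * (x_j - mean_j)` is the standard closed form for a linear model with independent features, used in place of a general SHAP explainer.

```python
import numpy as np

def linear_attributions(w, X, baseline):
    """Per-feature attributions w_j * (x_j - baseline_j) of a linear model."""
    return (X - baseline) * w

def norm_penalized_loss(w_new, w_old, X, y, lam):
    """MSE of the new model plus lam times the mean squared L2 distance
    between old and new attribution vectors (assumed objective form)."""
    baseline = X.mean(axis=0)
    mse = np.mean((X @ w_new - y) ** 2)
    diff = (linear_attributions(w_new, X, baseline)
            - linear_attributions(w_old, X, baseline))
    penalty = np.mean(np.sum(diff ** 2, axis=1))
    return float(mse + lam * penalty)

# Made-up toy data: the new weights fit y exactly but change the explanation.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = X @ np.array([1.0, 1.0])
w_old = np.array([1.0, 0.0])
w_new = np.array([1.0, 1.0])
print(norm_penalized_loss(w_new, w_old, X, y, lam=0.5))
```

Because every term is differentiable in `w_new`, an objective of this shape can be minimized with gradient-based training, which is the property the surrogate losses are designed to provide.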

Abstract

Model update is a crucial process in the operation of ML/AI systems. While updating a model generally enhances the average prediction performance, it also significantly impacts the explanations of predictions. In real-world applications, even minor changes in explanations can have detrimental consequences. To tackle this issue, this paper introduces BCX, a quantitative metric that evaluates the backward compatibility of feature attribution explanations between pre- and post-update models. BCX uses practical agreement metrics to calculate the average agreement between the explanations of the pre- and post-update models, restricted to samples that both models predict correctly. In addition, we propose BCXR, a BCX-aware model training method built on surrogate losses that theoretically lower-bound the agreement scores. Furthermore, we present a universal variant of BCXR that improves all agreement metrics by using the L2 distance between the explanations of the two models. To validate our approach, we conducted experiments on eight real-world datasets, demonstrating that BCXR achieves superior trade-offs between predictive performance and BCX scores.
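The BCX computation described above can be sketched as follows for a classification task. This is a minimal sketch, not the paper's reference implementation: the function names are hypothetical, top-k feature agreement stands in for the family of agreement metrics, features are ranked by absolute attribution, and the attribution arrays are assumed to be precomputed (for example by a SHAP explainer).

```python
import numpy as np

def topk_features(e, k):
    """Indices of the k features with the largest absolute attribution."""
    return set(np.argsort(-np.abs(e), kind="stable")[:k].tolist())

def feature_agreement(e_old, e_new, k):
    """Fraction of top-k features shared by two attribution vectors."""
    return len(topk_features(e_old, k) & topk_features(e_new, k)) / k

def bcx(y, pred_old, pred_new, expl_old, expl_new, k):
    """Mean top-k agreement over samples both models predict correctly."""
    both_correct = (pred_old == y) & (pred_new == y)
    idx = np.flatnonzero(both_correct)
    if idx.size == 0:
        return float("nan")  # undefined when no sample qualifies
    scores = [feature_agreement(expl_old[i], expl_new[i], k) for i in idx]
    return float(np.mean(scores))

# Made-up toy values: 4 samples, 3 features; sample 2 is excluded because
# the old model misclassifies it.
y = np.array([1, 0, 1, 1])
pred_old = np.array([1, 0, 0, 1])
pred_new = np.array([1, 0, 1, 1])
expl_old = np.array([[0.9, -0.5, 0.1], [0.2, 0.7, -0.6],
                     [0.3, 0.3, 0.3], [0.1, 0.8, 0.3]])
expl_new = np.array([[0.8, -0.4, 0.2], [0.9, 0.6, -0.1],
                     [0.1, 0.2, 0.3], [0.5, 0.1, 0.9]])
score = bcx(y, pred_old, pred_new, expl_old, expl_new, k=2)
```

In this toy case the three qualifying samples have agreements 1.0, 0.5, and 0.5, so `score` is 2/3. The set intersection is not differentiable in the model parameters, which is why a retraining method needs surrogate losses rather than optimizing this quantity directly.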

Paper Structure (25 sections, 5 theorems, 34 equations, 4 figures, 1 table)

Preliminary
Notation
Related works
Backward compatibility in ML
Explanation methods in ML
Disagreement measures of attributive explanations
Proposed method
Backward compatibility in explanations
BCX-aware retraining
Universal BCXR
Experiments
Data set
Setting
Results
Results for regression tasks.
...and 10 more sections

Key Result

Lemma 1

The following inequality holds for any $\bm{e}_1$, $\bm{e}_2$, and $k$.

Figures (4)

Figure 1: Trade-off for regression data sets. The horizontal axes represent MSE (the lower the better, $\leftarrow$) and the vertical axes represent BTC and BCX with different agreement metrics (the higher the better, $\uparrow$). In general, points in the upper left region of each panel indicate better results than points in the lower right region. The grey dashed vertical lines indicate the MSE achieved by the old models. The pink dashed vertical and horizontal lines represent the MSE and backward compatibility scores achieved by ERM. Retraining methods that take backward compatibility into account are expected to achieve a better MSE than the old models (left of the grey dashed lines) and better compatibility than ERM (above the pink horizontal lines). Since this is a multi-objective optimization problem, the results on the Pareto fronts are the ones considered effective trade-offs between MSE and backward compatibility scores.
Figure 2: Trade-off for classification data sets. The panels are read in the same way as in Figure 1.
Figure B.1: Sensitivity plot of $\lambda$ for regression data sets.
Figure B.2: Sensitivity plot of $\lambda$ for classification data sets.
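The Pareto-front reading used in the Figure 1 caption can be made concrete with a short sketch. The function name and the `(mse, score)` pairs below are made up for illustration and are not results from the paper; a point is kept when no other point has both lower-or-equal MSE and higher-or-equal compatibility with at least one strict improvement.

```python
def pareto_front(points):
    """Return the (mse, score) pairs not dominated by any other pair,
    where lower mse and higher score are both preferred."""
    front = []
    for i, (mse_i, bc_i) in enumerate(points):
        dominated = any(
            mse_j <= mse_i and bc_j >= bc_i and (mse_j < mse_i or bc_j > bc_i)
            for j, (mse_j, bc_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((mse_i, bc_i))
    return front

# Hypothetical (MSE, compatibility score) pairs for five retraining runs.
runs = [(0.10, 0.60), (0.12, 0.80), (0.15, 0.70), (0.20, 0.90), (0.11, 0.55)]
front = pareto_front(runs)
```

Here `(0.15, 0.70)` is dominated by `(0.12, 0.80)` and `(0.11, 0.55)` by `(0.10, 0.60)`, so three runs remain on the front. The pairwise check is quadratic in the number of points, which is adequate for the handful of settings in a sensitivity sweep.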

Theorems & Definitions (7)

Definition 1: Backward Compatibility in eXplanations
Lemma 1
Definition 2: Feature-agreement-based BCX-aware Retraining (BCXR-Ftr)
Lemma 2
Lemma 3
Lemma 4
Lemma 5
