Backward Compatibility in Attributive Explanation and Enhanced Model Training Method
Ryuta Matsuno
TL;DR
This work introduces BCX, a quantitative metric for the backward compatibility of attribution explanations across pre- and post-update models, and BCXR, a differentiable retraining framework that aims to maximize BCX while preserving predictive performance. BCX is computed only on samples that both models predict correctly and uses top-k explanation agreement metrics, with SHAP as the primary explanation method. To enable optimization, the authors derive differentiable surrogate losses that bound the non-differentiable agreement measures, and they present a universal BCXR variant based on the Euclidean (L2) norm distance between explanations. Empirical results across eight LIBSVM-derived datasets show that BCXR, especially the Norm-based variant, achieves favorable trade-offs between BCX scores and predictive loss, often outperforming existing BTC-aware retraining methods and showing improved explanation stability in high-dimensional settings.
Abstract
Updating models is a crucial process in the operation of ML/AI systems. While updating a model generally improves average prediction performance, it can also significantly change the explanations of predictions. In real-world applications, even minor changes in explanations can have detrimental consequences. To tackle this issue, this paper introduces BCX, a quantitative metric that evaluates the backward compatibility of feature attribution explanations between pre- and post-update models. BCX uses practical agreement metrics to calculate the average agreement between the explanations of the pre- and post-update models, restricted to samples that both models predict correctly. In addition, we propose BCXR, a BCX-aware model training method built on surrogate losses that theoretically lower-bound the agreement scores. Furthermore, we present a universal variant of BCXR that improves all agreement metrics by using the L2 distance between the explanations of the two models. To validate our approach, we conducted experiments on eight real-world datasets, demonstrating that BCXR achieves superior trade-offs between predictive performance and BCX scores.
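As a rough illustration of the metric described above, the following minimal sketch computes a BCX-style score from precomputed attribution matrices. It is not the paper's implementation: the function names (`top_k_features`, `feature_agreement`, `bcx_score`, `l2_explanation_penalty`) are hypothetical, top-k feature agreement is assumed as one example of a "practical agreement metric", features are assumed to be ranked by absolute attribution, and the attributions (for example SHAP values) are assumed to have been computed elsewhere. The last function is only a plain, non-squared L2 distance between explanations, shown as an assumption about what a Norm-based penalty might look like.

```python
import numpy as np


def top_k_features(attributions, k):
    """Return, per sample, the indices of the k features with the largest
    absolute attribution (assumption: importance = |attribution|)."""
    order = np.argsort(-np.abs(attributions), axis=1, kind="stable")
    return order[:, :k]


def feature_agreement(attr_old, attr_new, k):
    """Per-sample fraction of top-k features shared by both explanations."""
    top_old = top_k_features(attr_old, k)
    top_new = top_k_features(attr_new, k)
    return np.array(
        [len(set(a) & set(b)) / k for a, b in zip(top_old, top_new)]
    )


def bcx_score(y_true, pred_old, pred_new, attr_old, attr_new, k):
    """Average agreement over samples that BOTH models predict correctly."""
    both_correct = (pred_old == y_true) & (pred_new == y_true)
    if not both_correct.any():
        return float("nan")  # undefined when no sample qualifies
    agreement = feature_agreement(
        attr_old[both_correct], attr_new[both_correct], k
    )
    return float(agreement.mean())


def l2_explanation_penalty(attr_old, attr_new):
    """Mean Euclidean distance between old and new explanations
    (hypothetical stand-in for a Norm-based training penalty)."""
    return float(np.mean(np.linalg.norm(attr_old - attr_new, axis=1)))
```

In a retraining loop, a penalty of this kind would be added to the usual predictive loss with a trade-off weight, and it would need to be written in an autodiff framework so that gradients can flow through the post-update model's explanations; NumPy is used here only to keep the metric itself easy to read.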
