Robust Counterfactual Explanations in Machine Learning: A Survey
Junqi Jiang, Francesco Leofante, Antonio Rago, Francesca Toni
TL;DR
This survey addresses the fragility of counterfactual explanations (CEs) under changing conditions and systematically classifies robustness into four dimensions: Model Changes (MC), Model Multiplicity (MM), Noisy Executions (NE), and Input Changes (IC). It reviews methodological families (robust optimisation, class-score elevation, probabilistic modelling, and re-training) together with robustness metrics such as VaR and Δ-robustness, and discusses theoretical guarantees, computational costs, and practical limitations. The work also contrasts fixed versus pending prediction scenarios under MM, examines verification-based, loss-function-based, and probabilistic approaches to NE and IC robustness, and highlights the need for standardised benchmarks and user studies. Overall, the paper provides a comprehensive foundation for developing robust CEs with clearer guarantees, guiding both theory and practice toward more trustworthy and fair algorithmic recourse.
Abstract
Counterfactual explanations (CEs) are advocated as being ideally suited to providing algorithmic recourse for subjects affected by the predictions of machine learning models. While CEs can be beneficial to affected individuals, recent work has exposed severe issues related to the robustness of state-of-the-art methods for obtaining CEs. Since a lack of robustness may compromise the validity of CEs, techniques to mitigate this risk are in order. In this survey, we review works in the rapidly growing area of robust CEs and perform an in-depth analysis of the forms of robustness they consider. We also discuss existing solutions and their limitations, providing a solid foundation for future developments.
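As a minimal illustration of how a lack of robustness can compromise validity, the sketch below assumes a linear classifier where class 1 is the desired outcome, and models a "model change" as shifting every weight and the bias by at most `delta`. All function names (`score`, `predict`, `worst_case_score`, `is_robust`) and the example numbers are hypothetical. The check is in the spirit of the Δ-robustness notion mentioned in the TL;DR, not the method of any particular surveyed paper. For a linear model the worst case over this box of parameters has a closed form, so no sampling or solver is needed.

```python
# Hypothetical illustration: validity of a counterfactual explanation (CE)
# under bounded changes to a linear model's parameters.

def score(weights, bias, x):
    """Linear score w.x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(weights, bias, x):
    """Class 1 (desired outcome) if the score is non-negative, else class 0."""
    return 1 if score(weights, bias, x) >= 0 else 0

def worst_case_score(weights, bias, x, delta):
    """Lowest score of x over every model whose weights and bias each differ
    from the original by at most delta (assumed box perturbation).
    For a linear model the minimum shifts each weight by -delta*sign(x_i)
    and the bias by -delta, lowering the score by delta*(sum|x_i| + 1)."""
    return score(weights, bias, x) - delta * (sum(abs(xi) for xi in x) + 1)

def is_robust(weights, bias, x_cf, delta):
    """True if x_cf stays in class 1 for every model in the perturbation box."""
    return worst_case_score(weights, bias, x_cf, delta) >= 0

w, b = [1.0, 1.0], -1.0
x = [0.2, 0.3]          # original input, assigned the undesired class 0
cf_tight = [0.5, 0.5]   # minimal CE lying exactly on the decision boundary
cf_safe = [1.0, 1.0]    # more conservative CE, further from the boundary

print(predict(w, b, x))                  # 0
print(predict(w, b, cf_tight))           # 1: valid for the original model
print(predict([0.9, 1.0], b, cf_tight))  # 0: invalid after a small parameter shift
print(is_robust(w, b, cf_tight, 0.1))    # False
print(is_robust(w, b, cf_safe, 0.1))     # True
```

The minimal CE is valid for the original model but flips after a small parameter shift, while the more conservative CE remains valid for every model within `delta = 0.1`. Non-linear models generally lack such a closed form and call for verification or approximation techniques instead.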
