Improving the Perturbation-Based Explanation of Deepfake Detectors Through the Use of Adversarially-Generated Samples
Konstantinos Tsigos, Evlampios Apostolidis, Vasileios Mezaris
TL;DR
This work tackles the challenge of explaining deepfake detectors by addressing the distribution shifts that conventional perturbations induce, which can undermine traditional explanation methods. It introduces adversarially-generated samples, produced via Natural Evolution Strategies, to form perturbation masks: image regions are replaced with the corresponding content from the adversarial instance, avoiding out-of-distribution inputs. The approach is integrated into four perturbation-based explainers (LIME, SHAP, SOBOL, RISE) and evaluated on FaceForensics++ using a quantitative framework that measures detector accuracy drops and region sufficiency, complemented by qualitative visualizations. Results show mostly positive gains in explanation quality across the four methods, with LIME$_{adv}$ performing best overall, though at the cost of higher computational requirements. These findings suggest a more faithful and interpretable localization of manipulated regions, with practical implications for forensic analysis and trust in deepfake detectors.
Abstract
In this paper, we introduce the idea of using adversarially-generated samples of the input images that were classified as deepfakes by a detector, to form perturbation masks for inferring the importance of different input features and producing visual explanations. We generate these samples based on Natural Evolution Strategies, aiming to flip the original deepfake detector's decision so that it classifies these samples as real. We apply this idea to four perturbation-based explanation methods (LIME, SHAP, SOBOL and RISE) and evaluate the performance of the resulting modified methods using a state-of-the-art deepfake detection model, a benchmarking dataset (FaceForensics++) and a corresponding explanation evaluation framework. Our quantitative assessments document the mostly positive contribution of the proposed perturbation approach to the performance of the explanation methods. Our qualitative analysis shows the capacity of the modified explanation methods to demarcate the manipulated image regions more accurately, and thus to provide more useful explanations.
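To make the core idea concrete, the following is a minimal sketch of the two steps described above: searching for an adversarial sample with a Natural Evolution Strategies (NES) style gradient estimate, and then perturbing an image by swapping masked regions with the corresponding adversarial content. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `fake_prob` callable (assumed to return the detector's deepfake probability for an image in [0, 1]), the antithetic sampling scheme, the sign-based update, and all hyperparameters are hypothetical choices made for this example.

```python
import numpy as np


def nes_adversarial(image, fake_prob, steps=50, pop=20, sigma=0.05,
                    lr=0.02, eps=0.1, seed=0):
    """Search near `image` for a sample the detector scores as real.

    `fake_prob` is a hypothetical black-box callable returning the
    detector's deepfake probability. All hyperparameters are assumptions.
    """
    rng = np.random.default_rng(seed)
    adv = image.copy()
    for _ in range(steps):
        if fake_prob(adv) < 0.5:  # decision flipped to "real": stop early
            break
        # NES gradient estimate from antithetic Gaussian perturbations.
        noise = rng.standard_normal((pop,) + image.shape)
        grad = np.zeros_like(image)
        for u in noise:
            diff = fake_prob(adv + sigma * u) - fake_prob(adv - sigma * u)
            grad += diff * u
        grad /= 2.0 * pop * sigma
        # Step against the estimated gradient to lower the fake score.
        adv = adv - lr * np.sign(grad)
        # Stay within an eps-ball of the input and a valid pixel range.
        adv = np.clip(adv, image - eps, image + eps)
        adv = np.clip(adv, 0.0, 1.0)
    return adv


def perturb_with_adversarial(image, adv, segments, keep):
    """Replace the segments with keep == 0 by the adversarial content.

    `segments` is an (H, W) integer map of region ids and `keep` is a
    binary vector with one entry per region, as sampled by a
    perturbation-based explainer.
    """
    replace = ~np.asarray(keep, dtype=bool)[segments]  # (H, W) boolean mask
    out = image.copy()
    out[replace] = adv[replace]
    return out
```

In a perturbation-based explainer, `perturb_with_adversarial` would stand in for the usual step that blanks or blurs the masked regions, so that the detector is queried on images that remain close to the original data distribution.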
