Improving the Perturbation-Based Explanation of Deepfake Detectors Through the Use of Adversarially-Generated Samples

Konstantinos Tsigos, Evlampios Apostolidis, Vasileios Mezaris

TL;DR

This work tackles the challenge of explaining deepfake detectors by addressing perturbation-induced distribution shifts that can undermine traditional methods. It introduces adversarially-generated samples, produced via Natural Evolution Strategies, to form perturbation masks, replacing image regions with corresponding content from adversarial instances to avoid out-of-distribution issues. The approach is integrated into four perturbation-based explainers (LIME, SHAP, SOBOL, RISE) and evaluated on FaceForensics++ using a quantitative framework that measures detector accuracy drops and region sufficiency, complemented by qualitative visualizations. Results show generally positive gains in explanation quality across methods, with LIME$_{adv}$ performing best overall, though at the cost of higher computational requirements. These findings suggest a more faithful and interpretable localization of manipulated regions, with practical implications for forensic analysis and trust in deepfake detectors.

Abstract

In this paper, we introduce the idea of using adversarially-generated samples of the input images that were classified as deepfakes by a detector to form perturbation masks for inferring the importance of different input features and producing visual explanations. We generate these samples based on Natural Evolution Strategies, aiming to flip the original deepfake detector's decision so that it classifies these samples as real. We apply this idea to four perturbation-based explanation methods (LIME, SHAP, SOBOL and RISE) and evaluate the performance of the resulting modified methods using a state-of-the-art deepfake detection model, a benchmarking dataset (FaceForensics++) and a corresponding explanation evaluation framework. Our quantitative assessments document the mostly positive contribution of the proposed perturbation approach to the performance of the explanation methods. Our qualitative analysis shows the capacity of the modified explanation methods to demarcate the manipulated image regions more accurately, and thus to provide more useful explanations.
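
As a rough illustration of how Natural Evolution Strategies can search for a sample that flips a detector's decision using only query access, the sketch below estimates a gradient from random perturbations and steps against it. It is a generic NES loop, not the authors' implementation: the function name `nes_adversarial`, the toy detector `toy_fake_prob`, and every hyperparameter (`sigma`, `lr`, `pop`, `steps`, `eps`) are assumptions made for this example.

```python
import numpy as np

def toy_fake_prob(x):
    # Hypothetical stand-in for a deepfake detector: returns P(fake) in [0, 1].
    return 1.0 / (1.0 + np.exp(-20.0 * (x.mean() - 0.5)))

def nes_adversarial(image, fake_prob, sigma=0.1, lr=0.05, pop=20,
                    steps=200, eps=0.3, seed=0):
    """Search near `image` for a sample that `fake_prob` scores below 0.5."""
    rng = np.random.default_rng(seed)
    x = image.astype(np.float64).copy()
    for _ in range(steps):
        if fake_prob(x) < 0.5:          # decision flipped to "real": stop
            break
        noise = rng.standard_normal((pop,) + x.shape)
        # Antithetic sampling: query the detector at x + sigma*n and x - sigma*n.
        f_plus = np.array([fake_prob(np.clip(x + sigma * n, 0, 1)) for n in noise])
        f_minus = np.array([fake_prob(np.clip(x - sigma * n, 0, 1)) for n in noise])
        # NES estimate of the gradient of P(fake) with respect to x.
        grad = np.tensordot(f_plus - f_minus, noise, axes=1) / (2 * pop * sigma)
        x = x - lr * np.sign(grad)      # step towards lower P(fake)
        x = np.clip(x, image - eps, image + eps)  # stay close to the input
        x = np.clip(x, 0.0, 1.0)        # stay a valid image
    return x

image = np.full((4, 4, 3), 0.6)         # toy "deepfake" input, P(fake) > 0.5
adversarial = nes_adversarial(image, toy_fake_prob)
```

The `eps` bound keeps the adversarial sample visually close to the input, which is what allows its segments to serve as in-distribution replacement content.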

Paper Structure

This paper contains 10 sections, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: The original image (a) and the perturbed instances of it after: occluding certain segments (b), replacing them with fixed values (c), blurring them (d), adding Gaussian noise (e), and replacing them with the corresponding segments of an adversarially-generated sample of the input image (f). (Source of the original face image (a): FaceForensics++ dataset.)
  • Figure 2: The processing pipeline of the proposed explanation approach; dashed lines indicate iterative processes. (Input Deepfake Image source: FaceForensics++ dataset).
  • Figure 3: Top: Example of a traditional perturbation approach. Bottom: Illustration of the proposed perturbation approach that uses the adversarially-generated sample of the input image.
  • Figure 4: The observed detection accuracy for the different explanation methods and their modified versions. The lower the accuracy, the higher the ability of the explanation method to spot the most important image regions for the deepfake detector's decisions.
  • Figure 5: The obtained explanations per type of manipulation (displayed using the default visualization format of each explanation method).
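
The replacement step described in Figure 1(f) and Figure 3 can be sketched in a few lines: segments that a perturbation mask switches off are filled with the same pixels from the adversarial sample, rather than with zeros, fixed values, blur or noise. The helper name `perturb_with_adversarial` and the toy two-segment layout are assumptions for illustration; the segmentation and mask sampling used by each explanation method are not reproduced here.

```python
import numpy as np

def perturb_with_adversarial(image, adversarial, segments, keep):
    """Build one perturbed instance of `image`.

    image, adversarial: arrays of shape (H, W, C)
    segments: integer array of shape (H, W) with a segment id per pixel
    keep: sequence of 0/1 flags, one per segment (1 = keep original pixels)
    """
    keep = np.asarray(keep, dtype=bool)
    pixel_keep = keep[segments]                 # (H, W) mask via label lookup
    # Where a segment is switched off, take the adversarial sample's pixels.
    return np.where(pixel_keep[..., None], image, adversarial)

# Toy example: left half is segment 0, right half is segment 1.
image = np.zeros((4, 4, 3))
adversarial = np.ones((4, 4, 3))
segments = np.repeat(np.array([[0, 0, 1, 1]]), 4, axis=0)
perturbed = perturb_with_adversarial(image, adversarial, segments, keep=[1, 0])
```

In this toy case the left half of `perturbed` keeps the original pixels and the right half comes from the adversarial sample. A perturbation-based explainer would generate many such instances with different `keep` vectors and query the detector on each.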