Accuracy is Not Enough: Poisoning Interpretability in Federated Learning via Color Skew

Farhin Farhad Riya, Shahinul Hoque, Jinyuan Stella Sun, Olivera Kotevska

TL;DR

This paper addresses the risk that interpretability can be corrupted in federated learning (FL) without sacrificing predictive accuracy. It introduces the Chromatic Perturbation Module (CPM), a saliency-aware color transformation that degrades Grad-CAM explanations while preserving predictions, and shows that this degradation accumulates across FL rounds. Across multiple datasets, CPM reduces saliency fidelity and peak activation overlap (the latter by up to 35%) while keeping classification accuracy above 96%. The work highlights interpretability as an attack surface in FL and motivates defenses that explicitly monitor and preserve explanation fidelity during training.

Abstract

As machine learning models are increasingly deployed in safety-critical domains, visual explanation techniques have become essential tools for supporting transparency. In this work, we reveal a new class of attacks that compromise model interpretability without affecting accuracy. Specifically, we show that small color perturbations applied by adversarial clients in a federated learning setting can shift a model's saliency maps away from semantically meaningful regions while keeping the prediction unchanged. The proposed saliency-aware attack framework, called the Chromatic Perturbation Module (CPM), systematically crafts adversarial examples by altering the color contrast between foreground and background in a way that disrupts explanation fidelity. These perturbations accumulate across training rounds, poisoning the global model's internal feature attributions in a stealthy and persistent manner. Our findings challenge a common assumption in model auditing, namely that correct predictions imply faithful explanations, and demonstrate that interpretability itself can be an attack surface. We evaluate this vulnerability across multiple datasets and show that standard training pipelines are insufficient to detect or mitigate explanation degradation, especially in the federated learning setting, where subtle color perturbations are harder to discern. Our attack reduces peak activation overlap in Grad-CAM explanations by up to 35% while preserving classification accuracy above 96% on all evaluated datasets.
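The abstract reports a drop in "peak activation overlap" between Grad-CAM maps but does not define the metric here. As a rough illustration only, the sketch below assumes one plausible reading: the intersection-over-union of the most strongly activated pixels in two saliency maps. The function name `peak_overlap` and the `top_fraction` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def peak_overlap(saliency_a, saliency_b, top_fraction=0.1):
    """Intersection-over-union of the top-activated pixels of two saliency maps.

    Assumption: "peak activation overlap" is read here as the IoU of the
    top `top_fraction` of pixels; the paper may define it differently.
    """
    a = np.asarray(saliency_a, dtype=float)
    b = np.asarray(saliency_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("saliency maps must have the same shape")

    # Number of pixels treated as the "peak" region (at least one).
    k = max(1, int(round(top_fraction * a.size)))

    # Flat indices of the k largest activations in each map.
    top_a = set(np.argsort(a.ravel())[-k:].tolist())
    top_b = set(np.argsort(b.ravel())[-k:].tolist())

    return len(top_a & top_b) / len(top_a | top_b)
```

Under this reading, a clean model and a poisoned model can agree on the predicted class for an image while `peak_overlap` between their Grad-CAM maps falls well below 1, which is the kind of discrepancy an accuracy-only audit would miss.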

Paper Structure

This paper contains 29 sections, 1 equation, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of the attack flow of an adversarial client
  • Figure 2: Attack samples generated with CPM. $\Delta E$ values quantify the perceptual color difference (CIEDE2000) between clean and attack samples.
  • Figure 3: Interpretability of the clean model on attack samples
  • Figure 4: Comparison of SSIM scores between clean and skewed models using MobileNet (a) and DenseNet121 (b) on the CIFAR-100 dataset
  • Figure 5: Visualization of the accumulated distortion of Grad-CAM explanations under CPM across FL rounds
  • ...and 4 more figures