Illuminating the Black Box: Real-Time Monitoring of Backdoor Unlearning in CNNs via Explainable AI
Tien Dat Hoang
TL;DR
The paper tackles the challenge of removing backdoors from CNNs in a transparent, verifiable way. It introduces a real-time monitoring framework that embeds Grad-CAM into the unlearning loop, and a Trigger Attention Ratio (TAR) metric that quantifies how far the model's attention shifts from triggers to legitimate object features. A balanced unlearning strategy combines gradient ascent on poisoned samples, Elastic Weight Consolidation to mitigate forgetting, and a recovery phase to restore clean accuracy. On CIFAR-10 with BadNets, the Attack Success Rate (ASR) drops from 96.51% to 5.52% while clean accuracy is preserved at 82.06%. The approach gives security practitioners observable, interpretable evidence that a backdoor has been removed, and supports the use of explainable AI in defense workflows.
Abstract
Backdoor attacks pose severe security threats to deep neural networks by embedding malicious triggers that force misclassification. While machine unlearning techniques can remove backdoor behaviors, current methods lack transparency and real-time interpretability. This paper introduces a novel framework that integrates Gradient-weighted Class Activation Mapping (Grad-CAM) into the unlearning process to provide real-time monitoring and explainability. We propose the Trigger Attention Ratio (TAR) metric to quantitatively measure the model's attention shift from trigger patterns to legitimate object features. Our balanced unlearning strategy combines gradient ascent on backdoor samples, Elastic Weight Consolidation (EWC) to prevent catastrophic forgetting, and a recovery phase to restore clean accuracy. Experiments on CIFAR-10 with BadNets attacks demonstrate that our approach reduces Attack Success Rate (ASR) from 96.51% to 5.52%, a 94.28% relative reduction, while retaining 99.48% of the original clean accuracy (82.06% after unlearning). The integration of explainable AI enables transparent, observable, and verifiable backdoor removal.
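The abstract does not give a formula for TAR, so the following is only a minimal sketch of one plausible reading: the share of total Grad-CAM activation that falls inside the trigger region. The function name `trigger_attention_ratio`, the box convention, and the toy heatmaps are all assumptions made for illustration and may differ from the paper's actual definition.

```python
from typing import List, Tuple

Heatmap = List[List[float]]          # 2D Grad-CAM map, one value per pixel
Box = Tuple[int, int, int, int]      # (row_start, row_end, col_start, col_end), end-exclusive


def trigger_attention_ratio(heatmap: Heatmap, trigger_box: Box, eps: float = 1e-12) -> float:
    """Hypothetical TAR: activation inside the trigger box divided by total activation.

    Assumed definition, not taken from the paper. Negative values are clipped
    to zero, mirroring the ReLU that Grad-CAM applies to its maps.
    """
    r0, r1, c0, c1 = trigger_box
    total = sum(max(v, 0.0) for row in heatmap for v in row)
    if total <= eps:
        return 0.0  # no attention anywhere, so none on the trigger
    inside = sum(max(v, 0.0) for row in heatmap[r0:r1] for v in row[c0:c1])
    return inside / total


# Toy 4x4 heatmaps with an assumed 2x2 trigger patch in the bottom-right corner.
TRIGGER_BOX: Box = (2, 4, 2, 4)

# Before unlearning: attention concentrated on the trigger patch.
before = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 4.0, 4.0],
    [0.0, 0.0, 4.0, 4.0],
]

# After unlearning: attention centered on the object instead.
after = [
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 4.0, 4.0, 1.0],
    [1.0, 4.0, 4.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
]

tar_before = trigger_attention_ratio(before, TRIGGER_BOX)
tar_after = trigger_attention_ratio(after, TRIGGER_BOX)
```

Under this reading, a TAR that falls over the course of unlearning (here `tar_before` is larger than `tar_after`) would indicate attention moving off the trigger, which is the kind of shift the metric is described as tracking.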
