Concept-Based Masking: A Patch-Agnostic Defense Against Adversarial Patch Attacks

Ayushi Mehrotra, Derek Peng, Dipkamal Bhusal, Nidhi Rastogi

TL;DR

This work tackles adversarial patch attacks without assuming prior knowledge of patch size or location, introducing a patch-agnostic defense grounded in concept-based explanations. The method uses CRAFT to extract concept activation vectors via recursive non-negative matrix factorization and to score them with Sobol indices. It then suppresses patch effects by applying a pixel-precise blur to the regions associated with the top-ranked concepts, producing a defended input $\mathbb{D}(f, \boldsymbol{x})$. Experiments on Imagenette with ResNet-50 show that the approach achieves both higher robust accuracy and higher clean accuracy than PatchCleanser for patch sizes of 1–3%, underscoring the value of combining interpretability with robustness. The results suggest a scalable direction for securing models against adversarial patches in real-world settings; future work will address adaptive attacks and broader datasets.
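To make the masking pipeline described above more concrete, here is a minimal NumPy sketch under simplifying assumptions; it is not the authors' implementation. It factorizes a single non-negative (post-ReLU) feature map with plain multiplicative-update NMF instead of CRAFT's recursive decomposition over image crops. It takes concept scores from a caller-supplied `importance_fn` (a stand-in for the Sobol indices) and marks each feature-map cell whose dominant concept is among the top-ranked ones. It then upsamples that mask with nearest-neighbour indexing and replaces the marked pixels with a box-blurred copy. The function names (`nmf`, `box_blur`, `concept_mask_defense`) and parameters (`k`, `top`, `radius`) are hypothetical.

```python
import numpy as np


def nmf(A, k, iters=200, eps=1e-9, seed=0):
    """Factorize non-negative A (n x c) as U @ W, with U (n x k), W (k x c) >= 0.

    Lee-Seung multiplicative updates for the squared Frobenius loss.
    """
    rng = np.random.default_rng(seed)
    n, c = A.shape
    U = rng.random((n, k)) + eps
    W = rng.random((k, c)) + eps
    for _ in range(iters):
        W *= (U.T @ A) / (U.T @ U @ W + eps)
        U *= (A @ W.T) / (U @ W @ W.T + eps)
    return U, W


def box_blur(img, radius=3):
    """Mean-filter an (H, W, 3) image with a (2r+1) x (2r+1) window, edge-padded."""
    size = 2 * radius + 1
    padded = np.pad(img, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    h, w = img.shape[:2]
    out = np.zeros(img.shape, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + h, dx:dx + w]
    return out / size**2


def concept_mask_defense(img, feats, importance_fn, k=5, top=1, radius=3):
    """Hypothetical sketch: blur the pixels dominated by the top-ranked concepts.

    img: (H, W, 3) float image; feats: (h, w, C) non-negative feature map;
    importance_fn: maps U (h*w x k) to k concept scores (Sobol stand-in).
    Returns the defended image and the boolean pixel mask.
    """
    h, w, c = feats.shape
    U, _ = nmf(feats.reshape(h * w, c), k)
    top_ids = np.argsort(importance_fn(U))[::-1][:top]
    # A feature-map cell is masked if its strongest concept is a top concept.
    coarse = np.isin(U.argmax(axis=1).reshape(h, w), top_ids)
    # Nearest-neighbour upsampling of the coarse mask to pixel resolution.
    img_h, img_w = img.shape[:2]
    rows = np.arange(img_h) * h // img_h
    cols = np.arange(img_w) * w // img_w
    mask = coarse[rows][:, cols]
    defended = np.where(mask[..., None], box_blur(img, radius), img)
    return defended, mask
```

In a real pipeline, `feats` would come from an intermediate ResNet-50 layer, and the mask would be mapped back to pixels more precisely than by nearest-neighbour upsampling. The box blur is only one simple choice of blur; unmasked pixels pass through unchanged, which is what keeps clean inputs close to the original.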

Abstract

Adversarial patch attacks pose a practical threat to deep learning models by forcing targeted misclassifications through localized perturbations, often realized in the physical world. Existing defenses typically assume prior knowledge of patch size or location, limiting their applicability. In this work, we propose a patch-agnostic defense that leverages concept-based explanations to identify and suppress the most influential concept activation vectors, thereby neutralizing patch effects without explicit detection. Evaluated on Imagenette with a ResNet-50, our method achieves higher robust and clean accuracy than the state-of-the-art PatchCleanser, while maintaining strong performance across varying patch sizes and locations. Our results highlight the promise of combining interpretability with robustness and suggest concept-driven defenses as a scalable strategy for securing machine learning models against adversarial patch attacks.
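Identifying the "most influential" concepts is the step that decides what gets suppressed, and the TL;DR names Sobol indices as the scoring tool. The following is a hedged sketch of that idea, not CRAFT's API. Each concept's coefficients are scaled by a random mask in [0, 1], and the Jansen Monte Carlo estimator gives a total-order Sobol index per concept: the estimated fraction of output variance attributable to that concept, including its interactions with the others. Plain uniform sampling is used for simplicity; the sampling design used in CRAFT may differ. The names `total_sobol_indices`, `concept_importance`, and the `head` callable (mapping reconstructed activations to a class score) are hypothetical.

```python
import numpy as np


def total_sobol_indices(f, k, n=4096, seed=0):
    """Jansen Monte Carlo estimator of total-order Sobol indices.

    f maps an (n, k) array of masks in [0, 1] to an (n,) array of outputs.
    Returns an array of k estimated total indices S_T.
    """
    rng = np.random.default_rng(seed)
    A = rng.random((n, k))  # two independent sample matrices
    B = rng.random((n, k))
    fA = f(A)
    var = np.var(np.concatenate([fA, f(B)]))  # total output variance
    S = np.empty(k)
    for i in range(k):
        ABi = A.copy()
        ABi[:, i] = B[:, i]  # resample only input i
        # Jansen: S_Ti = E[(f(A) - f(A_B^i))^2] / (2 Var(Y))
        S[i] = np.mean((fA - f(ABi)) ** 2) / (2 * var)
    return S


def concept_importance(head, U, W, n=4096, seed=0):
    """Hypothetical concept scoring: perturb each concept's coefficients.

    U: (locations, k) concept coefficients; W: (k, C) concept bank;
    head: callable mapping reconstructed activations (locations, C) to a scalar.
    """
    def f(masks):
        # Scale column j of U by masks[:, j], rebuild activations, score them.
        return np.array([head((U * m) @ W) for m in masks])

    return total_sobol_indices(f, U.shape[1], n, seed)
```

For an additive score such as `f(m) = 3*m0 + 1*m1` with uniform masks, the exact total indices are 0.9 and 0.1, which makes a convenient sanity check. Concepts with the largest indices are the natural candidates for masking.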

Paper Structure

This paper contains 15 sections, 3 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Defense results grouped by patch size (columns) and examples (rows). All images are scaled uniformly for easier comparison.