SAP-DIFF: Semantic Adversarial Patch Generation for Black-Box Face Recognition Models via Diffusion Models
Mingsi Wang, Shuaiyin Yao, Chang Yue, Lijie Zhang, Guozhu Meng
TL;DR
This work tackles the robustness of face recognition (FR) systems against impersonation attacks by introducing SAP-DIFF, a diffusion-model-driven framework that applies semantic perturbations to patches in a diffusion latent space rather than editing pixels directly. The method initializes patches in the latent space via DDIM Inversion, then optimizes them with three losses (an attention disruption loss, a directional loss that guides features toward the target identity, and an adversarial cosine loss), together with UV-based patch placement, to achieve targeted impersonation in a query-based black-box setting. Extensive experiments on LFW and CelebA-HQ across ArcFace, CosFace, and FaceNet show that SAP-DIFF substantially raises the attack success rate (an average improvement of 45.66%) while reducing the number of target-model queries by about 40% compared to current SOTA methods; universality and ablation studies further validate the contributions. The results underscore the potential vulnerability of FR systems to semantically informed, diffusion-guided patches and highlight the need for defenses that address latent-space attacks and cross-model transferability in real-world deployments.
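SAP-DIFF initializes patches in the diffusion latent space via DDIM Inversion. The sketch below shows the standard deterministic (eta = 0) DDIM inversion update on a plain list-of-floats latent, paired with the matching DDIM sampling step so the round trip can be checked. It is a minimal illustration under stated assumptions, not the authors' implementation: `eps_model` is a hypothetical stand-in for a trained noise predictor, `alpha_bars` holds an assumed cumulative noise schedule (decreasing in t), and the function names `ddim_invert` and `ddim_sample` are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical noise predictor signature: (latent, timestep) -> predicted noise.
NoiseModel = Callable[[Vector, int], Vector]


def ddim_invert(z0: Sequence[float], alpha_bars: Sequence[float],
                eps_model: NoiseModel) -> Vector:
    """Map a clean latent z_0 to a noisy latent z_T with deterministic DDIM steps.

    alpha_bars[t] is the cumulative product alpha_bar_t, decreasing in t.
    """
    z = list(z0)
    for t in range(len(alpha_bars) - 1):
        a_t, a_next = alpha_bars[t], alpha_bars[t + 1]
        eps = eps_model(z, t)
        # Estimate the clean latent from the current latent and predicted noise.
        z0_pred = [(zi - math.sqrt(1.0 - a_t) * ei) / math.sqrt(a_t)
                   for zi, ei in zip(z, eps)]
        # Re-noise deterministically to the next (noisier) timestep.
        z = [math.sqrt(a_next) * xi + math.sqrt(1.0 - a_next) * ei
             for xi, ei in zip(z0_pred, eps)]
    return z


def ddim_sample(zT: Sequence[float], alpha_bars: Sequence[float],
                eps_model: NoiseModel) -> Vector:
    """Run the same deterministic DDIM update in reverse, from z_T back to z_0."""
    z = list(zT)
    for t in range(len(alpha_bars) - 1, 0, -1):
        a_t, a_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = eps_model(z, t)
        z0_pred = [(zi - math.sqrt(1.0 - a_t) * ei) / math.sqrt(a_t)
                   for zi, ei in zip(z, eps)]
        # Step to the previous (less noisy) timestep.
        z = [math.sqrt(a_prev) * xi + math.sqrt(1.0 - a_prev) * ei
             for xi, ei in zip(z0_pred, eps)]
    return z
```

With a noise predictor that ignores its input, inversion followed by sampling recovers the original latent exactly; with a real network the predicted noise changes between steps, so the round trip is only approximate. This near-invertibility is what makes the inverted latent a natural starting point for optimizing a patch that still decodes to a plausible image.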
Abstract
Given the need to evaluate the robustness of face recognition (FR) models, many efforts have focused on adversarial patch attacks, which mislead FR models by introducing localized perturbations. Impersonation attacks pose a significant threat because adversarial perturbations allow attackers to disguise themselves as legitimate users, which can lead to severe consequences, including data breaches, system damage, and misuse of resources. However, research on such attacks in FR remains limited. Existing adversarial patch generation methods show limited efficacy in impersonation attacks due to (1) the need for high attacker capabilities, (2) low attack success rates, and (3) excessive query requirements. To address these challenges, we propose SAP-DIFF, a novel method that leverages diffusion models to generate adversarial patches via semantic perturbations in the latent space rather than direct pixel manipulation. We introduce an attention disruption mechanism that generates features unrelated to the original face, facilitating the creation of adversarial samples, and a directional loss function that guides perturbations toward the target identity's feature space, thereby enhancing attack effectiveness and efficiency. Extensive experiments on popular FR models and datasets demonstrate that our method outperforms state-of-the-art approaches, achieving an average attack success rate improvement of 45.66% (all exceeding 40%) and reducing the number of queries by about 40% compared to the SOTA approach.
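The abstract describes the directional loss only as guiding perturbations toward the target identity's feature space, so its exact formula is not given here. The sketch below assumes one plausible reading: an adversarial cosine loss, 1 - cos(f(x_adv), f(x_tgt)), on face embeddings, plus a directional term that aligns the embedding shift f(x_adv) - f(x_src) with the source-to-target direction f(x_tgt) - f(x_src). The function names, the weights `lam_adv` and `lam_dir`, and the directional formula itself are hypothetical, and the attention disruption term is omitted because its form is not specified in this section.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def adversarial_cosine_loss(f_adv: Sequence[float], f_tgt: Sequence[float]) -> float:
    # Decreases as the adversarial embedding aligns with the target identity.
    return 1.0 - cosine(f_adv, f_tgt)


def directional_loss(f_adv: Sequence[float], f_src: Sequence[float],
                     f_tgt: Sequence[float]) -> float:
    # Hypothetical form: compare the shift away from the source embedding
    # with the source-to-target direction in embedding space.
    shift = [a - s for a, s in zip(f_adv, f_src)]
    goal = [t - s for t, s in zip(f_tgt, f_src)]
    return 1.0 - cosine(shift, goal)


def total_loss(f_adv: Sequence[float], f_src: Sequence[float],
               f_tgt: Sequence[float], lam_adv: float = 1.0,
               lam_dir: float = 1.0) -> float:
    # Hypothetical weighted sum; the attention disruption term is not modeled.
    return (lam_adv * adversarial_cosine_loss(f_adv, f_tgt)
            + lam_dir * directional_loss(f_adv, f_src, f_tgt))
```

In a query-based black-box setting, the embeddings (or similarity scores) would come from queries to the target model, and only scalar loss values, not gradients, would be available to the optimizer. The directional term still provides a useful signal early on, when the adversarial embedding has moved toward the target but is not yet similar to it.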
