DifAttack++: Query-Efficient Black-Box Adversarial Attack via Hierarchical Disentangled Feature Space in Cross-Domain
Jun Liu, Jiantao Zhou, Jiandian Zeng, Jinyu Tian, Zheng Li
TL;DR
DifAttack++ addresses efficient score-based black-box adversarial attacks by introducing a hierarchical disentangled feature space that separately encodes an adversarial feature (AF) and a visual feature (VF). The method trains two autoencoders, one for the clean image domain and one for the adversarial image domain (i.e., cross-domain), each equipped with a Hierarchical Decouple-Fusion (HDF) module, to disentangle AFs from VFs and reconstruct images. It then optimizes the AF using query feedback while keeping the VF fixed to craft adversarial examples. Empirical results on close-set and open-set tasks, including ImageNet, Food101, ObjectNet, CLIP models, and the real-world Imagga API, show a higher Attack Success Rate (ASR) and lower query counts than state-of-the-art baselines, while preserving high visual quality. Extensive ablations and analyses indicate that the disentanglement decouples adversarial capability from appearance, which supports robust performance against defenses and in cross-domain scenarios where the training data is unknown.
Abstract
This work investigates efficient score-based black-box adversarial attacks with a high Attack Success Rate (\textbf{ASR}) and good generalizability. We design a novel attack method based on a hierarchical DIsentangled Feature space, called \textbf{DifAttack++}, which differs significantly from existing methods that operate over the entire feature space. Specifically, DifAttack++ first disentangles an image's latent feature into an Adversarial Feature (\textbf{AF}) and a Visual Feature (\textbf{VF}) via an autoencoder equipped with our specially designed Hierarchical Decouple-Fusion (\textbf{HDF}) module, where the AF dominates the adversarial capability of an image, while the VF largely determines its visual appearance. We train two such autoencoders, one for the clean image domain and one for the adversarial image domain (i.e., cross-domain), to achieve image reconstruction and feature disentanglement, using pairs of clean images and their Adversarial Examples (\textbf{AE}s) generated from available surrogate models via white-box attack methods. Finally, in the black-box attack stage, DifAttack++ iteratively optimizes the AF according to the query feedback from the victim model until a successful AE is generated, while keeping the VF unaltered. Extensive experimental results demonstrate that DifAttack++ achieves higher ASR and better query efficiency than state-of-the-art methods, while producing AEs of much better visual quality. The code is available at https://github.com/csjunjun/DifAttack.git.
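
The black-box attack stage described above (update only the AF from score feedback, leave the VF untouched) can be illustrated with a minimal sketch. This is not the authors' implementation. The decoder and victim model are toy linear stand-ins, the optimizer is a plain random search swapped in for whatever update rule the paper uses, and the L-infinity budget `eps = 8/255` is an assumption. All names (`random_search_af_attack`, `decode`, `query`, `margin`) are hypothetical.

```python
import numpy as np

def margin(scores, label):
    """True-class score minus the best other score; negative means misclassified."""
    return scores[label] - np.delete(scores, label).max()

def random_search_af_attack(af, vf, decode, query, x_clean, label,
                            eps=8 / 255, sigma=0.1, max_queries=500, seed=0):
    """Perturb only the adversarial feature (AF); the visual feature (VF) stays fixed."""
    rng = np.random.default_rng(seed)

    def render(a):
        # Decode (AF, VF) to an image, then project into the assumed
        # L_inf ball around the clean image and the valid pixel range.
        x = decode(a, vf)
        return np.clip(np.clip(x, x_clean - eps, x_clean + eps), 0.0, 1.0)

    x_best = render(af)
    best = margin(query(x_best), label)
    n_queries = 1
    while n_queries < max_queries and best >= 0:
        cand = af + sigma * rng.standard_normal(af.shape)  # propose a new AF
        x = render(cand)
        m = margin(query(x), label)                        # one query to the victim
        n_queries += 1
        if m < best:                                       # keep only improvements
            af, x_best, best = cand, x, m
    return x_best, af, n_queries, best

# Toy stand-ins (assumptions): a sigmoid-linear decoder and a softmax-linear victim.
rng = np.random.default_rng(1)
d_img, d_af, d_vf, n_cls = 64, 8, 8, 5
W_af = rng.standard_normal((d_img, d_af))
W_vf = rng.standard_normal((d_img, d_vf))
W_cls = rng.standard_normal((n_cls, d_img))

def decode(af, vf):
    return 1.0 / (1.0 + np.exp(-(W_af @ af + W_vf @ vf)))

def query(x):
    z = W_cls @ x
    e = np.exp(z - z.max())
    return e / e.sum()

af0 = rng.standard_normal(d_af)
vf0 = rng.standard_normal(d_vf)
vf0_copy = vf0.copy()
x_clean = decode(af0, vf0)
label = int(np.argmax(query(x_clean)))

x_adv, af_adv, n_q, final_margin = random_search_af_attack(
    af0, vf0, decode, query, x_clean, label)
print("queries used:", n_q, "attack succeeded:", final_margin < 0)
```

Minimizing the margin rather than the raw true-class score means that any candidate the victim misclassifies is always accepted, so the loop stops as soon as a successful AE is found. In the actual method the decoder would be the trained autoencoder with the HDF module, and `query` would wrap the victim model or API.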
