Data-free Defense of Black Box Models Against Adversarial Attacks

Gaurav Kumar Nayak; Inder Khatri; Ruchit Rawal; Anirban Chakraborty

TL;DR

This work addresses defending black-box neural networks against adversarial attacks in a data-free setting. It introduces DBMA, a defense that first uses model stealing to obtain a surrogate model and synthetic data, then applies a Wavelet Noise Remover (WNR), guided by a Wavelet Coefficient Selection Module (WCSM), to discard wavelet coefficients that are likely to be corrupted by adversarial perturbations. A U-Net based regenerator network ($R_n$) then recovers the lost information so that the surrogate model's features on regenerated images match its features on the clean synthetic data. Combining WNR with $R_n$, which is trained with three losses ($L_{cs}$, $L_{kl}$, $L_{sc}$), substantially improves adversarial accuracy on CIFAR-10 and SVHN against BIM, PGD, and Auto Attack, even when the attacker uses the same model-stealing strategy as the defender. Clean accuracy drops modestly, but DBMA outperforms existing data-free defenses (SIT, RDG) and remains effective across surrogate architectures and larger black-box models, which makes it practical when training data or weights are unavailable.

Abstract

Companies often safeguard their trained deep models (i.e., details of the architecture, learnt weights, training details, etc.) from third-party users by exposing them only as black boxes through APIs. Moreover, they may not even provide access to the training data due to proprietary reasons or sensitivity concerns. In this work, we propose a novel defense mechanism for black box models against adversarial attacks in a data-free setup. We construct synthetic data via a generative model and train a surrogate network using model stealing techniques. To minimize adversarial contamination on perturbed samples, we propose a 'wavelet noise remover' (WNR) that performs discrete wavelet decomposition on input images and carefully selects only a few important coefficients, as determined by our 'wavelet coefficient selection module' (WCSM). To recover the high-frequency content of the image after noise removal via WNR, we further train a 'regenerator' network with the objective of retrieving the coefficients such that the reconstructed image yields predictions on the surrogate model similar to those on the original image. At test time, WNR combined with the trained regenerator network is prepended to the black box network, resulting in a large boost in adversarial accuracy. Our method improves the adversarial accuracy on CIFAR-10 by 38.98% and 32.01% against the state-of-the-art Auto Attack compared to the baseline, even when the attacker uses a surrogate architecture (Alexnet-half and Alexnet) similar to the black box architecture (Alexnet) with the same model stealing strategy as the defender. The code is available at https://github.com/vcl-iisc/data-free-black-box-defense
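The wavelet noise removal step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-level Haar wavelet, a single-channel image with even height and width, and a magnitude-based ranking of detail coefficients, none of which are specified above. The function names (`haar_dwt2`, `haar_idwt2`, `wavelet_noise_remover`) and the parameter `k_percent` are hypothetical; in DBMA the value of k is chosen by the WCSM, which is not reproduced here.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_dwt2(img):
    """One-level orthonormal 2D Haar transform of an (H, W) array, H and W even."""
    a = (img[0::2, :] + img[1::2, :]) / SQRT2  # low-pass along rows
    d = (img[0::2, :] - img[1::2, :]) / SQRT2  # high-pass along rows
    ll = (a[:, 0::2] + a[:, 1::2]) / SQRT2     # approximation coefficients
    lh = (a[:, 0::2] - a[:, 1::2]) / SQRT2     # detail subbands (naming
    hl = (d[:, 0::2] + d[:, 1::2]) / SQRT2     # conventions for LH and HL
    hh = (d[:, 0::2] - d[:, 1::2]) / SQRT2     # vary between libraries)
    return ll, (lh, hl, hh)

def haar_idwt2(ll, details):
    """Inverse of haar_dwt2."""
    lh, hl, hh = details
    h, w = ll.shape
    a = np.empty((h, 2 * w))
    d = np.empty((h, 2 * w))
    a[:, 0::2], a[:, 1::2] = (ll + lh) / SQRT2, (ll - lh) / SQRT2
    d[:, 0::2], d[:, 1::2] = (hl + hh) / SQRT2, (hl - hh) / SQRT2
    img = np.empty((2 * h, 2 * w))
    img[0::2, :], img[1::2, :] = (a + d) / SQRT2, (a - d) / SQRT2
    return img

def wavelet_noise_remover(img, k_percent):
    """Keep all LL coefficients and roughly the top k% of detail coefficients.

    Assumption: "important" is taken to mean largest absolute magnitude.
    Ties at the threshold may keep slightly more than k% of the coefficients.
    """
    ll, details = haar_dwt2(np.asarray(img, dtype=float))
    mags = np.concatenate([np.abs(d).ravel() for d in details])
    n_keep = int(round(mags.size * k_percent / 100.0))
    if n_keep == 0:
        kept = tuple(np.zeros_like(d) for d in details)
    else:
        kth = mags.size - n_keep
        thresh = np.partition(mags, kth)[kth]  # n_keep-th largest magnitude
        kept = tuple(np.where(np.abs(d) >= thresh, d, 0.0) for d in details)
    return haar_idwt2(ll, kept)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((32, 32))               # stand-in for one image channel
    denoised = wavelet_noise_remover(image, k_percent=10)
    print(denoised.shape)
```

For a colour image, the same function could be applied to each channel separately. With `k_percent=100` the input is reconstructed unchanged, and with `k_percent=0` only the approximation (LL) band survives, so each 2x2 block collapses to its mean.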

Paper Structure (25 sections, 2 equations, 7 figures, 14 tables, 1 algorithm)

Introduction
Related Works
Preliminaries
Proposed Approach
Obtain Proxy Model and Synthetic Data
Noise Removal with Wavelet Coefficient Selection Module (WCSM)
Training of Regenerator Network
Experiments
Ablation on quantity of coefficients
Effect of wavelet noise remover with WCSM
Ablation on losses
Comparison with existing Data and Training efficient defense methods
Conclusion
Ablation on coefficient selection strategy
Performance of our method (DBMA) using different wavelets
...and 10 more sections

Figures (7)

Figure 1: The average absolute magnitude of the approximation (LL) and detail (LH, HL and HH) coefficients, obtained via wavelet decomposition and averaged across samples, for (a) clean data and (b) the normalized difference between the wavelet decompositions of clean images and their corresponding adversarial images. In (a), the less contaminated LL coefficients have higher magnitude. In (b), the LL coefficients are the least affected.
Figure 1: Performance of our approach DBMA for different model stealing methods used to get the attacker’s surrogate model for data-free black box attacks. DBMA consistently improves performance against different attacks across all model stealing methods.
Figure 2: An overview of our proposed approach DBMA. In step $1$, we obtain the defender's surrogate model $S^d_m$ and synthetic data $S_d$ by model stealing from the victim model $B_m$. In step $2$, we use the Wavelet Coefficient Selection Module (WCSM) that gives the optimal % of coefficients ($\hat{k}$) to be selected by the Wavelet Noise Remover (WNR) which are likely to be least corrupted by adversarial attacks. In step $3$, we train a regenerator network $R_{n}$ using different losses ($L_{cs}$, $L_{kl}$, $L_{sc}$) such that the model $S^d_m$ yields features on the regenerated data (clean $R_n(\bar{S}_{d}^{k})$ and adversarial $R_n(\bar{S}_{da}^{k})$) similar to the features on clean data $S_d$. Finally in step $4$, we evaluate our DBMA approach on test clean ($O_{d}^{test}$) and adversarial samples ($O_{da}^{test}$) where the WNR (with $k=\hat{k}$) and trained $R_n$ are prepended to $B_m$.
Figure 2: Visualization of images. The top row shows the clean image as input and the bottom row shows the corresponding adversarial image. The predictions are obtained by the black-box network on the following inputs: (a) original clean image, (b) output of the wavelet noise remover (WNR) on the clean image, (c) output of WNR with the regenerator network ($R_n$) on the clean image, (d) original adversarial image, (e) output of WNR on the adversarial image, (f) output of WNR with $R_n$ on the adversarial image. Here, the ground truth class is Cat. Our method (DBMA) produces the correct output using the regenerated image as input.
Figure 3: Label consistency rates ($LCR_A$, $LCR_C$ and $LCR$) vs. detail coefficients ($k\%$), plotted using predictions from the black-box model $B_{m}$ on the CIFAR-10 dataset.
...and 2 more figures
