Leveraging Generalizability of Image-to-Image Translation for Enhanced Adversarial Defense
Haibo Zhang, Zhihua Yao, Kouichi Sakurai, Takeshi Saitoh
TL;DR
This paper addresses adversarial vulnerabilities in deep neural networks by proposing a generalizable defense based on image-to-image translation enhanced with residual blocks. The method uses a conditional GAN with a U-Net–style generator and a PatchGAN discriminator, optimized by a combined loss $\mathcal{L}_{cGAN} + \lambda_1 \mathcal{L}_{L1} + \lambda_2 \mathcal{L}_{perc}$, and is evaluated on MNIST, Fashion-MNIST, CIFAR-10, and ImageNet across multiple attacks. It demonstrates that training on a composite set of attacks yields strong cross-attack robustness and transferability across target models, with high restoration accuracy and competitive image quality (PSNR) while maintaining efficiency. Ablation studies show seven residual blocks offer the best trade-off between performance and training time. The work suggests practical, scalable defenses against evolving adversarial threats, with potential deployment in safety-critical applications and future work exploring printed adversarial examples.
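As an illustration of the objective and the image-quality metric named above, the following is a minimal sketch, not the authors' implementation. It assumes images are flat lists of pixel values in [0, 1], and that the adversarial term $\mathcal{L}_{cGAN}$ and the perceptual term $\mathcal{L}_{perc}$ have already been computed elsewhere and are passed in as scalars. The function names and the default weights `lambda_1` and `lambda_2` are hypothetical placeholders; the values used in the paper are not stated in this summary.

```python
import math


def l1_loss(restored, clean):
    """Mean absolute error between two equally sized flat pixel lists."""
    if len(restored) != len(clean) or not clean:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(r - c) for r, c in zip(restored, clean)) / len(clean)


def combined_loss(cgan_term, l1_term, perc_term, lambda_1=100.0, lambda_2=1.0):
    """Weighted sum L_cGAN + lambda_1 * L_L1 + lambda_2 * L_perc.

    The default weights are illustrative assumptions, not the paper's values.
    """
    return cgan_term + lambda_1 * l1_term + lambda_2 * perc_term


def psnr(restored, clean, max_value=1.0):
    """Peak signal-to-noise ratio in dB for pixel values in [0, max_value]."""
    if len(restored) != len(clean) or not clean:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - c) ** 2 for r, c in zip(restored, clean)) / len(clean)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    clean = [0.0, 1.0, 0.5, 0.25]
    restored = [0.1, 0.9, 0.5, 0.25]
    l1 = l1_loss(restored, clean)
    # The cGAN and perceptual terms below are made-up scalars for demonstration.
    total = combined_loss(cgan_term=0.7, l1_term=l1, perc_term=0.2)
    print(f"L1 = {l1:.4f}, total loss = {total:.4f}")
    print(f"PSNR = {psnr(restored, clean):.2f} dB")
```

In a real training loop these quantities would be computed on tensors by a deep learning framework; the scalar version only makes the weighting of the three terms and the PSNR definition explicit.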
Abstract
In the rapidly evolving field of artificial intelligence, machine learning has emerged as a key technology characterized by both vast potential and inherent risks. The stability and reliability of machine learning models are critical, as the models are frequent targets of security threats. Adversarial attacks, first rigorously defined by Ian Goodfellow et al. in 2015, highlight a critical vulnerability: they can trick machine learning models into making incorrect predictions by applying nearly invisible perturbations to images. Although many studies have focused on constructing sophisticated defensive mechanisms to mitigate such attacks, they often overlook the substantial time and computational costs of training and maintaining these defenses. Ideally, a defense method should generalize across various, even unseen, adversarial attacks with minimal overhead. Building on our previous work on image-to-image translation-based defenses, this study introduces an improved model that incorporates residual blocks to enhance generalizability. The proposed method requires training only a single model, defends effectively against diverse attack types, and transfers well between different target models. Experiments show that our model can restore classification accuracy from near zero to an average of 72\% while remaining competitive with state-of-the-art methods.
