Characterizing Model Robustness via Natural Input Gradients

Adrián Rodríguez-Muñoz; Tongzhou Wang; Antonio Torralba

TL;DR

The paper investigates how model robustness to adversarial perturbations can be understood and improved by studying natural-input gradients. It demonstrates that regularizing the $L_1$ norm of loss-input gradients on clean data yields strong robustness, particularly when using smooth activation functions, and can approach state-of-the-art adversarial robustness at substantially lower computational cost. Beyond gradient magnitudes, the authors show that aligning gradients with image edges also enhances robustness, suggesting architecture-level strategies for perceptual robustness. These findings imply that robustness can be partially achieved through training on natural inputs and thoughtful architectural choices, with practical implications for deploying robust vision systems at scale.

Abstract

Adversarially robust models are locally smooth around each data sample, so that small perturbations cannot drastically change model outputs. In modern systems, such smoothness is usually obtained via Adversarial Training, which explicitly trains models to perform well on perturbed examples. In this work, we show the surprising effectiveness of instead regularizing the gradient with respect to model inputs on natural examples only. Penalizing the input Gradient Norm is commonly believed to be a much inferior approach. Our analyses identify that the performance of Gradient Norm regularization critically depends on the smoothness of activation functions, and that, contrary to prior belief, it is in fact extremely effective on modern vision transformers that adopt smooth activations over piecewise linear ones (e.g., ReLU). On ImageNet-1k, Gradient Norm training achieves more than 90% of the performance of state-of-the-art PGD-3 Adversarial Training (52% vs. 56%), while using only 60% of its computation cost and requiring no complex adversarial optimization. Our analyses also highlight the relationship between model robustness and properties of natural input gradients, such as asymmetric sample and channel statistics. Surprisingly, we find that model robustness can be significantly improved by simply regularizing its gradients to concentrate on image edges, without explicit conditioning on the gradient norm.
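To make the regularized objective concrete, the following is a minimal sketch (not the authors' implementation) that evaluates the weighted objective given in the Figure 5 caption, $(2 - \lambda) \mathcal{L}_\text{CE} + \lambda \lVert\nabla_x \mathcal{L}\rVert_1$, for a toy linear softmax classifier in plain Python. The model, the weights `W` and `b`, the input `x`, and all helper names are hypothetical; a linear model is used only because its input gradient has a closed form. Training with this penalty would additionally require differentiating through the input gradient, which the sketch does not attempt.

```python
import math

def ce_and_input_grad(W, b, x, y):
    """Cross-entropy of a linear softmax classifier and its gradient w.r.t. the input x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]
    m = max(logits)                       # subtract the max for numerical stability
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    p = [v / s for v in e]
    loss = -math.log(p[y])
    # d loss / d logits = p - onehot(y); chain rule through logits = W x + b
    dz = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
    grad = [sum(dz[k] * W[k][i] for k in range(len(W))) for i in range(len(x))]
    return loss, grad

def gradnorm_objective(W, b, x, y, lam):
    """(2 - lam) * CE + lam * ||grad_x CE||_1, evaluated on a clean input only."""
    loss, grad = ce_and_input_grad(W, b, x, y)
    l1 = sum(abs(g) for g in grad)
    return (2.0 - lam) * loss + lam * l1, loss, l1

# Hypothetical toy problem: 2 classes, 3 input features.
W = [[1.0, -2.0, 0.5], [0.0, 1.0, -1.0]]
b = [0.1, -0.1]
x = [0.2, 0.7, 0.5]
total, ce, l1 = gradnorm_objective(W, b, x, y=0, lam=1.0)
print(f"CE={ce:.4f}  L1 grad norm={l1:.4f}  objective={total:.4f}")
```

By Hölder's inequality, an $L_\infty$ perturbation of size $\epsilon$ changes the loss by at most $\epsilon \lVert\nabla_x \mathcal{L}\rVert_1$ to first order, which is one way to read why the $L_1$ norm is the quantity being penalized.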

Paper Structure (44 sections, 5 equations, 14 figures, 9 tables)

Introduction
Related Works
Adversarial examples.
Training robust models.
Perceptually aligned gradients.
Experimental Settings for Evaluating Robustness
Dataset
Architecture
Adversarial Training skyline and training recipe
Attack benchmark
Small $L_1$ Gradient Norm Makes a Model Robust
Regularizing for Small Gradient Norms
Smooth Activation Functions Make Gradient Norm Regularization Effective
Trading Off Smoothness and Performance
Beyond Small $L_1$ Norm: Other Properties of Robust Gradients
...and 29 more sections

Figures (14)

Figure 1: Comparison of loss-input gradients of non-robust and robust models across architectures for a set of images. Non-robust models are taken from the open-source repository timm (Wightman, 2019). Adversarially trained models are from the work of Liu et al. (2023). Gradient norm regularization optimizes the gradient norm objective. As can be seen, a model can be easily identified as vulnerable or robust simply by looking at clean input gradients. Gradients of robust models (adversarial training and gradient norm regularization) highly resemble the input images, and look visually similar to each other to the human eye. By contrast, gradients of vulnerable models are noise-like, bearing little apparent resemblance to each other or to the input images. Numerically, the norm of the input gradient (top right for each gradient) is also highly discriminative of vulnerability or robustness. Gradients are normalized to [0, 1] and then shifted by 0.4 for visualization purposes only.
Figure 2: Robust accuracy vs. epsilon for the PGD100 attack on ImageNet for Swin Transformer trained with Gradient Norm Regularization and with state-of-the-art Adversarial Training. Gradient Norm Regularization achieves slightly better accuracy on clean images ($\epsilon = 0$) and good robust performance ($\epsilon > 0$), despite seeing only natural examples and having 60% of the computational cost of Adversarial Training with PGD-3. Robust accuracy for both models smoothly converges towards 0% as the adversarial strength grows.
Figure 3: Comparison of PGD10 $L_\infty$ ($\epsilon=\frac{4}{255}$) perturbations of non-robust and robust models across architectures for a set of images (same as in Figure 1). The border is blue if the model is robust to the perturbation, and red if it is not. As with clean input gradients, models can again be easily identified as vulnerable or robust simply by looking at the perturbations. Perturbations coming from robust models (adversarial training and gradient norm regularization) highly resemble the input images, though the visual similarity is lower than for the input gradients. Perturbations originating from vulnerable models are even more noise-like, with the exception of images with very flat backgrounds, potentially because the gradient may oscillate around zero in those areas. Perturbations are normalized to [0, 1] for visualization purposes only.
Figure 4: Clean and PGD10 ($\epsilon=4$) robust accuracy vs epoch for ResNet50 with ReLU and GeLU trained with Adversarial Training and Gradient Norm Regularization. We observe how the ReLU ResNet is not capable of handling the regularization objective at the appropriate strength.
Figure 5: Clean and AutoAttack (AA; $\epsilon=\frac{4}{255}$) accuracy when optimizing $(2 - \lambda) \mathcal{L}_\text{CE} + \lambda \lVert\nabla_x \mathcal{L}\rVert_1$ for Swin Transformer. Models are obtained by finetuning the original GradNorm model (from the GradNorm AutoAttack results table) for 30 epochs, because training from scratch is computationally expensive. Maximum robust accuracy is obtained for $\lambda=1.4$, which gives clean accuracy 76.28 and AutoAttack accuracy 52.48. The robust accuracy gap to Adversarial Training is 3.64%.
...and 9 more figures
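
The captions above evaluate models under PGD $L_\infty$ attacks (PGD10, PGD100, $\epsilon=\frac{4}{255}$). As a rough illustration of what such an attack does, here is a minimal sketch of sign-gradient PGD on a hypothetical linear softmax classifier in plain Python. It is not the benchmark code used in the paper: the toy weights, the step size of 1/255, and the absence of a random start are all assumptions made for brevity.

```python
import math

def loss_and_grad(W, b, x, y):
    """Cross-entropy of a linear softmax classifier and its gradient w.r.t. x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]
    m = max(logits)
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    p = [v / s for v in e]
    dz = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
    grad = [sum(dz[k] * W[k][i] for k in range(len(W))) for i in range(len(x))]
    return -math.log(p[y]), grad

def pgd_linf(W, b, x0, y, eps, step, n_steps):
    """Projected gradient ascent on the loss inside the L_inf ball of radius eps."""
    x = list(x0)
    for _ in range(n_steps):
        _, g = loss_and_grad(W, b, x, y)
        # Ascent step along the sign of the input gradient.
        x = [xi + step * ((gi > 0) - (gi < 0)) for xi, gi in zip(x, g)]
        # Project back onto the eps-ball around x0, then onto the pixel range [0, 1].
        x = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x, x0)]
        x = [min(max(xi, 0.0), 1.0) for xi in x]
    return x

# Hypothetical toy problem: 2 classes, 3 "pixels" in [0, 1].
W = [[1.0, -2.0, 0.5], [0.0, 1.0, -1.0]]
b = [0.1, -0.1]
x0 = [0.2, 0.7, 0.5]
eps = 4 / 255
x_adv = pgd_linf(W, b, x0, y=0, eps=eps, step=1 / 255, n_steps=10)
clean_loss, _ = loss_and_grad(W, b, x0, 0)
adv_loss, _ = loss_and_grad(W, b, x_adv, 0)
print(f"clean loss={clean_loss:.4f}  adversarial loss={adv_loss:.4f}")
```

Because every step moves along the sign of the input gradient, the perturbation found by such an attack is closely tied to the clean input gradient, which is consistent with the visual similarity between Figures 1 and 3 noted in the captions.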
