Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models

Andy Zhou; Jindong Wang; Yu-Xiong Wang; Haohan Wang

TL;DR

This work tackles robust generalization for vision models under distribution shift by combining knowledge distillation (KD) with diverse data augmentation. It introduces Discrete Adversarial Distillation (DAD), which uses a robust foundation-model teacher (e.g., CLIP) to generate adversarial samples and a VQGAN to discretize them into informative augmentations, all within a KD objective that also draws on the teacher's representations. A Wasserstein-distance-based theory formalizes why diverse augmentations that resemble the test distribution improve robustness. Empirical results on ViT-B/16 and ResNet50 show substantial gains on natural shifts (e.g., ImageNet-Sketch, ImageNet-Rendition) with modest overhead and compatibility with existing augmentations. The work offers a practical way to transfer foundation-model robustness to smaller students, with limitations that include the student's bias toward its teacher and the need for broader evaluation on semantic shifts.

Abstract

We propose a conceptually simple and lightweight framework for improving the robustness of vision models through the combination of knowledge distillation and data augmentation. We address the conjecture that larger models do not make for better teachers by showing strong gains in out-of-distribution robustness when distilling from pretrained foundation models. Following this finding, we propose Discrete Adversarial Distillation (DAD), which leverages a robust teacher to generate adversarial examples and a VQGAN to discretize them, creating more informative samples than standard data augmentation techniques. We provide a theoretical framework for the use of a robust teacher in the knowledge distillation with data augmentation setting and demonstrate strong gains in out-of-distribution robustness and clean accuracy across different student architectures. Notably, our method adds minor computational overhead compared to similar techniques and can be easily combined with other data augmentations for further improvements.
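As a rough illustration of the distillation-with-augmentation setting described above, the sketch below computes a generic Hinton-style KD loss over one clean sample and one augmented sample. It is not the paper's exact objective: the function names, the `temperature` and `alpha` parameters, and the equal weighting of the two distillation terms are assumptions made for illustration. Adversarial example generation and VQGAN discretization are not shown; the augmented logits are assumed to come from evaluating teacher and student on an already discretized adversarial image.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, label):
    """Standard cross-entropy of the student against the ground-truth label."""
    return -math.log(softmax(student_logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_clean, teacher_clean, student_aug, teacher_aug,
                      label, temperature=4.0, alpha=0.5):
    """Hypothetical combined loss (an assumption, not the paper's formula):
    label supervision on the clean sample plus teacher-student KL terms
    on both the clean and the augmented sample."""
    ce = cross_entropy(student_clean, label)
    kd_clean = kl_divergence(softmax(teacher_clean, temperature),
                             softmax(student_clean, temperature))
    kd_aug = kl_divergence(softmax(teacher_aug, temperature),
                           softmax(student_aug, temperature))
    return (1 - alpha) * ce + alpha * (kd_clean + kd_aug)

# Toy logits for a 3-class problem; values are made up for illustration.
student = [2.0, 0.5, -1.0]
loss_matched = distillation_loss(student, student, student, student, label=0)
loss_mismatched = distillation_loss(student, [0.0, 3.0, 0.0],
                                    student, [0.0, 0.0, 3.0], label=0)
```

When the student already matches the teacher on both samples, the two KL terms vanish and only the weighted cross-entropy remains; any disagreement on the augmented sample adds to the loss, which is the signal that an informative augmentation is meant to provide.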

Paper Structure (23 sections, 4 theorems, 31 equations, 3 figures, 12 tables)

Introduction
Related Work
Method
Setup
Distillation from a robust teacher
Discrete Adversarial Distillation
Theoretical Investigation
Experimental Results
Experimental Setup
Baselines
Main Experimental Results on ViT-B/16 and ResNet50
Ablations
Conclusion and limitations
Appendix
Additional results on vanilla knowledge distillation
...and 8 more sections

Key Result

Lemma 3.1

Given Assumptions 1 and 2 and variational divergence $tv$, for two arbitrary distributions $P$ and $P'$ with corresponding density functions $\delta$ and $\delta'$, $r(P') \leq r(P) + w(P',P)$.

Figures (3)

Figure 1: The overall framework of discrete adversarial distillation (DAD). We leverage a foundation model to generate and distill adversarial examples after discretization by a VQGAN.
Figure 2: Additional visualizations of generated images. To highlight the difference, we use adversarial examples that are classified differently by the base model. Using CLIP in DAD results in a more diverse adversarial example than a vanilla ResNet50. Adversarial examples in pixel-space are imperceptible.

Theorems & Definitions (8)

Lemma 3.1
Lemma 3.2
proof
Lemma 3.3
Lemma 3.4
proof
proof
proof
