ED-SAM: An Efficient Diffusion Sampling Approach to Domain Generalization in Vision-Language Foundation Models

Thanh-Dat Truong; Xin Li; Bhiksha Raj; Jackson Cothren; Khoa Luu

TL;DR

This work addresses domain generalization for vision-language foundation models by integrating diffusion-based adversarial augmentation. It formulates a worst-case DG objective around the training distribution and leverages a Transport Transformation in latent diffusion space to generate semantically varied yet conditioned adversarial samples, ensuring distributional distance remains bounded by $\rho$. The approach trains CLIP using both real and diffusion-generated adversarial samples, with experiments on CC3M, CC12M, and LAION-400M showing state-of-the-art zero-shot, linear probing, and fine-tuning performance, and demonstrating the method's scalability. The authors provide theoretical insights (including Proposition 1) and practical guidance (pretrained LDM, $M = 10$, $\rho = 0.5$), while acknowledging computational costs and potential societal impacts of large diffusion models.

Abstract

Vision-language foundation models have recently shown outstanding performance in various perception learning tasks. This performance relies mainly on large-scale pre-training datasets and various data augmentation techniques. However, the domain generalization problem of these models remains to be addressed, and it limits their generalizability to unknown data distributions. In this paper, we introduce a simple but efficient Diffusion Sampling approach to Domain Generalization (ED-SAM) to improve the generalizability of the vision-language foundation model. Our theoretical analysis reveals the critical role of the diffusion model in, and its relation to, domain generalization in the vision-language foundation model. Based on this analysis, we introduce a simple yet effective Transport Transformation for the diffusion sampling method. It effectively generates adversarial samples that improve the generalizability of the foundation model against unknown data distributions. Experimental results on vision-language pre-training datasets of different scales, including CC3M, CC12M, and LAION-400M, consistently show the state-of-the-art performance and scalability of the proposed ED-SAM approach compared to other recent methods.
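
The worst-case objective with a distance bound $\rho$, as summarized in the TL;DR, can be sketched in a generic distributionally robust form. The symbols below are assumptions chosen for illustration and may differ from the paper's own notation: $\theta$ denotes the model parameters, $P_0$ the training distribution of image-text pairs $(x, t)$, $D$ a distance between distributions, and $\mathcal{L}$ the contrastive training loss.

$$
\min_{\theta} \; \sup_{P \,:\, D(P, P_0) \le \rho} \; \mathbb{E}_{(x, t) \sim P} \big[ \mathcal{L}(\theta; x, t) \big]
$$

Read this way, the inner supremum searches for the hardest distribution within distance $\rho$ of the training data, and the diffusion-generated adversarial samples act as draws from such a distribution during training.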

Paper Structure (18 sections, 13 equations, 6 figures, 8 tables)


Introduction
Related Work
Theoretical Analysis of Generalizability in Foundation Model
Preliminary
Domain Generalization of Contrastive Language-Image Pre-Training
The Relation of Diffusion to Adversarial Augmentation
The Proposed Transport Transformation
The Proposed Diffusion-based Domain Generalization Training Approach
Experiments
Datasets, Implementations, and Evaluations
Ablation Studies
Comparisons With State-of-the-Art Approaches
Conclusions, Limitations, and Broader Impact
Proof of Proposition 1
Additional Ablation Study
...and 3 more sections

Figures (6)

Figure 1: Comparison between Our Proposed Diffusion-based Domain Generalization and Prior Methods [volpi2018generalizing, zhong2022adversarial, li2023scaling].
Figure 2: The Comparison of Our Diffusion-based Adversarial Sample and Prior Augmentations (Adversarial Sample [volpi2018generalizing], Adversarial Style [zhong2022adversarial], Masking Sample [li2023scaling]).
Figure 3: The Relation Between Adversarial Sample and Source Data.
Figure 4: The Proposed Diffusion-based Domain Generalization Framework
Figure 5: Our Diffusion-based Adversarial Samples. The first figure of each row is the original image.
...and 1 more figures
