Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification

Sravanti Addepalli; Ashish Ramayee Asokan; Lakshay Sharma; R. Venkatesh Babu

TL;DR

This work proposes Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model, and further distills the aligned VLM representations to the student.

Abstract

Vision-Language Models (VLMs) such as CLIP are trained on large amounts of image-text pairs, resulting in remarkable generalization across several data distributions. However, in several cases, their expensive training and data collection/curation costs do not justify the end application. This motivates a vendor-client paradigm, where a vendor trains a large-scale VLM and grants only input-output access to clients on a pay-per-query basis in a black-box setting. The client aims to minimize inference cost by distilling the VLM to a student model using the limited available task-specific data, and further deploying this student model in the downstream application. While naive distillation largely improves the In-Domain (ID) accuracy of the student, it fails to transfer the superior out-of-distribution (OOD) generalization of the VLM teacher using the limited available labeled images. To mitigate this, we propose Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model, and further distills the aligned VLM representations to the student. This maximally retains the pre-trained features of the student, while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings. The proposed approach achieves state-of-the-art results on the standard Domain Generalization benchmarks in a black-box teacher setting as well as a white-box setting where the weights of the VLM are accessible.

Paper Structure (25 sections, 7 equations, 4 figures, 12 tables, 1 algorithm)

Introduction
Related Works
Notations
Robustness of CLIP embeddings
CLIP training and zero-shot prediction
Characteristics of image and text embeddings
Proposed Approach: VL2V-ADiP
Distillation from VLMs to Vision models
Self-Distillation from Text to Image encoders
VL2V - Align, Distill, Predict (VL2V-ADiP)
Experiments and Results
Evaluation Details
Training Details
Comparison with the SOTA
Distillation to lower capacity student models
...and 10 more sections

Figures (4)

Figure 1: Schematic diagram showing class and domain distributions in the shared text/image embedding space of a VLM: VLMs learn highly specialized image representations that are not domain invariant. Thus, a linear classifier (red decision boundary) that is trained over the vision encoder using limited training data cannot generalize well to the target domain (shown in purple). On the other hand, generic text embeddings such as "A photo of a class" represent the core concept of a class by virtue of their training method and vast training data. Thus, they generalize effectively across domains, and a zero-shot classifier (green decision boundary) aligns better with the true distribution of classes.
Figure 2: Overview of the proposed approach VL2V-ADiP, consisting of (a) Align, (b) Distill and (c) Predict Stages for Black-Box Distillation from Vision-Language to Vision (VL2V) models.
Figure 3: OOD accuracy (%) of the proposed approach when compared to KD and ERM baselines for select classes/domains in the OfficeHome and Terra-Incognita datasets, where the performance of (a) ERM-IN (linear), (b) CLIP Zero-shot, and (c) ERM-CLIP (linear) is poor.
Figure 4: OOD and ID accuracy (%) of the proposed approach VL2V-ADiP across variation in loss weight $\lambda$ for 4 Domain Generalization datasets. Cosine similarity of the student's projected features w.r.t. the text embeddings of the VLM teacher is given a weight of $(1-\lambda)$, while that w.r.t. the image embeddings of the VLM is given a weight of $\lambda$.
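
The weighting described in the Figure 4 caption can be illustrated with a small sketch. It assumes the two cosine-similarity terms are combined as a weighted sum and negated to form a loss to minimize; the caption only states the weights $\lambda$ and $(1-\lambda)$, so the exact form of the objective, along with the function names `cosine_similarity` and `weighted_alignment_loss`, is hypothetical and not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_alignment_loss(student_feat, teacher_image_emb, teacher_text_emb, lam):
    """Negative weighted cosine similarity (assumed form, for illustration only).

    The similarity to the teacher's image embedding gets weight lam, and the
    similarity to the teacher's text embedding gets weight (1 - lam), matching
    the weights named in the Figure 4 caption.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    sim_image = cosine_similarity(student_feat, teacher_image_emb)
    sim_text = cosine_similarity(student_feat, teacher_text_emb)
    return -(lam * sim_image + (1.0 - lam) * sim_text)


# Toy vectors: the student feature matches the image embedding exactly
# and is orthogonal to the text embedding.
student = [1.0, 0.0]
image_emb = [1.0, 0.0]
text_emb = [0.0, 1.0]
loss_balanced = weighted_alignment_loss(student, image_emb, text_emb, lam=0.5)
```

With these toy vectors, setting `lam=1.0` scores only the agreement with the image embedding, while `lam=0.0` scores only the agreement with the text embedding, which is the trade-off that Figure 4 sweeps over.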
