AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

Yuhan Zhu; Yuyang Ji; Zhiyu Zhao; Gangshan Wu; Limin Wang

TL;DR

AWT can be integrated into various VLMs without additional training to improve their zero-shot capabilities, and it supports few-shot learning through an integrated multimodal adapter module. It consistently outperforms state-of-the-art methods in each setting.

Abstract

Pre-trained vision-language models (VLMs) have shown impressive results in various visual classification tasks. However, we often fail to fully unleash their potential when adapting them for new concept understanding due to limited information on new classes. To address this limitation, we introduce a novel adaptation framework, AWT (Augment, Weight, then Transport). AWT comprises three key components: augmenting inputs with diverse visual perspectives and enriched class descriptions through image transformations and language models; dynamically weighting inputs based on the prediction entropy; and employing optimal transport to mine semantic correlations in the vision-language space. AWT can be seamlessly integrated into various VLMs, enhancing their zero-shot capabilities without additional training and facilitating few-shot learning through an integrated multimodal adapter module. We verify AWT in multiple challenging scenarios, including zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization. AWT consistently outperforms the state-of-the-art methods in each setting. In addition, our extensive studies further demonstrate AWT's effectiveness and adaptability across different VLMs, architectures, and scales.
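
The weighting step described above can be illustrated with a small sketch. It assumes one plausible reading of "weighting inputs based on the prediction entropy": each augmented view gets a class-probability vector, and views with lower entropy (more confident predictions) receive larger weights through a softmax over negative entropies. The function names, the `temperature` parameter, and the toy logits are illustrative assumptions, not the paper's exact formulation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weights(view_logits, temperature=1.0):
    """Assumed weighting rule: softmax over negative per-view entropies.

    Lower prediction entropy (a more confident view) gives a larger weight.
    The returned weights are positive and sum to 1.
    """
    entropies = [entropy(softmax(logits)) for logits in view_logits]
    return softmax([-h / temperature for h in entropies])

# Toy class logits for three augmented views of one image (hypothetical values).
view_logits = [
    [8.0, 1.0, 0.5],  # confident view
    [2.0, 1.9, 2.1],  # ambiguous view
    [5.0, 2.0, 1.0],  # moderately confident view
]
weights = entropy_weights(view_logits, temperature=0.5)
print(weights)
```

In this sketch a smaller `temperature` concentrates more weight on the most confident views, while a larger one moves the weights toward uniform.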

Paper Structure (57 sections, 9 equations, 10 figures, 14 tables)

This paper contains 57 sections, 9 equations, 10 figures, 14 tables.

Introduction
Related Work
Vision-Language Models.
Adapt VLMs to downstream tasks.
Optimal Transport (OT).
Methodology
Preliminaries
Contrastive Language-Image Pre-training (CLIP).
Optimal Transport (OT).
AWT: Augment, Weight, then Transport
Augment Raw Inputs
Weight Augmented Views
Transport Across Modalities
Experiments
Zero-shot Image Tasks
...and 42 more sections

Figures (10)

Figure 1: (a) The standard protocol directly calculates distances between raw images and class names in the joint V-L space. (b) Prompt-based methods enhance inputs with post-trained visual or textual prompts to provide task-specific context. (c) Augment-based methods enrich raw inputs with image transformations and class descriptions and require no additional training. Building on this, we propose AWT, which considers both intra-modal importance variations and cross-modal semantic correlations. (d) AWT is evaluated against SOTA methods across four tasks: zero-shot and few-shot image classification, out-of-distribution generalization, and zero-shot video action recognition.
Figure 2: Pipeline of AWT: Augment, Weight, then Transport. Given an image and candidate class names, we first augment each input into diverse views. These views are then fed into the CLIP model to obtain coarse predictions. To assess the importance of each view, we use prediction confidence as a proxy and introduce an entropy-based weighting mechanism. Next, we measure the distance between image-text view sets by solving an optimal transport (OT) problem. Finally, the resulting OT distance is used to represent the distance between the input image and each class name.
Figure 3: Few-shot image classification. We present the average accuracy across 11 datasets and specific accuracy for three datasets. Numerical values can be found in Table \ref{tab:full-few-shot}.
Figure 4: Versatility analysis of AWT. Average top-1 accuracy (%) on 18 image datasets is reported.
Figure 5: Comparison of image augmentation techniques on low-resolution images. We present images from the CIFAR-10/100 datasets, where each image is 32 $\times$ 32 pixels. The comparison includes images generated by traditional image transformations and DALL·E 2.
...and 5 more figures
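
The transport step in the Figure 2 caption (measuring the distance between an image view set and a text view set by solving an OT problem) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a cost of one minus cosine similarity, uses entropic-regularized Sinkhorn iterations as the solver, and takes the view weights as the two marginals. The feature vectors, weights, class names, `eps`, and `n_iters` are all hypothetical.

```python
import math

def cosine_cost(xs, ys):
    """Cost matrix with entries 1 - cosine similarity between feature vectors."""
    def norm(v):
        return math.sqrt(sum(t * t for t in v))
    return [[1.0 - sum(p * q for p, q in zip(x, y)) / (norm(x) * norm(y))
             for y in ys] for x in xs]

def sinkhorn_distance(cost, a, b, eps=0.1, n_iters=200):
    """Entropic-regularized OT solved with Sinkhorn iterations (assumed solver).

    a and b are the marginals (view weights) and must each sum to 1.
    Returns the transport cost <plan, cost> and the transport plan.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    dist = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return dist, plan

# Toy 2-D features standing in for encoder embeddings (hypothetical values).
image_views = [[1.0, 0.1], [0.9, 0.2], [0.8, -0.1]]
image_weights = [0.5, 0.3, 0.2]
cat_texts, cat_weights = [[1.0, 0.0], [0.9, 0.3]], [0.6, 0.4]
dog_texts, dog_weights = [[0.0, 1.0], [-0.2, 0.9]], [0.5, 0.5]

dist_cat, plan_cat = sinkhorn_distance(
    cosine_cost(image_views, cat_texts), image_weights, cat_weights)
dist_dog, plan_dog = sinkhorn_distance(
    cosine_cost(image_views, dog_texts), image_weights, dog_weights)

# The class with the smaller OT distance would be the prediction in this sketch.
predicted = "cat" if dist_cat < dist_dog else "dog"
print(predicted, dist_cat, dist_dog)
```

Using the view weights as marginals means that heavily weighted views must ship more mass, so they influence the final distance more than low-confidence views do.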
