Bridge the Modality and Capability Gaps in Vision-Language Model Selection

Chao Yi; Yu-Hang He; De-Chuan Zhan; Han-Jia Ye

TL;DR

The paper addresses how to select a pre-trained vision-language model (VLM) for a target task using only text data, and identifies the Modality Gap and the Capability Gap as the key obstacles. It introduces SWAB, which uses optimal transport to relate classes in open-source datasets to classes in the target dataset, then transfers class-wise modality gap vectors and performance rankings across that bridge to predict VLM performance on the target task without images. By correcting text embeddings with the estimated gap vectors and combining the two resulting ranking predictions, SWAB achieves state-of-the-art ranking accuracy on the LOVM benchmark. This makes VLM selection feasible in data-limited settings and supports practical reuse of a diverse VLM Zoo.

Abstract

Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names. The expanding variety of pre-trained VLMs increases the likelihood of finding a suitable VLM for a specific task. To better reuse VLM resources and fully exploit their potential on different zero-shot image classification tasks, a promising strategy is to select appropriate pre-trained VLMs from the VLM Zoo, relying solely on the text data of the target dataset without access to the dataset's images. In this paper, we analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection (LOVM) setting: the "Modality Gap", the disparity between a VLM's embeddings across the two modalities, which makes text a less reliable substitute for images; and the "Capability Gap", the discrepancy between a VLM's overall ranking and its ranking on the target dataset, which hinders direct prediction of a model's dataset-specific performance from its general performance. We propose VLM Selection With gAp Bridging (SWAB) to mitigate the negative impact of these two gaps. SWAB first adopts optimal transport to capture the relevance between open-source and target datasets with a transportation matrix. It then uses this matrix to transfer useful statistics of VLMs from the open-source datasets to the target dataset in order to bridge the two gaps. By bridging the two gaps to obtain better substitutes for test images, SWAB can accurately predict the performance ranking of different VLMs on the target task without the need for the dataset's images. Experiments across various VLMs and image classification datasets validate SWAB's effectiveness.
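
As a rough illustration of the bridging step described in the abstract, the sketch below builds a transport matrix between open-source and target classes from text-embedding similarity, then estimates target-class gap vectors as transport-weighted averages of source-class gap vectors. It is a minimal sketch under stated assumptions, not the paper's implementation: the helper names (`sinkhorn`, `cosine_cost`, `transfer_gap_vectors`), the entropic Sinkhorn solver, the cosine cost, the uniform marginals, the value of `reg`, the column normalization, and the reading of a gap vector as "mean image embedding minus mean text embedding" per class are all illustrative choices that the abstract does not specify. Random arrays stand in for real embeddings.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropic optimal transport between uniform marginals.

    cost: (k_s, k_t) cost matrix. Returns a (k_s, k_t) transport plan.
    The solver and its hyperparameters are assumptions for illustration.
    """
    k_s, k_t = cost.shape
    a = np.full(k_s, 1.0 / k_s)      # uniform mass on source classes
    b = np.full(k_t, 1.0 / k_t)      # uniform mass on target classes
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones(k_s)
    v = np.ones(k_t)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows
        v = b / (K.T @ u)            # rescale columns
    return u[:, None] * K * v[None, :]

def cosine_cost(src_text, tgt_text):
    """Cost = 1 - cosine similarity between class-level text embeddings."""
    s = src_text / np.linalg.norm(src_text, axis=1, keepdims=True)
    t = tgt_text / np.linalg.norm(tgt_text, axis=1, keepdims=True)
    return 1.0 - s @ t.T

def transfer_gap_vectors(plan, src_gaps):
    """Estimate one gap vector per target class as a weighted average
    of source-class gap vectors, with weights taken from the plan."""
    weights = plan / plan.sum(axis=0, keepdims=True)  # each column sums to 1
    return weights.T @ src_gaps                       # shape (k_t, d)

# Toy data: 6 open-source classes, 3 target classes, 8-dim embeddings.
rng = np.random.default_rng(0)
k_s, k_t, d = 6, 3, 8
src_text = rng.normal(size=(k_s, d))   # placeholder source class text embeddings
tgt_text = rng.normal(size=(k_t, d))   # placeholder target class text embeddings
src_gaps = rng.normal(size=(k_s, d))   # placeholder per-class gap vectors (assumed definition)

plan = sinkhorn(cosine_cost(src_text, tgt_text))
tgt_gaps = transfer_gap_vectors(plan, src_gaps)
bridged_text = tgt_text + tgt_gaps     # text embeddings shifted toward the image modality
```

In this sketch, `bridged_text` plays the role of the "better substitutes for test images": the shifted embeddings could be fed to whatever downstream predictor estimates a VLM's performance on the target task.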

Paper Structure

This paper contains 34 sections, 17 equations, 6 figures, 10 tables, and 1 algorithm.

Introduction
Preliminary
Selecting VLMs from a Model Zoo
Possible Paradigms for LOVM
Analysis of the Two Gaps in LOVM
VLM Selection with Gap Bridging
Construct the Bridge Matrix Using Optimal Transport
Bridge the Modality Gap and Capability Gap
Summary of Swab
Experiments
Evaluation on LOVM Benchmark
Ablation Study
Influence of Key Components in Swab
Conclusion
LOVM Benchmark Details
...and 19 more sections

Figures (6)

Figure 1: Paradigm of Language-Only VLM Selection (LOVM). Users describe the details of their target tasks in text form, such as class names and image domains. Then, LOVM utilizes this information to generate class-related labeled texts through ChatGPT. These texts serve as substitutes for image samples in subsequent model selection algorithms. The model selection algorithm uses two types of data, including the open-source datasets (which have image and text data) and the text data from the target dataset, to predict the VLM's absolute or relative performance on a target dataset. It then selects the most appropriate VLM based on the predicted performance.
Figure 2: Validation Experiments on the Modality Gap and Capability Gap. (a) VLMs' predicted zero-shot image classification accuracy based on generated text data vs. their true accuracy based on test images. Each point in the graph represents a model. The predicted accuracy aligns poorly with the true accuracy, indicating that these text data are ineffective substitutes for image data. (b) We calculate the zero-shot image classification performance rankings of 43 VLMs across 23 datasets. We then compute, averaged over VLMs, the standard deviation of each VLM's ranking and the difference between each VLM's maximum and minimum ranking. The result shows that the performance of a VLM varies greatly across different datasets.
Figure 3: The workflow of Swab. Swab first constructs a transport matrix $\boldsymbol{\gamma}^{*}\in\mathbb{R}^{k_{\mathcal{S}}\times k_{\mathcal{T}}}$ using optimal transport, based on textual semantic similarity between classes in the open-source datasets $C_{\mathcal{S}} = \{c_1^{\mathcal{S}}, \cdots, c_{k_{\mathcal{S}}}^{\mathcal{S}}\}$ and the target dataset's classes $C_{\mathcal{T}} = \{c_1^{\mathcal{T}}, \cdots, c_{k_{\mathcal{T}}}^{\mathcal{T}}\}$. Using this matrix, Swab estimates VLM $f_m$'s class-specific gap vectors $\{\boldsymbol{\hat{g}}_{m,1}^{\mathcal{T}},\cdots\}$ on the target dataset $\mathcal{T}$ from the gap vectors $\boldsymbol{G}_m^{\mathcal{S}}\in\mathbb{R}^{k_{\mathcal{S}} \times d}$ in the open-source datasets. These estimated gap vectors help modify text data to act as more effective substitutes for image data. The modified text data will then be input into the Ranker Model $f_R$, which predicts VLM's performance $\hat{r}_{m}^{\mathcal{T},(1)}$ on the target dataset. Besides, Swab also uses the transport matrix $\boldsymbol{\gamma}^{*}$ to predict VLM's performance ranking on the target dataset based on VLM's class-specific rankings $\boldsymbol{r}_m^{\mathcal{S}}\in \mathbb{R}^{k_{\mathcal{S}}}$ on open-source datasets. Finally, Swab combines these two ranking predictions $\hat{r}_{m}^{\mathcal{T},(1)}$ and $\hat{r}_{m}^{\mathcal{T}, (2)}$ to determine the VLM's final ranking prediction.
Figure 4: Comparison of the consistency metrics between the accuracy calculated using text data before and after bridging the gap and the model's true accuracy. After bridging the modality gap, the text data act as better substitutes for image data to evaluate the model's performance.
Figure 5: The distribution of image-to-image (i2i) cosine similarity, text-to-text (t2t) cosine similarity, and image-to-text (i2t) cosine similarity values for different BEiT-3 and BLIP models.
...and 1 more figure
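
The Capability Gap statistics described in the Figure 2(b) caption can be sketched as follows: rank the models on each dataset, then average over models the standard deviation of each model's ranks and the spread between its best and worst rank. This is a hedged illustration only. The function names, the tie-breaking rule, the use of the population standard deviation, and the 3x3 accuracy matrix are assumptions made for the example and are not data from the paper.

```python
import numpy as np

def rank_models(acc):
    """acc: (n_models, n_datasets) accuracy matrix.

    Returns integer ranks of the same shape, where rank 1 is the most
    accurate model on that dataset. Ties go to the lower model index.
    """
    n_models, n_datasets = acc.shape
    order = np.argsort(-acc, axis=0, kind="stable")  # best model first, per column
    ranks = np.empty_like(order)
    for j in range(n_datasets):
        ranks[order[:, j], j] = np.arange(1, n_models + 1)
    return ranks

def capability_gap_stats(acc):
    """Mean per-model rank standard deviation and mean (max - min) rank spread."""
    ranks = rank_models(acc)
    mean_std = ranks.std(axis=1).mean()                            # population std per model
    mean_range = (ranks.max(axis=1) - ranks.min(axis=1)).mean()    # best-to-worst spread
    return mean_std, mean_range

# Hypothetical accuracies for 3 models (rows) on 3 datasets (columns).
acc = np.array([
    [0.9, 0.5, 0.7],
    [0.8, 0.6, 0.4],
    [0.7, 0.7, 0.9],
])
ranks = rank_models(acc)
mean_std, mean_range = capability_gap_stats(acc)
```

Large values of either statistic mean that a model's average standing says little about its standing on any one dataset, which is the point the caption makes.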
