Towards Efficient Vision-Language Tuning: More Information Density, More Generalizability

Tianxiang Hao; Mengyao Lyu; Hui Chen; Sicheng Zhao; Xiaohan Ding; Jungong Han; Guiguang Ding

TL;DR

This work introduces Information Density (ID) as a metric for the information concentration in prompt-tuning and proposes Dense Information Prompt (DIP) to boost ID and hence generalization in vision-language models. DIP uses a low-rank prompt factorization with a concurrent full-rank initialization path and light regularization to achieve strong performance with roughly 0.5K trainable parameters, outperforming several baselines across 11 datasets and multiple generalization settings. Empirical results show a high correlation between ID and unseen-class generalization (Spearman ρ ≥ 0.9) and demonstrate DIP’s efficiency: reduced training/storage, competitive or superior accuracy, and robust performance under domain shifts and few-shot scenarios. The approach is plug-and-play, simple to implement, and yields substantial practical impact for resource-constrained adaptation of large vision-language models.

Abstract

With the advancement of large pre-trained vision-language models, effectively transferring the knowledge embedded in these foundation models to downstream tasks has become a pivotal topic, particularly in data-scarce environments. Recently, parameter-efficient fine-tuning approaches, especially prompt tuning, have garnered considerable attention. To better understand the nature of prompt tuning, we propose the concept of "Information Density" (ID), which indicates whether a matrix is concentrated in a few feature spaces rather than spread evenly across many. We hypothesize that a higher ID, with a strong bias toward some feature spaces, naturally leads to better robustness and stability. Motivated by the observation that generalizability is closely linked to the information density of the prompt matrix, we introduce the Dense Information Prompt (DIP). DIP aims to increase information density in order to improve generalization. Furthermore, DIP significantly reduces the number of tunable parameters and the required storage space, making it particularly advantageous in resource-constrained settings. Comprehensive experiments substantiate the superiority of DIP. Notably, DIP surpasses the latest state-of-the-art methods by a substantial margin with an exceptionally small parameter count. Across a range of tasks spanning 11 datasets, DIP improves the average downstream accuracy of classic prompt tuning by up to 5.76% using merely 0.5K parameters.

Paper Structure (31 sections, 6 figures, 11 tables)

Introduction
Related Works
Vision-Language Models
Prompt Tuning
Relationship between Information Density and Generalizability
A Review of Prompt Tuning for CLIP
Information Density in Prompt Tuning
Methodology
Dense Information Prompt
Algorithms for increasing information density
Special Initialization
Regularization
Efficiency Analysis
Experiments
Main Results
...and 16 more sections

Figures (6)

Figure 1: Relationship between generalizability, measured by the test accuracy on unseen classes during training, and Information Density (ID). As generalizability increases, ID also increases. The Spearman correlation coefficient $\rho$ between generalizability and ID1/ID2 is very high, i.e., $\rho \geq 0.9$.
Figure 2: To switch from classic prompt tuning to DIP tuning, just replace the ordinary prompts with DIPs. DIP introduces two small parameter matrices of shape $[n,r]$ and $[r,d]$, respectively, and uses their product as an equivalent prompt of shape $[n,d]$. For prompts with a special initialization, e.g. the hand-crafted template "a photo of a" for the text prompts, we add a concurrent full-rank prompt branch alongside the proposed low-rank prompts. Because the gradient of the newly added branch is turned off, training starts from a promising initial point while neither the number of tunable parameters nor the number of stored parameters increases. The Dropout layer effectively regularizes the update of the low-rank prompts and alleviates overfitting and catastrophic forgetting. Dropout is a lightweight, parameter-free layer that becomes an identity layer at inference, so its cost is negligible.
Figure 3: Few-shot learning results.
Figure 4: Top: Effect of training epochs. Bottom: Effect of dropout ratios.
Figure 5: Results on ResNet-50 encoded CLIP.
...and 1 more figures
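
The Figure 2 caption describes the DIP construction in words; the following is a minimal NumPy sketch of that idea, not the authors' implementation. The class name `DenseInformationPrompt`, the zero initialization of the second factor, the placement of dropout on the low-rank product, and the sizes used below are all assumptions made for illustration.

```python
import numpy as np


class DenseInformationPrompt:
    """Illustrative low-rank prompt with a frozen full-rank branch (hypothetical names)."""

    def __init__(self, n, d, r, init_prompt=None, dropout=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        # Two small trainable factors of shape [n, r] and [r, d].
        self.A = self.rng.normal(0.0, 0.02, size=(n, r))
        # Assumption: B starts at zero so the initial prompt equals the frozen branch.
        self.B = np.zeros((r, d))
        # Frozen full-rank branch, e.g. the embedding of a hand-crafted template.
        # It receives no gradient, so it is not counted as trainable.
        if init_prompt is None:
            self.frozen = np.zeros((n, d))
        else:
            self.frozen = np.asarray(init_prompt, dtype=float)
        self.dropout = dropout

    def num_trainable(self):
        # Only the two low-rank factors are tuned: n*r + r*d values.
        return self.A.size + self.B.size

    def forward(self, training=False):
        low_rank = self.A @ self.B  # equivalent prompt of shape [n, d]
        if training and self.dropout > 0.0:
            # Inverted dropout on the low-rank update (placement is an assumption).
            keep = self.rng.random(low_rank.shape) >= self.dropout
            low_rank = low_rank * keep / (1.0 - self.dropout)
        # With training=False, dropout is skipped and acts as an identity.
        return self.frozen + low_rank


# Hypothetical sizes: 4 prompt tokens, 512-dim embeddings, rank 1.
prompt = DenseInformationPrompt(n=4, d=512, r=1, init_prompt=np.ones((4, 512)))
print(prompt.num_trainable())   # 4*1 + 1*512 trainable values
print(prompt.forward().shape)   # shape of the equivalent prompt
```

With these hypothetical sizes the trainable count is 4·1 + 1·512 = 516, the same order as the 0.5K figure quoted in the abstract, whereas a full-rank prompt of the same shape would hold 4·512 = 2048 values. The configuration actually used in the paper is not stated in this section.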
