P4Q: Learning to Prompt for Quantization in Visual-language Models

Huixin Sun, Runqi Wang, Yanjing Li, Xianbin Cao, Xiaolong Jiang, Yao Hu, Baochang Zhang

TL;DR

This work proposes ``Prompt for Quantization'' (P4Q), a method that balances fine-tuning and quantization: a lightweight architecture leverages contrastive loss supervision to enhance the recognition performance of a PTQ model.

Abstract

Large-scale pre-trained Vision-Language Models (VLMs) have gained prominence in various visual and multimodal tasks, yet deploying VLMs on downstream application platforms remains challenging due to their prohibitive requirements for training samples and computing resources. Fine-tuning and quantization of VLMs, which can substantially reduce the sample and computation costs, are therefore urgently needed. There are two prevailing quantization paradigms: Quantization-Aware Training (QAT) can effectively quantize large-scale VLMs but incurs a huge training cost, while low-bit Post-Training Quantization (PTQ) suffers from a notable performance drop. We propose a method that balances fine-tuning and quantization, named ``Prompt for Quantization'' (P4Q), in which we design a lightweight architecture that leverages contrastive loss supervision to enhance the recognition performance of a PTQ model. Our method effectively reduces the gap between image features and text features caused by low-bit quantization, using learnable prompts to reorganize textual representations and a low-bit adapter to realign the distributions of image and text features. We also introduce a distillation loss based on cosine similarity predictions to distill the quantized model using a full-precision teacher. Extensive experimental results demonstrate that P4Q outperforms prior art, even achieving results comparable to its full-precision counterparts. For instance, our 8-bit P4Q can theoretically compress CLIP-ViT/B-32 by $4\times$ while achieving 66.94\% Top-1 accuracy on the ImageNet dataset, outperforming the learnable-prompt fine-tuned full-precision model by 2.24\% with negligible additional parameters.
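The theoretical compression figure quoted for the 8-bit model follows from storing each weight in 8 bits instead of 32. The sketch below works through that arithmetic and shows one generic way to map floating-point values onto a low-bit integer grid. It assumes 32-bit full precision and a symmetric uniform quantizer; the names `compression_ratio` and `quantize_uniform` are illustrative and are not taken from the paper, whose actual quantizer is not described in this abstract.

```python
def compression_ratio(full_bits: int, low_bits: int) -> float:
    """Theoretical storage compression when each value shrinks from full_bits to low_bits."""
    return full_bits / low_bits


def quantize_uniform(values, bits):
    """Generic symmetric uniform quantizer (an assumption, not the paper's scheme).

    Returns the integer codes, the dequantized floats, and the step size.
    """
    qmax = 2 ** (bits - 1) - 1                 # largest signed code, e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest grid point and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    dequantized = [c * scale for c in codes]
    return codes, dequantized, scale


if __name__ == "__main__":
    print(compression_ratio(32, 8))            # 4.0
    weights = [0.5, -1.0, 0.25]                # toy values, not real model weights
    codes, approx, step = quantize_uniform(weights, bits=8)
    print(codes, approx, step)
```

With fewer bits the step size grows, so the rounding error per value grows as well; this is the kind of degradation that low-bit PTQ has to absorb.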

Paper Structure (15 sections, 14 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Cosine similarity predictions between text and image features. The horizontal axis shows 8 images; their descriptive texts are listed in the same order along the vertical axis. Each value is the similarity between the corresponding image features and text features. The brighter the cell, the stronger the similarity between the image and the text at those horizontal and vertical coordinates. (a) Image features and text features are encoded by full-precision CLIP. (b) Image features and text features are encoded by PTQ-quantized CLIP. (c) Image features and text features are encoded by P4Q-quantized CLIP.
  • Figure 2: Overview of P4Q. The blue parts represent the visual stream, and the green parts represent the textual stream. The learnable parameters are marked with a fire icon. QFC denotes the quantized fully connected layer. Panel (a) shows the structure of quantized CLIP. Panel (b) shows the knowledge distillation module. $f_\theta$ and $g_\psi$ represent the full-precision image and text encoders, respectively. $\hat{f}_\theta$ and $\hat{g}_\psi$ represent the low-bit image and text encoders, respectively.
  • Figure 3: Histograms of image features and text features of the same class (‘wolf’) on CIFAR100, where the horizontal axis denotes the value of each element of the feature vectors, and the vertical axis denotes the number of elements. Mean and Std Dev denote the mean and standard deviation of the distribution, respectively.
  • Figure 4: Overall structure of QAdapter.
  • Figure 5: Effect of prompt length and adapt ratio {M, $\alpha$}.
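
The grid described in the Figure 1 caption can be sketched as a matrix of pairwise cosine similarities, with one row per text and one column per image. The snippet below is a minimal illustration on toy 2-D vectors; the function names and the example features are hypothetical, and real CLIP features would come from the image and text encoders rather than being written by hand.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(text_features, image_features):
    """Rows index texts and columns index images, matching the layout in the caption."""
    return [[cosine_similarity(t, i) for i in image_features] for t in text_features]


if __name__ == "__main__":
    # Toy features (hypothetical): each text is aligned with exactly one image.
    image_features = [[1.0, 0.0], [0.0, 1.0]]
    text_features = [[2.0, 0.0], [0.0, 3.0]]
    for row in similarity_matrix(text_features, image_features):
        print(row)
```

A well-aligned model gives a bright diagonal in this matrix, since each text is most similar to its own image; the caption contrasts how that pattern looks for full-precision, PTQ-quantized, and P4Q-quantized CLIP.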