
CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention

Ziyu Guo, Renrui Zhang, Longtian Qiu, Xianzheng Ma, Xupeng Miao, Xuming He, Bin Cui

TL;DR

This work tackles the inefficiency of improving CLIP's zero-shot performance through task-specific downstream training. It introduces CALIP, a parameter-free cross-modal attention module that lets visual and textual features interact bidirectionally using CLIP's intermediate representations, eliminating the need for extra data or training. CALIP-FS extends CALIP with lightweight trainable projections for few-shot scenarios, achieving competitive or superior results across 2D and 3D benchmarks. Experiments and ablation studies on 14 datasets show consistent, training-free gains over CLIP in zero-shot classification.

Abstract

Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual representations with great transferability, achieving promising accuracy for zero-shot classification. To further improve its downstream performance, existing works propose additional learnable modules on top of CLIP and fine-tune them on few-shot training sets. However, the resulting extra training cost and data requirement severely hinder the efficiency of model deployment and knowledge transfer. In this paper, we introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free attention module. Specifically, we guide visual and textual representations to interact with each other and explore cross-modal informative features via attention. As the pre-training has largely reduced the embedding distances between the two modalities, we discard all learnable parameters in the attention and bidirectionally update the multi-modal features, making the whole process parameter-free and training-free. In this way, the image features are blended with text-aware signals and the text representations become visually guided for better adaptive zero-shot alignment. We evaluate CALIP on various benchmarks of 14 datasets for both 2D image and 3D point cloud few-shot classification, showing consistent zero-shot performance improvement over CLIP. Based on that, we further insert a small number of linear layers in CALIP's attention module and verify our robustness under the few-shot settings, which also achieves leading performance compared to existing methods. These extensive experiments demonstrate the superiority of our approach for efficient enhancement of CLIP.

Paper Structure (36 sections, 7 equations, 8 figures, 11 tables)

Figures (8)

  • Figure 1: Visualization of Parameter-free Attention and the Interacted Features. Without any parameters, CALIP's cross-modal attention map (Left-Bottom) shows favorable weight distributions over the main objects and effectively updates both visual and textual features: pixels within objects of ground-truth labels are enhanced and the corresponding category features in red are strengthened.
  • Figure 2: The Pipeline of CALIP. We introduce a parameter-free attention module for zero-shot enhancement of CLIP and require no extra data or training for downstream tasks. CALIP utilizes pre-trained encoders to extract the spatial visual features of the input image and the $K$-category textual features. Then, the proposed attention module updates their representations via cross-modal interactions and outputs the final zero-shot prediction as a weighted sum of three classification logits.
  • Figure 3: Structures of Parameter-free (Left) and Parametric Attention (Right). Parameter-free attention directly obtains the cross-modal attention map $A$ by matrix multiplication and bidirectionally updates two features for zero-shot classification. Parametric attention is equipped with both pre-projection and post-projection layers for better few-shot performance.
  • Figure 4: Zero-Shot Performance (%) of CALIP on Eleven 2D Datasets. Our zero-shot CALIP can consistently outperform CLIP and even surpass some methods with few-shot fine-tuning. "Linear." and "CLIP-A." denote Linear-probe CLIP and CLIP-Adapter, respectively.
  • Figure 5: Zero-Shot Performance (%) of CALIP on Three 3D Datasets. We extend CALIP for 3D point cloud recognition based on PointCLIP under zero-shot settings, where CALIP shows stable performance enhancement.
  • ...and 3 more figures
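
The captions of Figures 2 and 3 describe the core mechanism: an attention map $A$ obtained by a plain matrix multiplication between spatial visual features and $K$-category textual features, a bidirectional update of both feature sets, and a final prediction formed as a weighted sum of three classification logits. The NumPy sketch below illustrates that flow under several assumptions that are not stated in this summary: the function names, the softmax temperatures `tau_t` and `tau_v`, the mean pooling of the updated spatial features, the particular choice of three logit terms, and the `betas` weights are all hypothetical and may differ from the paper's exact formulation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products act as cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def parameter_free_attention(f_s, f_t, tau_t=1.0, tau_v=1.0):
    """Cross-modal attention with no learnable weights (illustrative sketch).

    f_s: (N, C) spatial visual features, one row per image location.
    f_t: (K, C) textual features, one row per category.
    tau_t, tau_v: hypothetical softmax temperatures (assumed, not from the source).
    """
    attn = f_s @ f_t.T                               # (N, K) attention map A
    f_s_a = softmax(attn / tau_t, axis=-1) @ f_t     # (N, C) text-aware visual features
    f_t_a = softmax(attn.T / tau_v, axis=-1) @ f_s   # (K, C) visually guided text features
    return attn, f_s_a, f_t_a


def calip_logits(f_v, f_s, f_t, betas=(1.0, 1.0, 1.0)):
    """Weighted sum of three classification logits (assumed combination).

    f_v: (C,) global visual feature of the image.
    betas: hypothetical weights for the three logit terms.
    """
    _, f_s_a, f_t_a = parameter_free_attention(f_s, f_t)
    f_v_a = l2_normalize(f_s_a.mean(axis=0))         # pooled, updated visual feature (assumed mean pooling)
    f_t_a = l2_normalize(f_t_a)
    b1, b2, b3 = betas
    return (b1 * (f_v @ f_t.T)        # original CLIP-style logits
            + b2 * (f_v @ f_t_a.T)    # original image vs. updated text
            + b3 * (f_v_a @ f_t.T))   # updated image vs. original text


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f_s = l2_normalize(rng.normal(size=(49, 16)))    # e.g. a 7x7 feature grid, 16 channels
    f_t = l2_normalize(rng.normal(size=(5, 16)))     # 5 hypothetical categories
    f_v = l2_normalize(f_s.mean(axis=0))
    print(calip_logits(f_v, f_s, f_t).shape)
```

Random vectors stand in for encoder outputs here, so the resulting scores carry no meaning; the point is only that every step is a matrix product, a softmax, or a pooling operation, which is why nothing needs to be trained. Setting `betas=(1.0, 0.0, 0.0)` reduces the sketch to the plain similarity between the global image feature and the original text features.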