COMMA: Co-Articulated Multi-Modal Learning

Lianyu Hu, Liqing Gao, Zekang Liu, Chi-Man Pun, Wei Feng

TL;DR

COMMA tackles prompt optimization for large vision-language models in two ways: it correlates the prompts of the vision and language branches, and it preserves pretrained generic knowledge by aligning learned prompts with hand-crafted ones in the late transformer layers. Vision prompts are generated from the preceding prompts of both branches via $P_i^v = \mathrm{softmax}\left(\frac{P_{i-1}^v \cdot P_{i-1}^l}{\sqrt{P}}\right) P_{i-1}^l$. The prompt-level knowledge-transfer objective enters the total loss as $\mathcal{L}_{\rm Total} = \mathcal{L}_{\rm ce} + \lambda \sum_{i=0}^{S} (1 - \mathcal{L}_{\rm kd}^i)$ with $\mathcal{L}_{\rm kd} = \mathrm{Sim}(P_s^l, P_s^{\rm CLIP})$; because $\mathcal{L}_{\rm kd}$ is a similarity, minimizing $1 - \mathcal{L}_{\rm kd}$ pulls the learned prompts toward the hand-crafted CLIP prompts. In experiments on base-to-novel generalization, cross-dataset transfer, and domain generalization, COMMA outperforms state-of-the-art prompt methods with modest computational overhead, showing improved robustness to novel concepts and domain shifts.
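The vision-prompt formula above has the shape of a cross-attention step, which a small example may make more concrete. The sketch below is not the authors' implementation. It assumes that the product $P_{i-1}^v \cdot P_{i-1}^l$ means each vision prompt is dotted with each language prompt, that both prompt sets share one dimension $d$, and that the scaling term is $\sqrt{d}$. The function names `softmax` and `next_vision_prompts` are hypothetical, and plain Python lists are used only because the toy sizes are tiny.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def next_vision_prompts(p_v, p_l):
    """Vision prompts (queries) attend over language prompts (keys and values).

    p_v: list of n_v vision prompt vectors, each of length d
    p_l: list of n_l language prompt vectors, each of length d
    Returns n_v new vision prompt vectors of length d.
    """
    d = len(p_l[0])
    scale = math.sqrt(d)  # assumption: scale by the square root of the prompt dimension
    out = []
    for q in p_v:
        # Scaled dot product of one vision prompt with every language prompt.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in p_l]
        weights = softmax(scores)
        # Weighted sum of the language prompts gives the new vision prompt.
        out.append([sum(w * k[j] for w, k in zip(weights, p_l)) for j in range(d)])
    return out


# Toy inputs (hypothetical values): 2 vision prompts, 3 language prompts, d = 2.
p_v = [[1.0, 0.0], [0.0, 1.0]]
p_l = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
new_p_v = next_vision_prompts(p_v, p_l)
```

Each new vision prompt is a convex combination of the language prompts, weighted toward those most similar to the preceding vision prompt, which is one way to read "correlating" the two branches.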

Abstract

Pretrained large-scale vision-language models such as CLIP have demonstrated excellent generalizability over a series of downstream tasks. However, they are sensitive to variations in the input text prompts and need a careful selection of prompt templates to achieve satisfactory performance. Recently, various methods have been proposed to dynamically learn the prompts as the textual inputs, avoiding laborious hand-crafted prompt engineering in the fine-tuning process. We notice that these methods are suboptimal in two aspects. First, the prompts of the vision and language branches in these methods are usually separated or uni-directionally correlated. Thus, the prompts of both branches are not fully correlated and may not provide enough guidance to align the representations of both branches. Second, most previous methods achieve better performance on seen classes but cause performance degradation on unseen classes compared to CLIP. This is because the essential generic knowledge learned in the pretraining stage is partly forgotten in the fine-tuning process. In this paper, we propose Co-Articulated Multi-Modal Learning (COMMA) to handle the above limitations. Specifically, our method considers prompts from both branches when generating new prompts, to enhance the representation alignment of both branches. In addition, to alleviate forgetting of the essential knowledge, we minimize the feature discrepancy between the learned prompts and the embeddings of hand-crafted prompts in the pretrained CLIP in the late transformer layers. We evaluate our method across three representative tasks: generalization to novel classes, new target datasets, and unseen domain shifts. Experimental results demonstrate the superiority of our method, which exhibits a favorable performance boost on all tasks with high efficiency.
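The knowledge-preservation idea in the abstract can be illustrated with the total loss $\mathcal{L}_{\rm Total} = \mathcal{L}_{\rm ce} + \lambda \sum (1 - \mathcal{L}_{\rm kd})$. The sketch below is a minimal reading of that formula, not the authors' code. It assumes that $\mathrm{Sim}$ is cosine similarity and that the prompts of each late layer have been flattened into a single vector. The names `cosine_similarity` and `total_loss`, and all numeric values, are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed form of Sim)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def total_loss(ce_loss, learned_prompts, clip_prompts, lam):
    """Cross-entropy plus a weighted sum of (1 - similarity) over the late layers.

    learned_prompts / clip_prompts: one flattened prompt vector per late layer,
    for the learned prompts and the hand-crafted CLIP prompts respectively.
    lam: the weighting factor lambda (value chosen by the user).
    """
    kd_penalty = sum(
        1.0 - cosine_similarity(p, q)
        for p, q in zip(learned_prompts, clip_prompts)
    )
    return ce_loss + lam * kd_penalty


# Learned prompts identical to the CLIP prompts: no penalty is added.
aligned = total_loss(0.5, [[1.0, 0.0]], [[1.0, 0.0]], lam=1.0)

# One of two layers has drifted to an orthogonal direction: penalty of 1 for that layer.
drifted = total_loss(0.5, [[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]], lam=2.0)
```

Under these assumptions the penalty is zero when the learned prompts point in the same direction as the hand-crafted ones and grows as they drift apart, which matches the stated goal of limiting forgetting during fine-tuning.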

Paper Structure (28 sections, 9 equations, 4 figures, 7 tables)

This paper contains 28 sections, 9 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: COMMA outperforms state-of-the-art methods across 10/11 diverse image recognition datasets on the base-to-novel generalization task.
  • Figure 2: Overview of COMMA. Here, $\mathcal{L}_{ce}$ denotes the cross-entropy loss and $\mathcal{L}_{kd}$ represents the knowledge distillation loss between two branches. COMMA generates the prompts of the vision branch from the preceding prompts of both branches, aggregating beneficial multi-modal information to guide their representation alignment. It also lets the learned prompts approximate the hand-crafted prompts in the pretrained CLIP model to preserve generic knowledge.
  • Figure 3: Relationship between the degree of performance degradation $\Delta {\rm Acc}$ and the distance between the learnable prompts in CoOp and the hand-crafted prompts in the pretrained CLIP, across different layers and over 11 datasets.
  • Figure 4: Relationship between base-class, novel-class, and harmonic-mean accuracy and the number of reciprocal layers ($S$). Accuracies are averaged over 11 datasets.
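For readers unfamiliar with the harmonic-mean accuracy reported alongside base-class and novel-class accuracy, the short sketch below shows how such a score is typically computed. It assumes the harmonic mean is taken over the base and novel accuracies, as is common in base-to-novel benchmarks. The function name and the accuracy values are hypothetical and are not results from the paper.

```python
def harmonic_mean_accuracy(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy (same units in and out)."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)


# Hypothetical accuracies in percent, for illustration only.
h = harmonic_mean_accuracy(80.0, 60.0)
print(round(h, 2))  # 68.57
```

The harmonic mean is pulled toward the lower of the two values, so a method cannot score well by improving base-class accuracy while novel-class accuracy collapses.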