Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective

Yanan Zhang, Jiangmeng Li, Lixiang Liu, Wenwen Qiang

TL;DR

This work revisits the pre-training and adaptation processes of CLIP, develops a structural causal model of them, and proposes Causality-Guided Semantic Decoupling and Classification (CDC) to mitigate the interference of task-irrelevant knowledge.

Abstract

Foundational vision-language models such as CLIP have exhibited impressive generalization in downstream tasks. However, CLIP suffers from a two-level misalignment issue, i.e., task misalignment and data misalignment, when adapting to specific tasks. Soft prompt tuning has mitigated the task misalignment, yet the data misalignment remains a challenge. To analyze the impact of the data misalignment, we revisit the pre-training and adaptation processes of CLIP and develop a structural causal model. We discover that, although we expect to capture task-relevant information for downstream tasks accurately, task-irrelevant knowledge affects the prediction results and hampers the modeling of the true relationships between the images and the predicted classes. As task-irrelevant knowledge is unobservable, we leverage the front-door adjustment and propose Causality-Guided Semantic Decoupling and Classification (CDC) to mitigate the interference of task-irrelevant knowledge. Specifically, we decouple the semantics contained in the data of downstream tasks and perform classification based on each decoupled semantic. Furthermore, we employ the Dempster-Shafer evidence theory to evaluate the uncertainty of each prediction generated by the diverse semantics. Experiments conducted in multiple settings consistently demonstrate the effectiveness of CDC.
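
For readers unfamiliar with the front-door adjustment mentioned in the abstract, the standard textbook form of the formula is reproduced below. The symbols are assumptions made for illustration: $X$ stands for the input, $Y$ for the prediction, and $S$ for an observable mediator lying on the path from $X$ to $Y$. How the paper maps its own variables onto these roles is not stated in this section, so this is the generic identity rather than the paper's specific equation.

```latex
% Front-door adjustment (generic form; X, S, Y are illustrative symbols).
% The unobserved confounder between X and Y is bypassed by routing the
% effect through the observable mediator S.
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{s} P(S = s \mid X = x)
    \sum_{x'} P\bigl(Y \mid X = x', S = s\bigr)\, P(X = x')
```

The outer sum averages over mediator values reached from $x$, while the inner sum re-weights the prediction by the marginal distribution of the input, which is what removes the influence of the unobserved variable.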

Paper Structure

This paper contains 21 sections, 12 equations, 6 figures, 7 tables, 2 algorithms.

Figures (6)

  • Figure 1: (a) A motivating example of task misalignment, illustrating the cosine similarities between an image and various text descriptions in the embedding space of CLIP. (b) A motivating experiment on data misalignment, showing the accuracy trends for base and new classes across different training epochs on the DTD dataset.
  • Figure 2: SCMs. Solid and dashed circles indicate the observable and unobservable variables, respectively.
  • Figure 3: Framework of CDC. $t^m$ denotes a single template, while $p_1, p_2, \dots, p_d$ represent tokens in the template. Different colors indicate diverse templates. "fuse" refers to the process of generating the final classification results from multiple template results as shown in Equation \ref{eq:cdc_totalfuse}. The text encoder and the image encoder are frozen, and only the tokens in the prompt templates are learnable.
  • Figure 4: The impact of $\beta$ on performance.
  • Figure 5: The impact of $\gamma$ on performance.
  • ...and 1 more figure
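
The abstract and the Figure 3 caption describe evaluating the uncertainty of each template's prediction with Dempster-Shafer evidence theory and then fusing the per-template results. The paper's actual fusion equation is not included in this section, so the following is only a minimal sketch of one common evidential formulation (Dirichlet-based belief masses combined with a reduced Dempster's rule). The function names `opinion_from_evidence` and `combine`, and the example evidence values, are hypothetical.

```python
import numpy as np


def opinion_from_evidence(evidence):
    """Map non-negative per-class evidence to (belief masses, uncertainty).

    Assumes Dirichlet parameters alpha = evidence + 1, so that the belief
    masses and the uncertainty mass sum to one.
    """
    evidence = np.asarray(evidence, dtype=float)
    num_classes = evidence.shape[0]
    strength = evidence.sum() + num_classes  # Dirichlet strength S
    belief = evidence / strength             # b_k = e_k / S
    uncertainty = num_classes / strength     # u = K / S
    return belief, uncertainty


def combine(b1, u1, b2, u2):
    """Reduced Dempster's rule for two opinions over the same classes."""
    # Conflict: mass the two opinions assign to different classes.
    conflict = b1.sum() * b2.sum() - (b1 * b2).sum()
    scale = 1.0 / (1.0 - conflict)
    belief = scale * (b1 * b2 + b1 * u2 + b2 * u1)
    uncertainty = scale * (u1 * u2)
    return belief, uncertainty


# Hypothetical evidence from two prompt templates over three classes.
b_a, u_a = opinion_from_evidence([9.0, 1.0, 0.0])   # confident template
b_b, u_b = opinion_from_evidence([1.0, 1.0, 0.5])   # uncertain template
b_fused, u_fused = combine(b_a, u_a, b_b, u_b)
predicted_class = int(np.argmax(b_fused))
```

Under this rule the fused uncertainty never exceeds that of either input opinion, so a confident template dominates an uncertain one instead of being averaged with it.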