Causality-guided Prompt Learning for Vision-language Models via Visual Granulation

Mengyu Gao, Qiulei Dong

TL;DR

CaPL is a causality-guided text prompt learning method for CLIP based on visual granulation. It constructs sets of visual granules that let the text prompt capture subtle discrepancies among fine-grained classes through causal inference.

Abstract

Prompt learning has recently attracted much attention for adapting pre-trained vision-language models (e.g., CLIP) to downstream recognition tasks. However, most existing CLIP-based prompt learning methods show only a limited ability to handle fine-grained datasets. To address this issue, we propose a causality-guided text prompt learning method via visual granulation for CLIP, called CaPL, in which the explored visual granulation technique constructs sets of visual granules that let the text prompt capture subtle discrepancies among different fine-grained classes through causal inference. The CaPL method contains two modules: (1) an attribute disentanglement module, which decomposes visual features into non-individualized attributes (shared by some classes) and individualized attributes (specific to single classes) using a Brownian Bridge Diffusion Model (BBDM); and (2) a granule learning module, which constructs visual granules by integrating these attributes for recognition under two causal inference strategies. Thanks to the learned visual granules, a more discriminative text prompt is expected to be learned. Extensive experimental results on 15 datasets demonstrate that our CaPL method significantly outperforms the state-of-the-art prompt learning methods, especially on fine-grained datasets.
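
For readers unfamiliar with how a CLIP-style model performs recognition, the following is a minimal sketch of the standard scoring rule that prompt learning builds on: a visual feature is compared with one prompted textual feature per class by cosine similarity, and a softmax turns the scaled similarities into class probabilities. This is generic CLIP-style scoring, not the CaPL method itself. The function names, the toy three-dimensional features, and the scale value of 100 are assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def class_probabilities(x, prompts, scale=100.0):
    """Softmax over scaled cosine similarities between a visual feature x
    and one prompted textual feature per class.

    `scale` plays the role of an inverse temperature; 100.0 is an assumed value.
    """
    x = l2_normalize(x)
    logits = [
        scale * sum(a * b for a, b in zip(x, l2_normalize(p)))
        for p in prompts
    ]
    # Subtract the maximum logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: one visual feature and three class prompts (illustrative values).
x = [0.9, 0.1, 0.0]
prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = class_probabilities(x, prompts)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

In prompt learning, the textual features are produced from a learnable text prompt plus the class names, so training adjusts the prompt to change these similarities while the pre-trained encoders stay fixed.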

Paper Structure

This paper contains 12 sections, 8 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Causal graphs of (a) attribute disentanglement and (b) attribute-driven prompt learning for recognition.
  • Figure 2: Architecture of CaPL, where (a) is the training stage and (b) is the inference stage. $\mathbf{x}_i$ and $\mathbf{x}$ are visual features, $\mathbf{s}_i$ and $\mathbf{d}_i$ are the non-individualized and individualized attribute representations, $\mathbf{p}_1,...,\mathbf{p}_C$ are the prompted textual features generated from a learnable text prompt and class names, $C$ is the number of classes, and the "lock" symbol denotes that the corresponding parameters are fixed.
  • Figure 3: Architecture of the attribute disentanglement module, which contains two encoders $E_s,E_d$ that extract the non-individualized and individualized attribute representations $\mathbf{s}_i, \mathbf{d}_i$ from the visual feature $\mathbf{x}_i$, respectively, and a BBDM-based network. The upper feature transfer process of the BBDM is the diffusion process, which generates latent features $\mathbf{z}_0,...,\mathbf{z}_T$. The lower one is the reverse process, which gradually generates reconstructed features $\hat{\mathbf{z}}_T,...,\hat{\mathbf{z}}_0$. $\mathcal{T}_\theta$ is a learnable transfer model, and $\mathcal{L}_A$ is the training loss.
  • Figure 4: Architecture of the granule learning module, which has two forms for (a) factual intervention and (b) counterfactual intervention. "$Q$" is the query process, $\{\mathbf{a}_{d,i}^k\}_{k=1}^K$ and $\{\mathbf{a}_{p,c}^k\}_{c=1,k=1}^{C,K}$ are the visual and textual representations of each individualized attribute, $\{\mathbf{d}_i\}_{i=1}^N$ are the $N$ individualized attributes in a training batch, $D$ is the decoder that generates visual granules, and $\mathcal{L}_{fac},\mathcal{L}_{con}$ are training losses.
  • Figure 5: t-SNE visualizations of 10 images from 10 different classes of the StanfordCars dataset. (a) visualizes the non-individualized attribute representations (denoted as "$\circ$") and individualized attribute representations (denoted as "$\square$"). (b) visualizes the input visual features (denoted as "$\triangle$") of the 10 images, and the counterfactual granules (denoted as "$\times$") constructed by swapping the attributes across the 10 images.
  • ...and 1 more figures
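
The captions of Figures 4 and 5 describe counterfactual granules as being built by swapping attributes across images and passing them through a decoder $D$. The sketch below illustrates only that swapping idea under loud assumptions: the cyclic-shift pairing, the function names, and the elementwise-sum `toy_decoder` are hypothetical stand-ins, and the query process $Q$ and the losses $\mathcal{L}_{fac},\mathcal{L}_{con}$ are not modeled. It should not be read as the authors' implementation.

```python
import random


def swap_individualized(s_batch, d_batch, rng):
    """Pair each non-individualized attribute s_i with an individualized
    attribute d_j taken from a different sample (j != i).

    A cyclic shift by a random non-zero offset guarantees j != i.
    This pairing rule is an assumption made for illustration.
    """
    n = len(d_batch)
    offset = rng.randrange(1, n)  # offset in 1..n-1, so no sample keeps its own d
    donors = [(i + offset) % n for i in range(n)]
    pairs = [(s_batch[i], d_batch[j]) for i, j in enumerate(donors)]
    return pairs, donors


def toy_decoder(s, d):
    """Hypothetical placeholder for the decoder D: elementwise sum of s and d."""
    return [a + b for a, b in zip(s, d)]


# Toy batch of N = 3 samples with 2-dimensional attributes (illustrative values).
s_batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d_batch = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

rng = random.Random(0)
pairs, donors = swap_individualized(s_batch, d_batch, rng)
counterfactual_granules = [toy_decoder(s, d) for s, d in pairs]
```

The `donors` list records which sample each individualized attribute came from, which is the information a training objective would need in order to relate a counterfactual granule to the sample that supplied its individualized part.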