IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning

Soumya Suvra Ghosal, Samyadeep Basu, Soheil Feizi, Dinesh Manocha

TL;DR

IntCoOp addresses the interpretability gap in prompt-tuning for vision-language models by jointly learning attribute-level inductive biases and class embeddings. It introduces an attribute extractor and instance-conditioned cross-attention to generate interpretable prompts that integrate compositional attributes into the CLIP framework, guided by a multi-term loss including a structured regularization term. Across 10 datasets in few-shot regimes and under domain shifts, IntCoOp achieves substantial gains over prior prompt-tuning methods, including notable improvements in 16-shot settings and domain generalization (e.g., up to 19.32% average gains over PLOT). The work demonstrates that attribute-informed prompts not only improve accuracy but also yield prompts that are interpretable, aligning with human-understandable compositional concepts.

Abstract

Image-text contrastive models such as CLIP learn transferable and robust representations for zero-shot transfer to a variety of downstream tasks. However, to obtain strong downstream performances, prompts need to be carefully curated, which can be a tedious engineering task. To address the issue of manual prompt engineering, prompt-tuning is used where a set of contextual vectors are learned by leveraging information from the training data. Despite their effectiveness, existing prompt-tuning frameworks often lack interpretability, thus limiting their ability to understand the compositional nature of images. In this work, we first identify that incorporating compositional attributes (e.g., a "green" tree frog) in the design of manual prompts can significantly enhance image-text alignment scores. Building upon this observation, we propose a novel and interpretable prompt-tuning method named IntCoOp, which learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning. To assess the effectiveness of our approach, we evaluate IntCoOp across two representative tasks in a few-shot learning setup: generalization to novel classes, and unseen domain shifts. Through extensive experiments across 10 downstream datasets on CLIP, we find that introducing attribute-level inductive biases leads to superior performance against state-of-the-art prompt tuning frameworks. Notably, in a 16-shot setup, IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
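The observation about compositional attributes can be illustrated with a small sketch of the two hand-crafted prompt forms being compared: one without an attribute and one with it. The helper name `build_prompt` is hypothetical and not from the paper. The attribute is passed in as a plain string here, whereas the paper obtains it from a BLIP-2 based VQA model.

```python
from typing import Optional


def build_prompt(classname: str, attribute: Optional[str] = None) -> str:
    """Return a hand-crafted prompt, optionally with a compositional attribute.

    Hypothetical helper for illustration only; it is not part of IntCoOp.
    """
    # Normalize dataset-style class names such as "tree_frog".
    name = classname.replace("_", " ").strip()
    if attribute:
        # Attribute-aware form: "A photo of a [a] [cls]".
        return f"A photo of a {attribute.strip()} {name}"
    # Plain form: "A photo of a [cls]".
    return f"A photo of a {name}"


plain = build_prompt("tree frog")
with_attribute = build_prompt("tree frog", "green")
print(plain)
print(with_attribute)
```

Both strings would then be encoded by the CLIP text encoder and scored against the image embedding. That step is omitted here because it requires model weights.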

Paper Structure (25 sections, 13 equations, 7 figures, 10 tables)

This paper contains 25 sections, 13 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: (a) Importance of learning interpretable concepts in prompts. Left: For each image, we design two prompt templates: (1) Without any compositional attribute "A photo of a $[cls]$" and (2) With compositional information "A photo of a $[a]$ $[cls]$" where $[cls]$ represents the classname and $[a]$ represents an attribute obtained using a BLIP-2 based VQA model. Right: The distribution plot highlights the importance of baking attribute information into the prompts. For this analysis, we used a CLIP model with a ViT-B/16 image encoder and a dataset consisting of $50$ images selected randomly from each of $1000$ classes in ImageNet-1k. The x-axis indicates the predicted CLIP score. Clearly, the CLIP model is more confident when the prompts include information related to the compositionality of the image. (b) Framework for obtaining attribute-level supervision. We present the overarching architecture for generating attribute labels $a$ for a given training image using a BLIP-2 VQA model.
  • Figure 2: Framework for learning compositional attributes. The figure elucidates the training framework of the attribute extractor network $\mathcal{A}$.
  • Figure 3: We measure the cosine similarity between the learned attribute embedding $\mathcal{A}(\mathcal{V}(\mathcal{I}))$ and the BLIP-2 generated label $a_{\mathcal{I}}$. A high cosine similarity indicates that $\texttt{Int}$CoOp effectively learns contextually relevant attributes.
  • Figure 4: $\texttt{Int}$CoOp generates relevant attributes during inference. We measure the cosine similarity between the prompt embeddings with the attribute information from $\texttt{Int}$CoOp and the prompt template "A photo of $[a]$ $[cls]$". We find that prompt embeddings from $\texttt{Int}$CoOp result in a higher cosine similarity with the hand-crafted prompt template.
  • Figure 5: CLIP Confidence plots. Distribution plots of CLIP confidence score across different datasets highlighting the importance of incorporating compositionality information into the prompts.
  • ...and 2 more figures
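
Figures 3 and 4 both report cosine similarity between embeddings. As a minimal sketch of that metric, the function below computes it in plain Python. The three vectors are made-up stand-ins, not outputs of CLIP, BLIP-2, or the attribute extractor $\mathcal{A}$, and the variable names are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy stand-ins (assumed values): a learned attribute embedding, the
# embedding of a matching attribute label, and an unrelated embedding.
learned_attribute = [0.9, 0.1, 0.4]
matching_label = [1.0, 0.0, 0.5]
unrelated_label = [-0.2, 1.0, 0.0]

sim_match = cosine_similarity(learned_attribute, matching_label)
sim_unrelated = cosine_similarity(learned_attribute, unrelated_label)
print(f"match: {sim_match:.3f}, unrelated: {sim_unrelated:.3f}")
```

A value near 1 indicates that the two embeddings point in nearly the same direction, which is the sense in which the captions read a high similarity as evidence of relevant attributes.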