Supporting Vision-Language Model Inference with Confounder-pruning Knowledge Prompt

Jiangmeng Li; Wenyi Mo; Wenwen Qiang; Bing Su; Changwen Zheng; Hui Xiong; Ji-Rong Wen

TL;DR

This paper tackles the limited effectiveness of traditional prompts in vision-language models by injecting label-relevant semantic information derived from ontological knowledge graphs. It introduces CPKP, a semantic-aware prompting framework with a double-tier confounder-pruning mechanism (graph-tier and feature-tier) to prune task-irrelevant and redundant information, respectively. Through ontology-enhanced embeddings and learnable prompts in SPE and SHR forms, CPKP achieves state-of-the-art performance in few-shot and domain generalization across 11 datasets and multiple backbones, while demonstrating robustness to distribution shifts. The approach offers a principled, knowledge-grounded path to more transferable vision-language prompt design with practical implications for open-set recognition tasks.

Abstract

Vision-language models are pre-trained by aligning image-text pairs in a common space to deal with open-set visual concepts. To boost the transferability of the pre-trained models, recent works adopt fixed or learnable prompts, i.e., classification weights are synthesized from natural language describing task-relevant categories, to reduce the gap between tasks in the training and test phases. However, how and what prompts can improve inference performance remains unclear. In this paper, we explicitly clarify the importance of including semantic information in prompts, whereas existing prompting methods generate prompts without exploring the semantic information of textual labels. Manually constructing prompts with rich semantics requires domain expertise and is extremely time-consuming. To address this issue, we propose a semantic-aware prompt learning method, namely CPKP, which retrieves an ontological knowledge graph by treating the textual label as a query to extract task-relevant semantic information. CPKP further introduces a double-tier confounder-pruning procedure to refine the derived semantic information. The graph-tier confounders are gradually identified and phased out, inspired by the principle of Granger causality. The feature-tier confounders are eliminated by following the maximum entropy principle in information theory. Empirically, the evaluations demonstrate the effectiveness of CPKP: for example, with two shots, CPKP outperforms the manual-prompt method by 4.64% and the learnable-prompt method by 1.09% on average. CPKP also shows superior domain generalization compared to benchmark approaches. Our implementation is available at https://github.com/Mowenyii/CPKP.
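
The abstract's remark that classification weights are synthesized from natural language can be illustrated with a minimal sketch of CLIP-style prompt-based inference. This is not the CPKP implementation: the feature vectors below are toy stand-ins for the outputs of an image encoder and a text encoder, and the names `zero_shot_probs` and `temperature` (with its value of 0.01) are assumptions made for illustration only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so that dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_feature, prompt_features, temperature=0.01):
    """Class probabilities from the cosine similarity between one image
    feature and one prompt (text) feature per class.

    Each prompt feature plays the role of a classification weight vector.
    The temperature value is an assumed constant, not taken from the paper.
    """
    image = l2_normalize(image_feature)
    logits = []
    for text_feature in prompt_features:
        text = l2_normalize(text_feature)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(cosine / temperature)
    return softmax(logits)


# Toy stand-ins for encoder outputs: one prompt feature per class.
prompt_features = [
    [1.0, 0.0, 0.0],  # hypothetical class 0
    [0.0, 1.0, 0.0],  # hypothetical class 1
    [0.0, 0.0, 1.0],  # hypothetical class 2
]
image_feature = [0.9, 0.1, 0.2]

probs = zero_shot_probs(image_feature, prompt_features)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

In this view, changing the prompt changes only the rows of `prompt_features`, which is why the choice of prompt content directly affects the classifier.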

Paper Structure (40 sections, 19 equations, 10 figures, 7 tables, 2 algorithms)

Introduction
Related Work
Vision-Language Models
Prompt Design
Knowledge Graph
Preliminaries
Vision-Language Pre-training
Architecture
Training
Inference
Graph Representation Learning
Graph Setup
Graph Neural Network
Methodology
Learnable Knowledge Prompt
...and 25 more sections

Figures (10)

Figure 1: Comparison of different prompt generation paradigms. (a) The paradigm of using prompts with fixed templates (DBLP:conf/eacl/SchickS21, DBLP:conf/icml/RadfordKHRGASAM21). (b) The learning paradigms of recent benchmark works, which fall into two major categories: the upper paradigm adopts a certain number of updatable tokens to generate adaptive prompts, and the tokens are learned during training (DBLP:journals/corr/abs-2109-01134, DBLP:journals/corr/abs-2112-01518); the lower paradigm uses the same fixed-template prompt as in (a), but further injects an adapter after the fixed text encoder of the pre-trained vision-language model, and the adapter is trainable during inference on downstream tasks, which covers both adapter training and prediction (DBLP:journals/corr/abs-2110-04544, DBLP:journals/corr/abs-2112-01518). (c) The learning paradigm of our method, which directly learns a prompt from the labels by leveraging the effective semantic information from an ontological knowledge graph.
Figure 2: Comparison of different prompt forms for CLIP. We conduct zero-shot inference experiments and show the results in the histogram, where grey bars denote the prompt without semantic information that CLIP uses, brown bars denote the prompt with simple coarse-grained semantic information, and purple bars denote the prompt that uses more words to describe similar semantic information. We observe that both the semantic prompt and the longer semantic prompt improve the performance of CLIP. In contrast, the improvement of the longer semantic prompt over the semantic prompt is limited, which suggests that the improvement in CLIP's performance comes from the added semantic information rather than from simply adding more words. See Appendix 4 for the detailed performance gap between CPKP and CLIP using manual prompts.
Figure 3: The architecture of CPKP. The intuition behind our method is to directly learn a prompt with label-related semantic information rather than adopting a fixed prompt template, which is achieved by introducing refined knowledge from an external knowledge graph. To this end, CPKP consists of two stages: 1) ontology-enhanced knowledge embedding derives the label-related subgraph from an ontological knowledge graph by using the label token as a query; 2) double-tier confounder-pruning removes the task-irrelevant and redundant information from graph representations.
Figure 4: An example of the rationale of the graph-tier confounder-pruning for graph representations. We refine the derived knowledge subgraph $\boldsymbol{G}_i$ by pruning the edges that are causally decoupled from the downstream task. We determine whether a relation type $\boldsymbol{r}_m$ is predictive of the graph by iteratively removing the edges associated with $\boldsymbol{r}_m$ and then checking how much the result, which is computed by following a specific graph rule, changes. Only causally related edges are kept, and the others are pruned. Note that the graph encoder $f^G \left( \cdot \right)$ is fixed throughout the process.
Figure 5: Visualization of the representations learned by variants of our method on ImageNet: 1) CPKP w/ FTCP denotes the proposed method using the feature-tier confounder-pruning technique; 2) CPKP w/o FTCP denotes the proposed method without the feature-tier confounder-pruning. Concretely, the learned prompt feature representations are projected into an RGB-styled color image. Different colors denote different types of information in the features. The horizontal axis shows the feature dimensions, and the vertical axis shows the categories. The more the colors differ, the less similar the feature dimensions are. The two left plots show the contributions of the dimensions to the classification of a specific category, and the right plots show the similarities between feature dimensions. We observe that the feature-tier confounder-pruning technique effectively reduces the redundancy of the learned representations.
...and 5 more figures
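
The removal test described in the Figure 4 caption (drop all edges of one relation type, then check how much the result changes) can be sketched as follows. This is a simplified illustration, not the authors' algorithm: the toy `score_fn` stands in for the fixed graph encoder and the paper's graph rule, and `prune_by_relation_type`, `threshold`, the relation names, and the weights are all hypothetical.

```python
def prune_by_relation_type(edges, score_fn, threshold):
    """Drop every relation type whose removal barely changes the score.

    edges: list of (head, relation, tail) triples.
    score_fn: fixed scoring function over an edge list (assumed stand-in
        for a frozen graph encoder plus a task-specific rule).
    threshold: minimum absolute score change for a relation type to be kept.
    """
    kept = list(edges)
    for relation in sorted({rel for _, rel, _ in edges}):
        without = [edge for edge in kept if edge[1] != relation]
        change = abs(score_fn(kept) - score_fn(without))
        if change < threshold:
            # Removing this relation type does not affect the result,
            # so its edges are treated as confounders and pruned.
            kept = without
    return kept


# Hypothetical relation weights used only by the toy scoring function.
RELATION_WEIGHTS = {"is_a": 1.0, "has_part": 0.5, "located_in": 0.0}


def toy_score(edges):
    """Toy score: sum of per-relation weights over all edges."""
    return sum(RELATION_WEIGHTS[rel] for _, rel, _ in edges)


subgraph = [
    ("cat", "is_a", "mammal"),
    ("cat", "has_part", "tail"),
    ("cat", "located_in", "house"),
    ("mammal", "is_a", "animal"),
]

pruned = prune_by_relation_type(subgraph, toy_score, threshold=0.1)
```

Because the scoring function is never updated inside the loop, the sketch mirrors the caption's note that the graph encoder stays fixed while edges are removed.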
