TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning

Jingjing Xie, Yuxin Zhang, Jun Peng, Zhaohong Huang, Liujuan Cao

TL;DR

This work tackles the coarse nature of many prompt-tuning approaches for vision-language models by introducing TextRefiner, a plug-and-play module that refines text prompts using internal visual knowledge from the image branch. It adds a local cache to store fine-grained visual attributes, a feature aggregation step to fuse local and global information, and a feature alignment module to map local features into the text space, all without external LLMs. Training combines standard prompt-tuning loss with a semantic alignment term and a regularization term, while inference relies on a compact, efficient matching between image features and refined text embeddings. Empirically, TextRefiner improves base-to-novel and cross-domain generalization, achieves competitive or superior performance to LLM-based approaches, and maintains high inference efficiency, making it a practical enhancement for VLM prompt tuning.

Abstract

Despite the efficiency of prompt learning in transferring vision-language models (VLMs) to downstream tasks, existing methods mainly learn the prompts in a coarse-grained manner where the learned prompt vectors are shared across all categories. Consequently, the tailored prompts often fail to discern class-specific visual concepts, thereby hindering the transferred performance for classes that share similar or complex visual attributes. Recent advances mitigate this challenge by leveraging external knowledge from Large Language Models (LLMs) to furnish class descriptions, but this incurs notable inference costs. In this paper, we introduce TextRefiner, a plug-and-play method to refine the text prompts of existing methods by leveraging the internal knowledge of VLMs. Specifically, TextRefiner builds a novel local cache module to encapsulate fine-grained visual concepts derived from local tokens within the image branch. By aggregating and aligning the cached visual descriptions with the original output of the text branch, TextRefiner can efficiently refine and enrich the learned prompts from existing methods without relying on any external expertise. For example, it improves the performance of CoOp from 71.66% to 76.94% on 11 benchmarks, surpassing CoCoOp, which introduces instance-wise features for text prompts. Equipped with TextRefiner, PromptKD achieves state-of-the-art performance and is efficient in inference. Our code is released at https://github.com/xjjxmu/TextRefiner

Paper Structure

This paper contains 27 sections, 13 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of different paradigms for enriching text prompts. Previous methods (left) introduced external knowledge experts to furnish fine-grained descriptions of each class, necessitating an extra filtering process to maintain alignment with downstream datasets. In contrast, our proposed TextRefiner (right) leverages the internal knowledge of the image branch to supply fine-grained, localized region information, thereby drastically reducing the inference overhead while maintaining the performance.
  • Figure 2: The framework of TextRefiner, which is composed of a local cache, feature aggregation, and feature alignment. Each item in the local cache can be regarded as an attribute prior that is updated by local tokens from the image branch. The textual class embedding can therefore obtain the corresponding visual attributes by querying this cache.
  • Figure 3: Comparison of inference efficiency among existing methods on the ImageNet dataset. Our TextRefiner is more efficient than LLaMP, which relies on external knowledge experts to furnish fine-grained descriptions of each class.
  • Figure 4: Ablation study on $M$ in $\textbf{A}$.
  • Figure 5: Ablation study on the aggregation coefficient in Eq. \ref{eq:agg}.
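
The Figure 2 caption describes a cache of attribute priors that is updated by local image tokens and queried by class text embeddings. The paper's exact update and aggregation equations are not reproduced on this page, so the following is only a minimal sketch under stated assumptions: the cache update is taken to be a nearest-entry moving average, the query is taken to be softmax attention over the M cache entries, and the aggregation is taken to be a convex blend with coefficient `alpha`. All function names, `momentum`, and `alpha` are hypothetical, and the feature alignment module is omitted (local tokens are assumed to already live in the text space).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def update_cache(cache, local_tokens, momentum=0.9):
    """ASSUMED update rule: each local token moves its most similar
    cache entry toward itself with an exponential moving average."""
    new_cache = [list(entry) for entry in cache]
    for tok in local_tokens:
        sims = [dot(normalize(tok), normalize(e)) for e in new_cache]
        k = sims.index(max(sims))
        new_cache[k] = [momentum * c + (1 - momentum) * t
                        for c, t in zip(new_cache[k], tok)]
    return new_cache


def refine_text_embedding(text_emb, cache, alpha=0.5):
    """ASSUMED aggregation: attend over the M cache entries with the class
    embedding as query, then blend the result into the embedding."""
    weights = softmax([dot(text_emb, e) for e in cache])
    attr = [sum(w * e[i] for w, e in zip(weights, cache))
            for i in range(len(text_emb))]
    return [alpha * t + (1 - alpha) * a for t, a in zip(text_emb, attr)]


def classify(image_feat, class_embs):
    """Pick the class whose (refined) text embedding has the highest
    cosine similarity with the global image feature."""
    img = normalize(image_feat)
    scores = [dot(img, normalize(c)) for c in class_embs]
    return scores.index(max(scores))


# Toy usage with M = 2 cache entries of dimension 2.
cache = update_cache([[1.0, 0.0], [0.0, 1.0]], [[0.0, 2.0]], momentum=0.5)
refined = [refine_text_embedding(t, cache) for t in ([1.0, 0.0], [0.0, 1.0])]
predicted = classify([1.0, 0.1], refined)
```

In this sketch the cache is shared by all classes and the refined embeddings can be computed once before inference, which is one plausible reading of why no external LLM call is needed at test time.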