TAI++: Text as Image for Multi-Label Image Classification by Co-Learning Transferable Prompt

Xiangyu Wu, Qing-Yuan Jiang, Yang Yang, Yi-Feng Wu, Qing-Guo Chen, Jianfeng Lu

TL;DR

This paper tackles multi-label image classification with limited reliance on labeled visual data. It introduces a pseudo-visual prompt (PVP) that learns diverse visual knowledge through CLIP's aligned image-text space. A transferable prompt co-learning framework with a dual-adapter module then transfers visual cues from the PVP to text prompts via contrastive and ranking losses, and two sources of training text (hand-annotated and LLM-generated) are supported. The approach reports state-of-the-art zero-shot results and competitive few-shot and partial-label results on VOC2007, MS-COCO, and NUSWIDE, and the code is publicly available. By decoupling the learning of visual diversity from large labeled image datasets, the method offers a scalable route to multi-label recognition in data-constrained settings.

Abstract

The recent introduction of prompt tuning based on pre-trained vision-language models has dramatically improved the performance of multi-label image classification. However, existing strategies still have drawbacks: they either exploit massive labeled visual data at a high cost, or use text data only for text prompt tuning and thus fail to learn the diversity of visual knowledge. Hence, the application scenarios of these methods are limited. In this paper, we propose a pseudo-visual prompt (PVP) module for implicit visual prompt tuning to address this problem. Specifically, we first learn the pseudo-visual prompt for each category, mining diverse visual knowledge via the well-aligned space of pre-trained vision-language models. Then, a co-learning strategy with a dual-adapter module is designed to transfer visual knowledge from the pseudo-visual prompt to the text prompt, enhancing their visual representation abilities. Experimental results on the VOC2007, MS-COCO, and NUSWIDE datasets demonstrate that our method can surpass state-of-the-art (SOTA) methods across various settings for multi-label image classification tasks. The code is available at https://github.com/njustkmg/PVP.
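
As a rough illustration of the co-learning step, the sketch below pairs each category's pseudo-visual prompt embedding with that category's text prompt embedding and applies an InfoNCE-style contrastive loss, so that matching pairs score higher than mismatched ones. The loss form, the temperature of 0.07, and the names `cosine` and `contrastive_loss` are assumptions made for illustration; the abstract does not specify them, and the dual-adapter module is left out entirely.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(pvp_embs, text_embs, temperature=0.07):
    """InfoNCE-style loss over per-category embedding pairs (assumed form).

    pvp_embs[i] and text_embs[i] are the embeddings for category i.
    For each category, the matching text prompt is the positive and
    the text prompts of all other categories are the negatives.
    """
    n = len(pvp_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(pvp_embs[i], text_embs[k]) / temperature
                  for k in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability of the matching pair.
        total += log_denom - logits[i]
    return total / n


# Toy 2-D embeddings for two categories (hypothetical values).
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                           [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                           [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this toy setup the loss is close to zero when each pseudo-visual prompt lines up with its own text prompt and large when the pairs are swapped, which is the behavior a contrastive objective is meant to encourage.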

Paper Structure (13 sections, 12 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Different prompt tuning paradigms for multi-label image recognition. (a) Tuning the text prompt with labeled visual data. (b) Tuning the text prompt with labeled text data. (c) Tuning the text prompt with texts generated by LLMs. (d) Testing with images and the text prompt for image recognition.
  • Figure 2: Pseudo-Visual Prompt Learning and Transferable Prompt Co-Learning. Sub-figure (a) presents the class-specific pseudo-visual prompt module. The global text embedding and pseudo-visual prompt embedding are obtained from the frozen CLIP image and text encoders. The cosine similarity between the embeddings is guided by the noun-filtered labels through a ranking loss. Sub-figure (b) presents the transferable prompt co-learning module. We perform contrastive learning between the pseudo-visual prompt and the text prompt to enhance the prompts' ability to represent visual diversity.
  • Figure 3: Results for the few-shot setting, where PVP*/TAI-DPT* denote the integration of the predictions of PVP/TAI-DPT with those of CoOp.
  • Figure 4: Visualization of PVP and TAI-DPT methods.
  • Figure 5: mAP with different amounts of text data on the MS-COCO dataset.
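
The objective described for Figure 2(a), where cosine similarities between a text embedding and the class-specific pseudo-visual prompts are guided by noun-filtered labels with a ranking loss, can be sketched as follows. This is a minimal pairwise hinge formulation under stated assumptions: the margin of 1.0, the exact loss form, and the names `cosine` and `ranking_loss` are illustrative and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_loss(text_emb, class_prompts, labels, margin=1.0):
    """Pairwise hinge ranking loss (assumed form, assumed margin).

    text_emb:      embedding of one training text
    class_prompts: one pseudo-visual prompt embedding per category
    labels:        1 if the category's noun appears in the text, else 0

    Every labeled (positive) category should score at least `margin`
    higher than every unlabeled (negative) category.
    """
    sims = [cosine(text_emb, p) for p in class_prompts]
    pos = [s for s, y in zip(sims, labels) if y == 1]
    neg = [s for s, y in zip(sims, labels) if y == 0]
    return sum(max(0.0, margin - sp + sn) for sp in pos for sn in neg)


# Toy example with two categories (hypothetical 2-D embeddings).
prompts = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]
print(ranking_loss(text, prompts, labels=[1, 0]))  # correct ranking
print(ranking_loss(text, prompts, labels=[0, 1]))  # inverted ranking
```

The loss vanishes once every labeled category outranks every unlabeled one by the margin, so only mis-ranked pairs would contribute gradient during prompt learning.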