Data-free Multi-label Image Recognition via LLM-powered Prompt Tuning

Shuo Yang, Zirui Shang, Yongqi Wang, Derong Deng, Hongwei Chen, Qiyuan Cheng, Xinxiao Wu

Abstract

This paper proposes a novel framework for multi-label image recognition that requires no training data, called the data-free framework. It uses the knowledge of a pre-trained Large Language Model (LLM) to learn prompts that adapt a pre-trained Vision-Language Model (VLM) such as CLIP to multi-label classification. By asking the LLM well-designed questions, we acquire comprehensive knowledge about the characteristics and contexts of objects, which provides valuable text descriptions for learning prompts. We then propose a hierarchical prompt learning method that takes multi-label dependency into account: a subset of category-specific prompt tokens is shared when the corresponding objects exhibit similar attributes or are likely to co-occur. Owing to the strong alignment between visual and linguistic semantics in CLIP, the hierarchical prompts learned from text descriptions can be applied to classify images at inference time. Our framework presents a new way to explore the synergies between multiple pre-trained models for novel category recognition. Extensive experiments on three public datasets (MS-COCO, VOC2007, and NUS-WIDE) demonstrate that our method achieves better results than state-of-the-art methods, in particular outperforming zero-shot multi-label recognition methods by 4.7% in mAP on MS-COCO.
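
The token-sharing idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the three-level layout (tokens shared by all classes, by a group, and by a single class), the hypothetical `group_of_class` assignment, and the mean-pooling `toy_text_encoder` that stands in for CLIP's frozen text encoder are all assumptions made for illustration. The sketch only shows how classes in the same group can reuse part of their prompt, and how the resulting class embeddings score an input feature by cosine similarity.

```python
import numpy as np

def build_hierarchical_prompts(shared_tokens, group_tokens, class_tokens, group_of_class):
    """Concatenate shared, group-level, and class-specific tokens for each class.

    shared_tokens:  (n_shared, dim), reused by every class (assumed level)
    group_tokens:   (n_groups, n_group, dim), reused by classes in the same group
    class_tokens:   (n_classes, n_class, dim), unique to each class
    group_of_class: group index per class (hypothetical grouping)
    """
    prompts = [
        np.concatenate([shared_tokens, group_tokens[g], class_tokens[c]], axis=0)
        for c, g in enumerate(group_of_class)
    ]
    return np.stack(prompts)

def toy_text_encoder(prompts):
    """Stand-in for a frozen text encoder: mean-pool tokens, then L2-normalize."""
    pooled = prompts.mean(axis=1)
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

def multilabel_scores(features, class_embeddings):
    """Cosine similarity between each input feature and each class embedding."""
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    return features @ class_embeddings.T

rng = np.random.default_rng(0)
dim, n_classes = 8, 4
group_of_class = [0, 0, 1, 1]                    # classes 0,1 share a group; so do 2,3
shared = rng.normal(size=(2, dim))               # 2 tokens shared by all classes
groups = rng.normal(size=(2, 3, dim))            # 3 tokens per group
classes = rng.normal(size=(n_classes, 4, dim))   # 4 tokens per class

prompts = build_hierarchical_prompts(shared, groups, classes, group_of_class)
class_emb = toy_text_encoder(prompts)

# Features could come from text descriptions (training) or images (inference),
# since both live in the same embedding space in a CLIP-like model.
features = rng.normal(size=(5, dim))
scores = multilabel_scores(features, class_emb)
print(prompts.shape, scores.shape)
```

In a real setup the prompt tokens would be learnable parameters optimized against embeddings of the LLM-generated descriptions, and the same scoring function would then be applied to image features.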

Paper Structure

This paper contains 19 sections, 18 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Illustration of different ways to handle novel categories. (a) Traditional methods train on base categories but fail on novel categories. (b) Recent prompting methods successfully adapt the VLM to novel categories but need annotated data for prompt tuning. (c) Our data-free framework adapts the VLM to novel categories through prompt tuning alone, guided by an LLM.
  • Figure 2: Overview of our framework.
  • Figure 3: An example of the designed questions and their corresponding answers from ChatGLM. More detailed examples can be found in the supplementary materials.
  • Figure 4: Results on MS-COCO. (a) Analysis of the effect of the number of tokens. (b) Analysis of the effect of the weight between global and local prompts, i.e., $\lambda_2$ in Eq. (\ref{eq:inference}).
  • Figure 5: Visualization of the top-3 predicted categories by different prompts.