The Solution for Language-Enhanced Image New Category Discovery

Haonan Xu, Dian Chao, Xiangyu Wu, Zhonghua Wan, Yang Yang

TL;DR

The paper addresses a limitation of text-only prompts: textual labels alone cannot capture the full visual diversity of objects in zero-shot multi-label recognition. It reverses the CLIP training process to learn per-class pseudo visual prompts (PVP) from large-scale, LLM-generated sentence data, storing the visual information aligned in CLIP in class-specific visual prompts. This visual knowledge is then transferred to the textual prompts via contrastive learning, aided by a dual-adapter module that combines original CLIP knowledge with knowledge learned from downstream datasets. The authors report that the method surpasses the state of the art on both clean annotated text data and LLM-generated pseudo text data, reducing reliance on annotated images.

Abstract

Treating texts as images, combining prompts with textual labels for prompt tuning, and leveraging the alignment properties of CLIP have all been applied successfully to zero-shot multi-label image recognition. Nonetheless, relying solely on textual labels to store visual information is insufficient for representing the diversity of visual objects. In this paper, we propose reversing the training process of CLIP and introduce the concept of Pseudo Visual Prompts. These prompts are initialized for each object category and pre-trained on large-scale, low-cost sentence data generated by large language models. This process mines the aligned visual information in CLIP and stores it in class-specific visual prompts. We then employ contrastive learning to transfer the stored visual information to the textual labels, enhancing their visual representation capacity. Additionally, we introduce a dual-adapter module that simultaneously leverages knowledge from the original CLIP and newly learned knowledge derived from downstream datasets. Benefiting from the pseudo visual prompts, our method surpasses the state of the art not only on clean annotated text data but also on pseudo text data generated by large language models.
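
The contrastive transfer step described in the abstract can be illustrated with a minimal sketch. It assumes an InfoNCE-style objective in which the text prompt of class i is pulled toward the pseudo visual prompt of the same class and pushed away from the prompts of other classes. The function name `info_nce`, the `temperature` value, and the plain-list embeddings are illustrative assumptions, not details taken from the paper.

```python
import math

def info_nce(visual_prompts, text_prompts, temperature=0.07):
    """InfoNCE-style loss (assumed form): text prompt i should be most
    similar to visual prompt i among all class visual prompts."""
    def normalize(v):
        norm = math.sqrt(sum(a * a for a in v))
        return [a / norm for a in v]

    V = [normalize(v) for v in visual_prompts]
    T = [normalize(t) for t in text_prompts]
    loss = 0.0
    for i, t in enumerate(T):
        # Cosine similarity to every visual prompt, scaled by temperature.
        logits = [sum(a * b for a, b in zip(t, v)) / temperature for v in V]
        # Numerically stable log-sum-exp over all classes.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching class as the target.
        loss += log_denom - logits[i]
    return loss / len(T)

# Toy 2-class example with hypothetical 2-D embeddings.
aligned = info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])
misaligned = info_nce([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

In this toy setup the loss is close to zero when each text prompt matches its own class's visual prompt and large when the pairs are swapped, which is the behaviour the transfer step relies on.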

Paper Structure

This paper contains 12 sections, 17 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: (a) The NoCaps dataset, which always includes common objects such as animals, plants, and furniture. (b)(c) The NICE Challenge dataset, which includes many novel visual concepts and various image types, such as famous historic, cultural, and graphic images.
  • Figure 2: Learning and transfer process of the Pseudo Visual Prompt, where human-annotated texts or pseudo texts generated by an LLM are used to train the prompts. (a) During learning, we design identical class-specific visual prompts for each target category. The global text feature and the object visual features are obtained from the frozen CLIP image and text encoders. The cosine similarity between the corresponding embeddings is guided by the derived pseudo labels through a ranking loss. (b) During transfer, we perform contrastive learning between the trained pseudo visual prompts and the text prompts to enhance the ability of the text labels to represent visual diversity. The final classification results are obtained by merging the scores of the two branches.
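
The learning step and the final score merge described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the ranking loss is taken to be a pairwise hinge loss, and the two branches are merged by a weighted average. The exact loss form, the `margin` and `weight` parameters, and all function names are hypothetical and not given in the text above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranking_loss(scores, labels, margin=1.0):
    """Pairwise hinge ranking loss (assumed form): every class with
    pseudo label 1 should outscore every class with pseudo label 0
    by at least `margin`."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(max(0.0, margin - p + n) for p in pos for n in neg)

def merge_scores(visual_scores, text_scores, weight=0.5):
    """Weighted average of the two branch scores (assumed merge rule)."""
    return [weight * v + (1 - weight) * t
            for v, t in zip(visual_scores, text_scores)]

# Toy example: one sentence feature scored against three class prompts.
text_feature = [1.0, 0.0]
class_prompts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical embeddings
pseudo_labels = [1, 0, 0]  # only the first class is mentioned in the sentence

scores = [cosine(text_feature, p) for p in class_prompts]
loss = ranking_loss(scores, pseudo_labels)
merged = merge_scores(scores, [0.9, 0.1, 0.2])
```

Here the loss is nonzero only because the third prompt scores within the margin of the positive class. A training step would update the prompts to widen that gap.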