Diverse and Tailored Image Generation for Zero-shot Multi-label Classification

Kaixin Zhang, Zhixiang Yuan, Tao Huang

TL;DR

This work tackles zero-shot multi-label classification by generating tailored synthetic data for unseen labels using diffusion models guided by large-language-model prompts. A CLIP-based discriminator filters generated images to ensure correct multi-label content, while a CLIP-informed, class-discriminative text-encoder fine-tuning improves generation quality. A plug-in global feature fusion module adapts the visual encoder to the multi-label domain without eroding pre-trained knowledge. Empirical results on MS-COCO and NUS-WIDE show consistent gains over state-of-the-art methods in both ZSL and GZSL settings, validating the effectiveness of diffusion-based data augmentation, prompt diversification, and feature adaptation for ZS-MLC.

Abstract

Recently, zero-shot multi-label classification has garnered considerable attention for its capacity to make predictions on unseen labels without human annotations. Nevertheless, prevailing approaches often use seen classes as imperfect proxies for unseen ones, resulting in suboptimal performance. Drawing inspiration from the success of text-to-image generation models in producing realistic images, we propose an innovative solution: generating synthetic data to construct a training set explicitly tailored for proxyless training on unseen labels. Our approach introduces a novel image generation framework that produces multi-label synthetic images of unseen classes for classifier training. To enhance diversity in the generated images, we leverage a pre-trained large language model to generate diverse prompts. Employing a pre-trained multi-modal CLIP model as a discriminator, we assess whether the generated images accurately represent the target classes. This enables automatic filtering of inaccurately generated images, preserving classifier accuracy. To refine text prompts for more precise and effective multi-label object generation, we introduce a CLIP score-based discriminative loss to fine-tune the text encoder in the diffusion model. Additionally, to enhance visual features on the target task while maintaining the generalization of original features and mitigating catastrophic forgetting resulting from fine-tuning the entire visual encoder, we propose a feature fusion module inspired by transformer attention mechanisms. This module aids in capturing global dependencies between multiple objects more effectively. Extensive experimental results validate the effectiveness of our approach, demonstrating significant improvements over state-of-the-art methods.
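
The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes image and label-text embeddings have already been produced by a CLIP-like model, and the function names (`cosine`, `keep_image`, `filter_images`), the threshold of 0.25, and the rule that every target label must pass are all hypothetical choices made for illustration. The toy three-dimensional vectors stand in for real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def keep_image(image_emb, label_embs, target_labels, threshold=0.25):
    """Keep an image only if every target label is similar enough to it.

    The threshold and the all-labels-must-pass rule are assumptions,
    not values taken from the paper.
    """
    return all(
        cosine(image_emb, label_embs[name]) >= threshold
        for name in target_labels
    )

def filter_images(samples, label_embs, threshold=0.25):
    """Return the samples whose embeddings match all of their target labels."""
    return [
        s for s in samples
        if keep_image(s["emb"], label_embs, s["labels"], threshold)
    ]

# Toy stand-ins for CLIP text embeddings of two unseen labels (assumed values).
label_embs = {"dog": [1.0, 0.0, 0.0], "bike": [0.0, 1.0, 0.0]}

# Two synthetic images prompted with both labels; in "b" the bike is missing,
# so its embedding has almost no component along the "bike" direction.
samples = [
    {"id": "a", "emb": [0.7, 0.7, 0.1], "labels": ["dog", "bike"]},
    {"id": "b", "emb": [0.9, 0.05, 0.4], "labels": ["dog", "bike"]},
]

kept = filter_images(samples, label_embs)
kept_ids = [s["id"] for s in kept]
```

In this toy setup image "b" is discarded because its similarity to "bike" falls below the threshold, which mirrors the missing-object failure that the discriminator is meant to catch. In practice the threshold would need to be tuned on real CLIP scores.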

Paper Structure (18 sections, 10 equations, 8 figures, 7 tables, 2 algorithms)

Figures (8)

  • Figure 1: The structure of our image generation framework
  • Figure 2: Examples of synthetic images. (a) Synthetic images generated with fixed prompts as guidance. (b) Synthetic images generated with augmented prompts as guidance. (c) The missing-object phenomenon persists in images generated by the diffusion model
  • Figure 3: The process of directing the large language model to create augmented prompts
  • Figure 4: The structure of our model. Based on DualCoOp, we introduce a global feature fusion (GFF) module that is combined with 3x3 convolutional layers in the visual encoder
  • Figure 5: Illustration of the global feature fusion module
  • ...and 3 more figures