Can Synthetic Images Serve as Effective and Efficient Class Prototypes?

Dianxing Shi, Dingjie Fu, Yuqiao Liu, Jun Wang

TL;DR

This work tackles the inefficiency of Vision-Language Models that rely on annotated image-text pairs and heavy dual-encoder architectures for zero-shot classification. It proposes LGCLIP, a training-free framework where a Large Language Model generates prompts to guide a diffusion model in creating visual prototypes for each class, and a lightweight visual encoder compares real images to these prototypes via contrastive prediction. The approach includes a training-free, multi-scale feature extraction module and demonstrates consistent gains across six diverse datasets and multiple CLIP backbones, highlighting robustness and potential to reduce annotation and computation costs. Overall, LGCLIP presents a novel paradigm that leverages generative modeling to produce visual prototypes for efficient, zero-shot classification without text-image alignment during inference.

Abstract

Vision-Language Models (VLMs) have shown strong performance in zero-shot image classification tasks. However, existing methods, including Contrastive Language-Image Pre-training (CLIP), all rely on annotated image-text pairs to align the visual and textual modalities. This dependency makes high-quality datasets costly to prepare and places strict demands on annotation accuracy. In addition, processing data from two modalities requires dual-tower encoders in most models, which makes them hard to keep lightweight. To address these limitations, we introduce the "Contrastive Language-Image Pre-training via Large-Language-Model-based Generation (LGCLIP)" framework. LGCLIP leverages a Large Language Model (LLM) to generate class-specific prompts that guide a diffusion model in synthesizing reference images. These generated images then serve as visual prototypes: the visual features of real images are extracted and compared with the visual features of the prototypes to predict the class by comparison. By optimizing prompt generation through the LLM and employing only a visual encoder, LGCLIP remains lightweight and efficient. Crucially, our framework requires only class labels as input throughout the entire experimental procedure, eliminating the need for manually annotated image-text pairs and extra pre-processing. Experimental results validate the feasibility and efficiency of LGCLIP, demonstrating strong performance in zero-shot classification tasks and establishing a novel paradigm for classification.
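
The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes that features for the synthetic prototype images and for the real images have already been extracted by some visual encoder; the encoder, the LLM prompting, and the diffusion step are not shown. The function names `build_prototypes` and `classify`, the averaging of several synthetic images per class, and the use of cosine similarity are illustrative assumptions, not details confirmed by the abstract.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def build_prototypes(synthetic_features):
    """Build one prototype vector per class (hypothetical helper).

    synthetic_features maps a class name to an array of shape
    (n_images, dim) holding features of that class's synthetic images.
    Averaging the normalized features is an assumption of this sketch.
    """
    names = sorted(synthetic_features)
    protos = np.stack([
        l2_normalize(l2_normalize(synthetic_features[name]).mean(axis=0))
        for name in names
    ])
    return names, protos


def classify(image_features, names, protos):
    """Assign each real image to the class of its most similar prototype."""
    sims = l2_normalize(image_features) @ protos.T  # cosine similarities
    return [names[i] for i in sims.argmax(axis=1)]


# Toy 2-D features standing in for encoder outputs (made-up numbers).
synthetic = {
    "cat": np.array([[1.0, 0.1], [0.9, 0.0]]),
    "dog": np.array([[0.0, 1.0], [0.1, 0.9]]),
}
real = np.array([[0.8, 0.2], [0.1, 0.7]])

names, protos = build_prototypes(synthetic)
print(classify(real, names, protos))  # prints ['cat', 'dog']
```

Only image features appear in this sketch; no text encoder is involved at prediction time, which is the property the abstract emphasizes.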

Paper Structure

This paper contains 14 sections, 7 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Motivation illustration. (a) We introduce our motivation with a simple question: "How do humans classify an object?" (b) When generating images from text, diffusion models are prone to two failure modes: i) wrong categories, ii) poor composition. Under the guidance of an LLM, these failures can be reduced without supervision. (c) The process and limitations of image classification methods under the traditional VLM paradigm. (d) The image classification method we propose and its advantages.
  • Figure 2: Our proposed LGCLIP workflow.
  • Figure 3: Error statistics: the final error entries and category ratios for each dataset. When a prompt error and an image error overlap, the case is counted only once.