GPT4Image: Large Pre-trained Models Help Vision Models Learn Better on Perception Task

Ning Ding, Yehui Tang, Zhongqian Fu, Chao Xu, Kai Han, Yunhe Wang

TL;DR

This work tackles the gap between high-capacity LLMs and task-specific vision models by proposing GPT4Image, a framework that uses multimodal LLMs to generate rich image descriptions and aligns their text embeddings with vision representations. The approach adds a cross-modal supervision signal via a contrastive loss, enabling CNNs and ViTs to benefit from the broad semantic knowledge encoded in LLMs without requiring expensive end-to-end LLM training. Empirical results on ImageNet-1K, CIFAR, and fine-grained datasets show consistent improvements across architectures, with notable gains on challenging tasks and evidence that class-conditioned prompts yield higher-quality descriptions and stronger alignment. The method offers a practical path for small teams to leverage LLM capabilities to boost perceptual tasks without large-scale LLM training, potentially accelerating real-world deployment of vision systems.

Abstract

The upsurge in pre-trained large models started by ChatGPT has swept across the entire deep learning community. Such powerful models demonstrate advanced generative ability and multimodal understanding capability, and have quickly set new state-of-the-art results on a variety of benchmarks. A pre-trained LLM usually plays the role of a universal AI model that can conduct various tasks such as article analysis and image comprehension. However, due to the prohibitively high memory and computational cost of implementing such a large model, conventional models (such as CNNs and ViTs) are still essential for many visual perception tasks. In this paper, we propose to enhance the representation ability of ordinary vision models on perception tasks (e.g., image classification) by taking advantage of off-the-shelf large pre-trained models. We present a new learning framework, dubbed GPT4Image, in which the knowledge of large pre-trained models is extracted to help CNNs and ViTs learn better representations and achieve higher performance. First, we curate a high-quality description set by prompting a multimodal LLM to generate descriptions for training images. Then, these detailed descriptions are fed into a pre-trained encoder to extract text embeddings that encode the rich semantics of the images. During training, the text embeddings serve as an extra supervisory signal and are aligned with the image representations learned by the vision models. The alignment process helps vision models achieve better performance with the aid of pre-trained LLMs. We conduct extensive experiments to verify the effectiveness of the proposed algorithm on various visual perception tasks for heterogeneous model architectures.
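
The alignment step described above can be illustrated with a small sketch. This section does not give the exact objective, so the code below assumes an InfoNCE-style contrastive term with temperature `tau`, added to the usual classification loss with weight `lam` (our reading of the $\tau$ and $\lambda$ named in the Figure 5 caption). All function names are hypothetical, and the paper's actual formulation may differ.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes v is not all zeros).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(logits, target):
    # Cross-entropy of one row of logits against an integer target,
    # computed with the log-sum-exp trick for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]

def alignment_loss(image_feats, text_embeds, tau):
    # Assumed InfoNCE-style term: image i should be most similar to the
    # embedding of its own description (index i) among all descriptions
    # in the batch. Similarities are cosine similarities divided by tau.
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_embeds]
    total = 0.0
    for i, img in enumerate(imgs):
        sims = [dot(img, t) / tau for t in txts]
        total += cross_entropy(sims, i)
    return total / len(imgs)

def total_loss(class_logits, labels, image_feats, text_embeds, lam, tau):
    # Assumed overall objective: classification loss + lam * alignment loss.
    ce = sum(cross_entropy(l, y) for l, y in zip(class_logits, labels))
    ce /= len(labels)
    return ce + lam * alignment_loss(image_feats, text_embeds, tau)

# Toy batch: two images with 2-D features and precomputed text embeddings.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
matched = [[2.0, 0.0], [0.0, 3.0]]   # description i points the same way as image i
swapped = [[0.0, 3.0], [2.0, 0.0]]   # descriptions assigned to the wrong images
class_logits = [[2.0, 0.5], [0.1, 1.5]]
labels = [0, 1]

loss_matched = total_loss(class_logits, labels, image_feats, matched, lam=0.5, tau=0.1)
loss_swapped = total_loss(class_logits, labels, image_feats, swapped, lam=0.5, tau=0.1)
print(loss_matched < loss_swapped)
```

In this sketch the text embeddings are fixed inputs, which mirrors the description of a pre-trained text encoder used only to supply supervision. Per the Figure 1 caption, only the vision model is used at inference, so the alignment term adds no inference cost.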

Paper Structure (28 sections, 8 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overall diagram of the proposed GPT4Image training framework. Conventional vision models (e.g. CNN and ViT) can learn better representations with the assistance of pre-trained LLMs. Only the model within the dashed black box will be used for inference.
  • Figure 2: Examples of descriptions generated on CIFAR100. Words in red are the corresponding class names.
  • Figure 3: Image descriptions generated by different prompts and different multimodal LLMs. Information richness regarding the target object differs significantly. Green box: description from BLIP-2 (Li et al., 2023). Yellow box: description from MiniGPT-4 (Zhu et al., 2023).
  • Figure 4: t-SNE visualization of the description embeddings extracted by the text encoder. Dots of the same color belong to the same category. Color bars indicate the corresponding category index. (a) Descriptions generated by BLIP-2 with a plain prompt. (b) Descriptions generated by MiniGPT-4 with a plain prompt. (c) Descriptions generated by MiniGPT-4 with a class-conditioned prompt. (d) Class names only, without using a multimodal LLM.
  • Figure 5: Hyper-parameter sensitivity analysis of $\lambda$ and $\tau$ on the ImageNet-1K dataset.