GPT4Image: Large Pre-trained Models Help Vision Models Learn Better on Perception Task
Ning Ding, Yehui Tang, Zhongqian Fu, Chao Xu, Kai Han, Yunhe Wang
TL;DR
This work tackles the gap between high-capacity LLMs and task-specific vision models by proposing GPT4Image, a framework that uses multimodal LLMs to generate rich image descriptions and aligns their text embeddings with vision representations. The approach adds a cross-modal supervision signal via a contrastive loss, enabling CNNs and ViTs to benefit from the broad semantic knowledge encoded in LLMs without requiring expensive end-to-end LLM training. Empirical results on ImageNet-1K, CIFAR, and fine-grained datasets show consistent improvements across architectures, with notable gains on challenging tasks and evidence that class-conditioned prompts yield higher-quality descriptions and stronger alignment. The method offers a practical path for small teams to leverage LLM capabilities to boost perceptual tasks without large-scale LLM training, potentially accelerating real-world deployment of vision systems.
Abstract
The upsurge in pre-trained large models started by ChatGPT has swept across the entire deep learning community. Such powerful models demonstrate advanced generative ability and multimodal understanding capability, and have quickly set new state-of-the-art results on a variety of benchmarks. The pre-trained LLM usually serves as a universal AI model that can conduct various tasks such as article analysis and image comprehension. However, due to the prohibitively high memory and computational cost of implementing such a large model, conventional models (such as CNNs and ViTs) are still essential for many visual perception tasks. In this paper, we propose to enhance the representation ability of ordinary vision models on perception tasks (e.g., image classification) by taking advantage of off-the-shelf large pre-trained models. We present a new learning framework, dubbed GPT4Image, in which the knowledge of the large pre-trained models is extracted to help CNNs and ViTs learn better representations and achieve higher performance. First, we curate a high-quality description set by prompting a multimodal LLM to generate descriptions for training images. Then, these detailed descriptions are fed into a pre-trained encoder to extract text embeddings that encode the rich semantics of the images. During training, the text embeddings serve as an extra supervisory signal and are aligned with the image representations learned by vision models. The alignment process helps vision models achieve better performance with the aid of pre-trained LLMs. We conduct extensive experiments to verify the effectiveness of the proposed algorithm on various visual perception tasks for heterogeneous model architectures.
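The training objective described above, a standard classification loss plus an alignment term between image representations and pre-computed text embeddings, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify the form of the alignment loss, so the sketch assumes an InfoNCE-style contrastive term, following the TL;DR's mention of a contrastive loss. The function names, the `temperature` and `align_weight` values, and the assumption that image features have already been projected to the text-embedding dimension are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the target index in each row.
    logp = log_softmax(logits)
    return -logp[np.arange(len(labels)), labels].mean()

def alignment_loss(image_feats, text_embeds, temperature=0.1):
    # Assumed InfoNCE-style term: row i of image_feats should be most similar
    # to row i of text_embeds (the description of the same image) and less
    # similar to the descriptions of the other images in the batch.
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_embeds)
    sim = img @ txt.T / temperature
    targets = np.arange(sim.shape[0])
    return cross_entropy(sim, targets)

def total_loss(class_logits, labels, image_feats, text_embeds, align_weight=1.0):
    # Classification loss plus the weighted cross-modal alignment term.
    # align_weight is a hypothetical hyperparameter, not a value from the paper.
    cls = cross_entropy(class_logits, labels)
    align = alignment_loss(image_feats, text_embeds)
    return cls + align_weight * align

# Toy usage with random stand-ins for one mini-batch of 8 images.
rng = np.random.default_rng(0)
class_logits = rng.normal(size=(8, 10))   # vision model outputs for 10 classes
labels = rng.integers(0, 10, size=8)      # ground-truth class indices
image_feats = rng.normal(size=(8, 64))    # projected image representations
text_embeds = rng.normal(size=(8, 64))    # frozen text embeddings of descriptions
loss = total_loss(class_logits, labels, image_feats, text_embeds)
```

In this sketch the text embeddings are treated as fixed targets computed once from the generated descriptions, so only the vision model (and any projection layer) would receive gradients in a real training loop.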
