Text Data-Centric Image Captioning with Interactive Prompts

Yiyu Wang, Hao Luo, Jungang Xu, Yingfei Sun, Fan Wang

TL;DR

TIPCap tackles image captioning with limited and diverse data by introducing a text-centric framework that leverages CLIP and GPT-2. A multivariate Gaussian mapping $\mathcal{N}(\vec{\mu}, \Sigma)$ aligns text embeddings to the image space, complemented by a reverse mapping and an interactive prompts module that allows user guidance during generation. The method supports four data configurations, with trainable components and KL-based regularization to handle text-only or web data scenarios, achieving state-of-the-art performance among weakly supervised approaches on MS-COCO and Flickr30K and strong cross-domain generalization. This work advances practical captioning by reducing reliance on high-quality paired data and enabling flexible, prompt-informed caption generation. The proposed approach offers a data-efficient paradigm for deploying captioning systems in real-world, data-scarce environments.

Abstract

Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data. Recently, large-scale vision and language models (e.g., CLIP) and large-scale generative language models (e.g., GPT-2) have shown strong performance in various tasks, which also provides new solutions for image captioning with web paired data, unpaired data, or even text-only data. Among them, the mainstream solution is to project image embeddings into the text embedding space with the assistance of consistent representations between image-text pairs from the CLIP model. However, current methods still face several challenges: adapting to the diversity of data configurations in a unified solution, accurately estimating the image-text embedding bias, and correcting unsatisfactory prediction results in the inference stage. This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap. 1) We consider four different settings that gradually reduce the dependence on paired data. 2) We construct a mapping module driven by a multivariate Gaussian distribution to mitigate the modality gap, which is applicable to all four settings. 3) We propose a prompt interaction module that can incorporate optional prompt information before generating captions. Extensive experiments show that TIPCap outperforms other weakly supervised or unsupervised image captioning methods and achieves new state-of-the-art performance on two widely used datasets, i.e., MS-COCO and Flickr30K.
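
To make the idea of a Gaussian-driven mapping concrete, here is a minimal NumPy sketch. It assumes the mapping adds an offset sampled from $\mathcal{N}(\vec{\mu}, \Sigma)$ to a text embedding, and that $\vec{\mu}$ and $\Sigma$ are estimated from the differences between paired image and text embeddings. Both points are assumptions made for illustration; the abstract does not spell out the exact form. The function names `estimate_gap` and `map_text_to_image` are hypothetical, and random toy vectors stand in for real CLIP embeddings.

```python
import numpy as np


def estimate_gap(image_emb, text_emb):
    """Estimate mean and covariance of the (image - text) embedding difference.

    Assumption: paired embeddings are available (one row per image-text pair).
    """
    diff = image_emb - text_emb
    mu = diff.mean(axis=0)
    sigma = np.cov(diff, rowvar=False)  # columns are embedding dimensions
    return mu, sigma


def map_text_to_image(text_emb, mu, sigma, rng):
    """Shift text embeddings toward the image space with Gaussian offsets.

    Assumption: the mapping is additive, i.e. text_emb + eps, eps ~ N(mu, sigma).
    """
    eps = rng.multivariate_normal(mu, sigma, size=text_emb.shape[0])
    return text_emb + eps


# Toy data standing in for CLIP embeddings (real ones have far more dimensions).
rng = np.random.default_rng(0)
dim = 8
text_emb = rng.normal(size=(256, dim))
true_offset = np.linspace(-0.5, 0.5, dim)  # synthetic "modality gap"
image_emb = text_emb + true_offset + 0.1 * rng.normal(size=(256, dim))

mu, sigma = estimate_gap(image_emb, text_emb)
mapped = map_text_to_image(text_emb, mu, sigma, rng)
```

On this toy data the estimated `mu` recovers the synthetic offset closely, so `mapped` behaves like a noisy image-side version of each text embedding. In settings without paired data, the parameters would have to come from elsewhere (for example, be learned), which this sketch does not cover.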

Paper Structure (31 sections, 15 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Comparison of different methods. $E_\mathcal{I}$, $E_\mathcal{T}$, and $D_\mathcal{T}$ indicate the image encoder, text encoder, and text decoder, respectively. (a) Supervised method. (b) ZeroCap and MAGIC. (c) DeCap. (d) CapDec and CLOSE. (e) Our approach.
  • Figure 2: The overall framework of our approach. TIPCap is based on a pre-trained CLIP model and a pre-trained GPT-2 model. During training, we first use CLIP to extract the CLIP text embedding and project it into the CLIP image embedding space with a mapping module; then we reconstruct the text embedding with a reverse mapping module and inject optional prompt information; finally, GPT-2 generates the description. In the inference stage, the mapping module is no longer needed: the CLIP image embedding is fed directly into the reverse mapping module and the follow-up modules to generate captions.
  • Figure 3: Examples of constructed full prompt sentences during stage 2 training.
  • Figure 4: Examples of captions generated by TIPCap with simulated interactive prompts; images come from the MS-COCO Karpathy test split. "Reference" indicates the caption generated without prompt information; "Prompt" indicates the simulated user-specified prompt information; "Prediction" shows the newly generated caption with prompt information.
  • Figure 5: Histogram visualization of the difference between CLIP image and text embeddings on the MS-COCO training set, where orange and green indicate the histogram statistics over all dimensions (global) and over specific dimensions (local), respectively.
  • ...and 1 more figure
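
The training and inference paths described in the Figure 2 caption can be sketched as follows. This is an illustrative toy only: the reverse mapping is replaced by a single random linear map, the mapping is assumed to add Gaussian noise, the prompt is assumed to be appended as an extra vector, and CLIP and GPT-2 are not called at all. The names `W_reverse`, `training_input`, and `inference_input` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy size; real CLIP embeddings are much larger

# Hypothetical stand-in for the trainable reverse mapping module.
W_reverse = rng.normal(scale=0.1, size=(DIM, DIM))


def mapping(text_emb, mu, sigma):
    """Training only: move a text embedding toward the image space.

    Assumption: additive noise sampled from N(mu, sigma).
    """
    return text_emb + rng.multivariate_normal(mu, sigma)


def reverse_mapping(image_like_emb):
    """Shared by both paths: reconstruct a text-space embedding."""
    return image_like_emb @ W_reverse


def with_prompt(reconstructed, prompt_emb=None):
    """Assumption: optional prompt information is appended as an extra vector."""
    parts = [reconstructed] if prompt_emb is None else [reconstructed, prompt_emb]
    return np.stack(parts)  # sequence that a decoder such as GPT-2 would consume


def training_input(text_emb, mu, sigma, prompt_emb=None):
    # text embedding -> mapping -> reverse mapping -> (+ prompt)
    return with_prompt(reverse_mapping(mapping(text_emb, mu, sigma)), prompt_emb)


def inference_input(image_emb, prompt_emb=None):
    # image embedding -> reverse mapping -> (+ prompt); the mapping is skipped
    return with_prompt(reverse_mapping(image_emb), prompt_emb)
```

The point of the sketch is the asymmetry: only the training path passes through `mapping`, while `reverse_mapping` and the optional prompt step are shared, which is why captions can be generated from image embeddings at inference without the mapping module.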