Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction

Rui Fonseca, Bruno Martins, Gil Rocha

TL;DR

TOMCap tackles image captioning without image-caption pairs by combining retrieval-augmented generation, CLIP-based latent representations, and a modality gap correction that aligns image and text embeddings. The method trains only the cross-attention and rsLoRA components while drawing on a large textual datastore to guide caption generation; at inference, the prompt is conditioned on the image through the captions retrieved for it. Empirical results on MSCOCO and NoCaps show that TOMCap surpasses previous training-free and text-only methods and is robust to retrieval configurations and domain shifts. The work highlights the importance of modality gap correction and retrieval quality for connecting textual priors to visual grounding in a text-only regime.

Abstract

Image captioning has drawn considerable attention from the natural language processing and computer vision fields. Aiming to reduce the reliance on curated data, several studies have explored image captioning without any human-annotated image-text pairs for training, although existing methods are still outperformed by fully supervised approaches. This paper proposes TOMCap, an improved text-only training method that performs captioning without the need for aligned image-caption pairs. The method is based on prompting a pre-trained language model decoder with information derived from a CLIP representation, after that representation has undergone a process to reduce the modality gap. We specifically tested the combined use of retrieved example captions and latent vector representations to guide the generation process. Through extensive experiments, we show that TOMCap outperforms other training-free and text-only methods. We also analyze the impact of different choices regarding the configuration of the retrieval-augmentation and modality gap reduction components.

Paper Structure

This paper contains 21 sections, 4 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Overview of the training (red) and inference (blue) procedures associated with TOMCap.
  • Figure 2: Example illustrating the use of the considered prompt template with a set of retrieved captions (blue).
  • Figure 3: Examples of captions retrieved and generated by TOMCap.
  • Figure 5: KNOR scores for $k\in\{5,10,15,50,100\}$. On the left, we use a comparison between not applying any correction ("baseline") and applying the mean/standard deviation correction. On the right, we show the impact of varying the Gaussian noise magnitude.
  • Figure : No correction.
  • ...and 3 more figures
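
The Figure 5 caption refers to a mean/standard deviation correction and to Gaussian noise of varying magnitude. The sketch below shows one plausible reading of those two operations and is not taken from the paper: it assumes the correction matches per-dimension moments of image embeddings to those of text embeddings, and that noise is added to text embeddings before re-normalization. The function names (`fit_stats`, `match_moments`, `add_gaussian_noise`) and the availability of image embeddings for estimating statistics are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm, as is common for CLIP-style embeddings."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def fit_stats(emb, eps=1e-8):
    """Per-dimension mean and standard deviation of a set of embeddings."""
    return emb.mean(axis=0), emb.std(axis=0) + eps


def match_moments(img_emb, img_stats, txt_stats):
    """Assumed form of the correction: standardize image embeddings with
    image-side statistics, then rescale and shift to text-side statistics."""
    img_mean, img_std = img_stats
    txt_mean, txt_std = txt_stats
    return (img_emb - img_mean) / img_std * txt_std + txt_mean


def add_gaussian_noise(txt_emb, sigma, rng):
    """Assumed training-time perturbation: add isotropic Gaussian noise
    with standard deviation `sigma` to text embeddings, then re-normalize."""
    noisy = txt_emb + rng.normal(0.0, sigma, size=txt_emb.shape)
    return l2_normalize(noisy)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for embeddings; real ones would come from CLIP encoders.
    txt = rng.normal(0.0, 1.0, size=(500, 8))
    img = rng.normal(3.0, 2.0, size=(500, 8))
    corrected = match_moments(img, fit_stats(img), fit_stats(txt))
    gap_before = np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))
    gap_after = np.linalg.norm(corrected.mean(axis=0) - txt.mean(axis=0))
    print(f"mean gap before: {gap_before:.3f}, after: {gap_after:.3f}")
```

In this reading, the noise magnitude `sigma` is the quantity varied in the right-hand panel of Figure 5; the paper's exact formulation may differ.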