Text-Only Training for Image Captioning using Noise-Injected CLIP

David Nukrai, Ron Mokady, Amir Globerson

TL;DR

CapDec reformulates image captioning as decoding CLIP embeddings back into text using a frozen CLIP encoder and a trainable decoder, trained with text-only data. To counteract the modality gap between image and text embeddings, it injects Gaussian noise into CLIP text embeddings during training, creating a robust mapping from images to captions without paired image-caption data. The method achieves state-of-the-art zero-shot results across standard and style-guided captioning benchmarks, and demonstrates competitive cross-domain generalization, with an analysis showing an optimal noise level around ε^2 ≈ 0.016. This approach enables flexible style transfer and reduces reliance on large captioned image datasets, increasing practicality for diverse vision-language applications.

Abstract

We consider the task of image-captioning using only the CLIP model and additional text data at training time, and no additional captioned images. Our approach relies on the fact that CLIP is trained to make visual and textual embeddings similar. Therefore, we only need to learn how to translate CLIP textual embeddings back into text, and we can learn how to do this by learning a decoder for the frozen CLIP text encoder using only text. We argue that this intuition is "almost correct" because of a gap between the embedding spaces, and propose to rectify this via noise injection during training. We demonstrate the effectiveness of our approach by showing SOTA zero-shot image captioning across four benchmarks, including style transfer. Code, data, and models are available on GitHub.
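The noise-injection step described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' released code: the function name `inject_noise`, the L2 normalization applied before the noise is added, and the random 512-dimensional stand-in embeddings are all assumptions made for the example. The variance of 0.016 is the value quoted in the TL;DR.

```python
import numpy as np


def inject_noise(text_embeddings, variance=0.016, rng=None):
    """Return L2-normalized embeddings plus zero-mean Gaussian noise.

    `variance` is the per-coordinate noise variance; the normalization
    step is an assumption of this sketch, not something the text states.
    """
    rng = np.random.default_rng() if rng is None else rng
    emb = np.asarray(text_embeddings, dtype=np.float64)
    # Normalize each row to unit length, guarding against zero vectors.
    norms = np.linalg.norm(emb, axis=-1, keepdims=True)
    emb = emb / np.clip(norms, 1e-12, None)
    # Isotropic Gaussian noise with standard deviation sqrt(variance).
    noise = rng.normal(loc=0.0, scale=np.sqrt(variance), size=emb.shape)
    return emb + noise


# Random vectors stand in for frozen CLIP text embeddings (hypothetical data).
rng = np.random.default_rng(0)
fake_text_embeddings = rng.normal(size=(1000, 512))
noisy = inject_noise(fake_text_embeddings, variance=0.016, rng=rng)
```

During text-only training, a decoder would receive `noisy` as its conditioning input and be trained to reproduce the original caption. At inference the image embedding would be passed in without added noise.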

Paper Structure (21 sections, 1 equation, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Overview of our CapDec captioning approach. (a) An illustration of the CLIP joint embedding space. Embedded text is relatively close to its corresponding visual embedding, but with a certain gap. (b) CapDec trains a model that decodes the CLIP embedding of text $T$ back to text $T$, after noise-injection. The encoders remain frozen. (c) At inference, CapDec simply decodes the embedding of an image using the trained decoder.
  • Figure 2: Examples of styled captions produced by CapDec on FlickrStyle10K (Gan et al., 2017).
  • Figure 3: The effect of the noise variance on MS-COCO performance.
  • Figure 4: Performance of different methods as a function of the noise level (see the noise ablation section). We show the CIDEr metric (higher is better); other metrics show similar trends. CapDec here is the same as in Fig. 3.
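
The training and inference paths from Figure 1 (b) and (c) can be sketched as two small functions. Everything below is a hypothetical stand-in rather than a CLIP or CapDec API: `encode_text`, `encode_image`, `decode`, and `perturb` are placeholder callables, the two-dimensional "embedding space" is a toy, and the nearest-neighbor `toy_decode` merely substitutes for a trained language-model decoder.

```python
import math
import random


def make_training_example(caption, encode_text, perturb):
    """Figure 1(b): decoder input is the perturbed text embedding,
    and the target is the same caption (text-only supervision)."""
    return perturb(encode_text(caption)), caption


def caption_image(image, encode_image, decode):
    """Figure 1(c): at inference the image embedding is decoded directly,
    with no noise added."""
    return decode(encode_image(image))


# --- Toy stand-ins (all hypothetical) ---
toy_space = {"a dog on grass": [1.0, 0.0], "a red car": [0.0, 1.0]}


def toy_encode_text(caption):
    return list(toy_space[caption])


def toy_encode_image(image):
    # Place the image near, but not on, its caption to mimic the modality gap.
    return [value + 0.1 for value in toy_space[image["depicts"]]]


def toy_decode(embedding):
    # Nearest caption by Euclidean distance; stands in for a trained decoder.
    return min(toy_space, key=lambda c: math.dist(toy_space[c], embedding))


toy_rng = random.Random(0)


def toy_perturb(embedding, variance=0.016):
    # Add zero-mean Gaussian noise with the given per-coordinate variance.
    std = math.sqrt(variance)
    return [value + toy_rng.gauss(0.0, std) for value in embedding]


decoder_input, target = make_training_example("a dog on grass", toy_encode_text, toy_perturb)
predicted = caption_image({"depicts": "a red car"}, toy_encode_image, toy_decode)
```

The asymmetry is the point of the sketch: only the text encoder and the perturbation appear on the training path, and only the image encoder appears on the inference path, while the decoder is shared between the two.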