Text-Only Training for Image Captioning using Noise-Injected CLIP
David Nukrai, Ron Mokady, Amir Globerson
TL;DR
CapDec reformulates image captioning as decoding CLIP embeddings back into text: the CLIP encoder stays frozen, and only a decoder is trained, using text-only data. To counteract the modality gap between image and text embeddings, it injects Gaussian noise into the CLIP text embeddings during training, so that a decoder trained on text embeddings remains usable on image embeddings, without any paired image-caption data. The method achieves state-of-the-art zero-shot results on standard and style-guided captioning benchmarks and demonstrates competitive cross-domain generalization; an analysis of the noise level finds an optimum around ε^2 ≈ 0.016. Because training needs only text, the approach supports flexible style transfer and reduces reliance on large captioned image datasets.
Abstract
We consider the task of image captioning using only the CLIP model and additional text data at training time, and no additional captioned images. Our approach relies on the fact that CLIP is trained to make visual and textual embeddings similar. Therefore, we only need to learn how to translate CLIP textual embeddings back into text, and we can learn how to do this by learning a decoder for the frozen CLIP text encoder using only text. We argue that this intuition is "almost correct" because of a gap between the embedding spaces, and propose to rectify this via noise injection during training. We demonstrate the effectiveness of our approach by showing SOTA zero-shot image captioning across four benchmarks, including style transfer. Code, data, and models are available on GitHub.
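The noise-injection step described above can be illustrated with a minimal sketch. This is not the authors' released code: the function names, the L2 normalization before adding noise, and the toy 4-dimensional vector standing in for a CLIP text embedding are all assumptions made for illustration. The default variance of 0.016 is taken from the noise level reported in the TL;DR.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (assumed preprocessing, not confirmed by the source)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inject_noise(text_embedding, variance=0.016, rng=None):
    """Return the normalized embedding plus zero-mean Gaussian noise.

    `variance` is the per-coordinate noise variance; 0.016 mirrors the
    value quoted in the summary. `rng` is a hypothetical parameter that
    makes the sketch reproducible.
    """
    if rng is None:
        rng = random.Random()
    std = math.sqrt(variance)
    unit = l2_normalize(text_embedding)
    return [x + rng.gauss(0.0, std) for x in unit]


# Toy stand-in for a CLIP text embedding (real ones have hundreds of dimensions).
embedding = [0.5, -1.0, 2.0, 0.25]
noisy = inject_noise(embedding, rng=random.Random(0))
```

In a training loop under these assumptions, `noisy` rather than the clean embedding would be fed to the decoder, with the original caption as the reconstruction target. At inference time the decoder would receive the CLIP image embedding with no noise added.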
