Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning

Jianjie Luo, Jingwen Chen, Yehao Li, Yingwei Pan, Jianlin Feng, Hongyang Chao, Ting Yao

TL;DR

This work tackles zero-shot image captioning (ZIC) by leveraging synthetic images from text-to-image diffusion to train captioning models, addressing semantic misalignment caused by unfaithful details. It introduces PCM-Net, which uses a patch-wise cross-modal feature mix-up to align fine-grained spatial features with salient textual concepts, and a CLIP-weighted cross-entropy loss to emphasize high-quality synthetic pairs. The method is validated on MSCOCO and Flickr30k, delivering state-of-the-art results in both in-domain and cross-domain ZIC and showing robustness across ablations and competing frameworks. Overall, PCM-Net demonstrates that fine-grained cross-modal alignment with diffusion-based synthetic data can substantially improve zero-shot captioning with reduced reliance on real paired data.

Abstract

Zero-shot image captioning, where only text data is available for training, has recently gained increasing attention. The remarkable progress in text-to-image diffusion models presents the potential to address this task by employing synthetic image-caption pairs generated by this pre-trained prior. Nonetheless, defective details in the salient regions of the synthetic images introduce semantic misalignment between the synthetic image and the text, leading to compromised results. To address this challenge, we propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism that adaptively mitigates unfaithful content in a fine-grained manner during training. It can be integrated into most encoder-decoder frameworks, yielding our PCM-Net. Specifically, for each input image, salient visual concepts in the image are first detected based on the image-text similarity in CLIP space. Next, the patch-wise visual features of the input image are selectively fused with the textual features of the salient visual concepts, leading to a mixed-up feature map with less defective content. Finally, a visual-semantic encoder refines the derived feature map, which is then fed into the sentence decoder for caption generation. Additionally, to facilitate model training with synthetic data, a novel CLIP-weighted cross-entropy loss is devised to prioritize high-quality image-text pairs over low-quality ones. Extensive experiments on the MSCOCO and Flickr30k datasets demonstrate the superiority of our PCM-Net over state-of-the-art VLM-based approaches. Notably, our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning. The synthetic dataset SynthImgCap and code are available at https://jianjieluo.github.io/SynthImgCap.
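
The idea behind the CLIP-weighted cross-entropy loss can be illustrated with a minimal sketch. The abstract does not give the formula, so the weighting scheme here (a softmax over per-pair CLIP similarity scores within a batch) is an assumption, and the names `token_cross_entropy`, `clip_weighted_loss`, and `temperature` are hypothetical. The only property taken from the text is that pairs with higher image-text similarity contribute more to the loss.

```python
import math


def token_cross_entropy(log_probs, target_ids):
    """Mean negative log-likelihood of the target tokens for one caption.

    log_probs: one list of vocabulary log-probabilities per decoding step.
    target_ids: the ground-truth token index for each step.
    """
    nll = [-step[t] for step, t in zip(log_probs, target_ids)]
    return sum(nll) / len(nll)


def clip_weighted_loss(per_caption_losses, clip_scores, temperature=1.0):
    """Weight each caption loss by a softmax over CLIP similarity scores.

    ASSUMPTION: softmax weighting is an illustrative choice, not the
    paper's confirmed formula. Returns (weighted loss, weights).
    """
    scaled = [s / temperature for s in clip_scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    loss = sum(w * l for w, l in zip(weights, per_caption_losses))
    return loss, weights
```

With equal CLIP scores the result reduces to the plain batch mean of the per-caption losses. Lowering `temperature` concentrates the weight on the best-aligned synthetic pairs.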

Paper Structure

This paper contains 16 sections, 11 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Training paradigms for image captioning: (a) Training with well-aligned image-sentence pairs for Supervised Image Captioning (SIC); (b) Training with text-only data in a VLM-based model for Zero-shot Image Captioning (ZIC); (c) Training with synthetic image-text pairs in our PCM-Net for ZIC.
  • Figure 2: An overview of our proposed PCM-Net. The flawed or unfaithful patches (e.g., poor facial details or missing limbs) in the salient regions of synthetic images would be replaced by semantically aligned textual patches in (a) Patch-wise Cross-modal Feature Mix-up. The mixed-up features are further encoded by (b) Visual-semantic Encoder, followed by (c) Sentence Decoder for caption generation.
  • Figure 3: Examples of image captioning results generated by DeCap (Li et al., 2023), ViECap (Fei et al., 2023), and our PCM-Net, coupled with the corresponding ground-truth sentences (Reference).
  • Figure 4: Ablation study on hyperparameters in PCM-Net on MSCOCO.
  • Figure 5: Performance comparison of PCM-Net and SynTIC on MSCOCO with various proportions of training data.
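
The patch replacement described in the Figure 2 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses hard replacement with a fixed cosine-similarity threshold `tau`, whereas the paper describes an adaptive, selective fusion whose exact rule is not given here. The function names and the threshold are hypothetical, and features are plain Python lists to keep the example dependency-free.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def patch_mixup(patch_feats, concept_feats, tau=0.5):
    """Swap salient patch features for the best-matching concept feature.

    patch_feats: per-patch visual features (e.g. from a CLIP image encoder).
    concept_feats: text features of the detected salient concepts.
    tau: ASSUMED similarity threshold above which a patch is treated as
         belonging to a salient region and replaced.
    Returns (mixed features, index of the concept used per patch or None).
    """
    mixed, replaced = [], []
    for patch in patch_feats:
        sims = [cosine(patch, concept) for concept in concept_feats]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else None
        if best is not None and sims[best] >= tau:
            mixed.append(list(concept_feats[best]))
            replaced.append(best)
        else:
            mixed.append(list(patch))
            replaced.append(None)
    return mixed, replaced
```

In this sketch the mixed-up feature map keeps the original patch order, so it could be passed to a downstream encoder unchanged. Patches that match no concept closely enough stay purely visual.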