Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation

Zhiyuan Li, Dongnan Liu, Heng Wang, Chaoyi Zhang, Weidong Cai

TL;DR

RaPSG tackles annotation-free image captioning by distilling knowledge from frozen large pre-trained models and reinforcing learning with a retrieval module that pulls relevant region descriptions from Visual Genome. It generates multiple high-quality pseudo sentences via two-stage processing (region retrieval plus summarization, followed by distillation with LLaMA-7B) and stabilizes training through a fluency filter and a CLIP-guided contrastive objective. The approach achieves state-of-the-art results across zero-shot, unsupervised, semi-supervised, and cross-domain benchmarks with far fewer trainable parameters and less external data than large pre-trained models. This yields a data-efficient, flexible captioning framework suitable for diverse deployment scenarios and reduces dependence on costly supervised data.

Abstract

Recently, training an image captioner without annotated image-sentence pairs has gained traction. Previous methods are limited either by mismatched corpora, which yield inaccurate pseudo annotations, or by resource-intensive pre-training. To alleviate these challenges, we propose a new strategy in which the prior knowledge of large pre-trained models (LPMs) is distilled and leveraged as supervision, and a retrieval process is integrated to further reinforce its effectiveness. Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which efficiently retrieves highly relevant short region descriptions from the mismatched corpora and uses them to generate a variety of high-quality pseudo sentences via LPMs. Additionally, we introduce a fluency filter and a CLIP guidance objective to enhance contrastive information learning. Experimental results indicate that our method outperforms SOTA captioning models across various settings, including zero-shot, unsupervised, semi-supervised, and cross-domain scenarios. Code is available at: https://github.com/Zhiyuan-Li-John/RaPSG.

Paper Structure

This paper contains 17 sections, 2 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: A comparison between the whole-sentence crawling strategy [kakaobrain2022coyo-700m] and our generation-based RaPSG method.
  • Figure 2: An overview of our proposed framework. It is structured around three core components: RaPSG, the fluency filter, and CLIP guidance. Notably, both the fluency filter and the CLIP guidance module are frozen, so they require no further parameter training.
  • Figure 3: Stage I of the RaPSG framework. First, we retrieve the top-$k$ region descriptions from VG [krishna2017visual] according to their matching scores computed by the CLIP [radford2021learning] model. Then, we use the Sent-BERT [reimers2019sentence] model to divide them into four groups by semantic similarity. Finally, the BART [lewis2019bart] model summarizes the grouped descriptions into four pseudo sentences.
  • Figure 4: Stage II of the RaPSG framework. First, we use the image together with the preceding four pseudo sentences as supervision to train the image captioner. Once it is trained, we freeze the captioner and generate a predicted sentence. To enhance the generation process, we supply the top-$k$ most relevant region descriptions as supplementary material to obtain the fifth output.
  • Figure 5: A comparison of two pseudo sentences from the RaPSG process. To a human reader, the first sentence appears more fluent than the second.
  • ...and 2 more figures
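
The Stage-I steps described in the Figure 3 caption (retrieve, group, summarize) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. Embeddings are passed in as plain lists; in the paper they would come from CLIP (for retrieval) and Sent-BERT (for grouping). The grouping uses a simple k-means-style procedure on cosine similarity, which is an assumption: the caption only says the descriptions are divided into four groups by semantic similarity. `join_summarizer` is a hypothetical placeholder standing in for BART, and all function names are illustrative.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(image_emb: Vector, desc_embs: Sequence[Vector], k: int) -> List[int]:
    """Indices of the k descriptions whose embeddings best match the image."""
    scores = [cosine(image_emb, e) for e in desc_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def group_by_similarity(embs: Sequence[Vector], n_groups: int = 4,
                        n_iters: int = 10) -> List[List[int]]:
    """Assumed clustering step: k-means-style grouping on cosine similarity."""
    if len(embs) < n_groups:
        raise ValueError("need at least n_groups embeddings")
    # Deterministic farthest-point initialisation: each new center is the
    # point least similar to the centers chosen so far.
    centers = [list(embs[0])]
    while len(centers) < n_groups:
        far = min(range(len(embs)),
                  key=lambda i: max(cosine(embs[i], c) for c in centers))
        centers.append(list(embs[far]))
    assign: List[int] = []
    for _ in range(n_iters):
        new_assign = [max(range(n_groups), key=lambda g: cosine(e, centers[g]))
                      for e in embs]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for g in range(n_groups):
            members = [embs[i] for i, a in enumerate(assign) if a == g]
            if members:  # keep the old center if a group is empty
                centers[g] = [sum(col) / len(members) for col in zip(*members)]
    return [[i for i, a in enumerate(assign) if a == g] for g in range(n_groups)]


def join_summarizer(texts: List[str]) -> str:
    """Hypothetical placeholder for an abstractive summarizer such as BART."""
    return "; ".join(texts)


def stage_one(image_emb: Vector, descriptions: Sequence[str],
              clip_embs: Sequence[Vector], sent_embs: Sequence[Vector],
              k: int, n_groups: int = 4,
              summarize: Callable[[List[str]], str] = join_summarizer) -> List[str]:
    """Retrieve top-k descriptions, group them, and summarize each group."""
    top = retrieve_top_k(image_emb, clip_embs, k)
    groups = group_by_similarity([sent_embs[i] for i in top], n_groups)
    # Empty groups (possible with duplicate embeddings) are skipped.
    return [summarize([descriptions[top[j]] for j in g]) for g in groups if g]
```

To approximate the described pipeline more closely, one would pass CLIP image and text embeddings as `image_emb` and `clip_embs`, Sent-BERT sentence embeddings as `sent_embs`, and a BART-based callable as `summarize`. The separation into three small functions is a design choice for illustration only.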