Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation
Zhiyuan Li, Dongnan Liu, Heng Wang, Chaoyi Zhang, Weidong Cai
TL;DR
RaPSG tackles annotation-free image captioning by distilling knowledge from frozen large pre-trained models and reinforcing that supervision with a retrieval module that pulls relevant region descriptions from Visual Genome. It generates multiple high-quality pseudo sentences in two stages (region retrieval plus summarization, followed by distillation with LLaMA-7B) and stabilizes training with a fluency filter and a CLIP-guided contrastive objective. The approach achieves state-of-the-art results on zero-shot, unsupervised, semi-supervised, and cross-domain benchmarks with far fewer trainable parameters and less external data than large pre-trained models. The result is a data-efficient, flexible captioning framework that reduces dependence on costly supervised data and suits diverse deployment scenarios.
Abstract
Training an image captioner without annotated image-sentence pairs has recently gained traction. Previous methods are limited either by mismatched corpora, which yield inaccurate pseudo annotations, or by resource-intensive pre-training. To address these limitations, we propose a new strategy in which prior knowledge from large pre-trained models (LPMs) is distilled and used as supervision, with a retrieval process integrated to further reinforce its effectiveness. Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which efficiently retrieves highly relevant short region descriptions from the mismatched corpora and uses them to generate a variety of high-quality pseudo sentences via LPMs. Additionally, we introduce a fluency filter and a CLIP guidance objective to enhance contrastive information learning. Experimental results indicate that our method outperforms state-of-the-art (SOTA) captioning models across zero-shot, unsupervised, semi-supervised, and cross-domain settings. Code is available at: https://github.com/Zhiyuan-Li-John/RaPSG.
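The retrieval step described above can be illustrated with a minimal sketch. It assumes that the image and each region description have already been embedded into a shared vector space (for example by a CLIP-style encoder) and ranks descriptions by cosine similarity. The function names, the toy embeddings, and the choice of cosine similarity are illustrative assumptions, not the RaPSG implementation; the released code should be consulted for the actual retrieval module.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(
    image_emb: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k region descriptions most similar to the image embedding.

    `corpus` is a hypothetical list of (description, embedding) pairs.
    """
    scored = [(text, cosine(image_emb, emb)) for text, emb in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy, hand-made embeddings standing in for real encoder outputs (assumption).
toy_corpus = [
    ("a dog on the grass", [0.9, 0.1, 0.0]),
    ("a red frisbee", [0.7, 0.7, 0.0]),
    ("a parked car", [0.0, 0.1, 0.9]),
]
toy_image = [1.0, 0.2, 0.0]

top = retrieve_top_k(toy_image, toy_corpus, k=2)
for text, score in top:
    print(f"{score:.3f}  {text}")
```

In a pipeline of the kind the abstract describes, the retrieved descriptions would then be passed to a large pre-trained model to be summarized into pseudo sentences; that step is omitted here.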
