DualCap: Enhancing Lightweight Image Captioning via Dual Retrieval with Similar Scenes Visual Prompts
Binbin Li, Guimiao Yang, Zisen Qi, Haiping Wang, Yu Ding
TL;DR
The paper addresses the efficiency-accuracy trade-off in image captioning by tackling the lack of visual grounding in lightweight, retrieval-augmented models. It proposes DualCap, a framework with dual retrieval: an image-to-text path supplies a textual prompt, and an image-to-image path yields a scene-keyword visual prompt. The keyword prompt is fused with the image features through a lightweight semantic fusion network (SFN) and a residual visual feature update $V' = V + Z_{kp}$. The approach uses a frozen CLIP-ViT-B/32 encoder and a GPT-2 decoder, trains only the cross-attention layers and the SFN, and achieves state-of-the-art performance among lightweight models on COCO, Flickr30k, and NoCaps with as few as 11M trainable parameters. The results show improved fine-grained visual grounding and generalization to novel objects at competitive inference times, highlighting the practical value of decoupled retrieval and keyword-grounded visual prompts for efficient, high-quality captioning.
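The residual update $V' = V + Z_{kp}$ can be illustrated with a small sketch. The internal design of the semantic fusion network is not described in this summary, so the code below assumes a single-head, projection-free cross-attention in which each visual token attends over the keyword embeddings. The names `fuse_keywords` and `residual_update` are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_keywords(V, K):
    """Assumed stand-in for the fusion network (not the paper's SFN).

    V: list of visual token vectors, K: list of keyword vectors,
    all of the same dimension d. Each visual token attends over the
    keyword vectors and receives their weighted average as Z_kp.
    """
    d = len(V[0])
    Z = []
    for v in V:
        # Scaled dot-product score between this visual token and each keyword.
        scores = [sum(a * b for a, b in zip(v, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        Z.append([sum(weights[j] * K[j][i] for j in range(len(K)))
                  for i in range(d)])
    return Z


def residual_update(V, Z):
    """V' = V + Z_kp, applied token by token."""
    return [[a + b for a, b in zip(v, z)] for v, z in zip(V, Z)]


# Toy example: two visual tokens, one keyword embedding.
V = [[1.0, 0.0], [0.0, 1.0]]
K = [[0.5, 0.5]]
Z_kp = fuse_keywords(V, K)
V_prime = residual_update(V, Z_kp)
print(V_prime)  # [[1.5, 0.5], [0.5, 1.5]]
```

Because the update is residual, the original features $V$ pass through unchanged whenever $Z_{kp}$ is zero, which is one plausible reason such a form suits a frozen encoder.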
Abstract
Recent lightweight retrieval-augmented image captioning models often use retrieved data solely as text prompts. This leaves the original visual features unenhanced and creates a semantic gap, particularly for fine object details and complex scenes. To address this limitation, we propose DualCap, a novel approach that enriches the visual representation by generating a visual prompt from retrieved similar images. Our model employs a dual retrieval mechanism: standard image-to-text retrieval supplies text prompts, and a novel image-to-image retrieval sources visually analogous scenes. Specifically, salient keywords and phrases are derived from the captions of visually similar scenes to capture key objects and scene details. These textual features are then encoded and integrated with the original image features through a lightweight, trainable feature fusion network. Extensive experiments demonstrate that our method achieves competitive performance while requiring fewer trainable parameters than previous visual-prompting captioning approaches.
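The dual retrieval described above can be sketched on a toy datastore. This is a minimal illustration under stated assumptions: the embeddings are hand-written 3-d vectors standing in for features from a shared image-text space, retrieval uses cosine similarity, and keywords are picked by word frequency after stopword removal. The abstract does not specify the keyword extraction method, so `scene_keywords`, `top_k`, and the stopword list are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query, keys, k):
    """Indices of the k keys most similar to the query, best first."""
    order = sorted(range(len(keys)),
                   key=lambda i: cosine(query, keys[i]),
                   reverse=True)
    return order[:k]


# Assumed stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "on", "in", "of", "with", "and", "is", "at"}


def scene_keywords(captions, n):
    """Assumed keyword picker: the n most frequent non-stopword tokens."""
    counts = Counter(w for c in captions
                     for w in c.lower().split()
                     if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]


# Toy datastore: entry i has an image embedding, a caption embedding,
# and the caption text itself.
captions = ["a dog runs on the beach",
            "a dog plays with a ball",
            "a city street at night"]
caption_embs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]
image_embs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

query_image = [0.95, 0.05, 0.0]

# Path 1: image-to-text retrieval gives captions for the text prompt.
text_idx = top_k(query_image, caption_embs, k=2)
text_prompt = [captions[i] for i in text_idx]

# Path 2: image-to-image retrieval finds similar scenes; their captions
# are reduced to keywords that would seed the visual prompt.
scene_idx = top_k(query_image, image_embs, k=2)
keywords = scene_keywords([captions[i] for i in scene_idx], n=2)

print(text_prompt)
print(keywords)
```

Keeping the two paths separate means the text prompt and the keyword-based visual prompt can draw on different neighbors, since the nearest captions and the nearest images need not coincide.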
