DualCap: Enhancing Lightweight Image Captioning via Dual Retrieval with Similar Scenes Visual Prompts

Binbin Li, Guimiao Yang, Zisen Qi, Haiping Wang, Yu Ding

TL;DR

The paper addresses the efficiency-accuracy trade-off in image captioning by tackling the lack of visual grounding in retrieval-augmented, lightweight models. It proposes DualCap, a framework with dual retrieval: an image-to-text path to supply a textual prompt and an image-to-image path to generate a scene-keyword visual prompt, which are fused through a lightweight semantic fusion network and a residual visual feature update $V' = V + Z_{kp}$. The approach leverages a frozen CLIP-ViT-B/32 encoder and a GPT-2 decoder, training only the cross-attention layers and SFN, and achieves state-of-the-art performance among lightweight models on COCO, Flickr30k, and NoCaps with as few as 11M trainable parameters. The results demonstrate improved fine-grained visual grounding and generalization to novel objects, with competitive inference times, highlighting the practical value of decoupled retrieval and keyword-grounded visual prompts for efficient, high-quality captioning.
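As an illustration of the dual retrieval idea, the following is a minimal sketch, assuming both paths are nearest-neighbour searches by cosine similarity over precomputed embeddings. The toy 3-dimensional vectors stand in for CLIP features, and the names `cosine`, `top_k`, `caption_store`, and `image_store` are hypothetical; none of them come from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, store, k):
    """Return the payloads of the k store entries most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query, item[0]), reverse=True)
    return [payload for _, payload in ranked[:k]]

# Hypothetical toy embeddings; a real system would use CLIP features.
# Image-to-text path: (text embedding, caption)
caption_store = [
    ([0.9, 0.1, 0.0], "a dog runs on the beach"),
    ([0.1, 0.9, 0.0], "a plate of pasta on a table"),
    ([0.8, 0.2, 0.1], "a puppy plays in the sand"),
]
# Image-to-image path: (image embedding, captions attached to that image)
image_store = [
    ([0.85, 0.15, 0.05], ["a brown dog chasing a ball on a sandy beach"]),
    ([0.0, 0.2, 0.9], ["a red bus parked on a city street"]),
]

query_image = [1.0, 0.1, 0.0]

# Captions used for the text prompt.
text_prompt_captions = top_k(query_image, caption_store, k=2)
# Captions of visually similar scenes, later mined for keywords.
similar_scene_captions = top_k(query_image, image_store, k=1)
```

The two stores are kept separate on purpose: the first ranks captions directly against the query image, while the second ranks images and only then looks at their captions, which mirrors the decoupling the summary describes.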

Abstract

Recent lightweight retrieval-augmented image caption models often utilize retrieved data solely as text prompts, thereby creating a semantic gap by leaving the original visual features unenhanced, particularly for object details or complex scenes. To address this limitation, we propose DualCap, a novel approach that enriches the visual representation by generating a visual prompt from retrieved similar images. Our model employs a dual retrieval mechanism, using standard image-to-text retrieval for text prompts and a novel image-to-image retrieval to source visually analogous scenes. Specifically, salient keywords and phrases are derived from the captions of visually similar scenes to capture key objects and similar details. These textual features are then encoded and integrated with the original image features through a lightweight, trainable feature fusion network. Extensive experiments demonstrate that our method achieves competitive performance while requiring fewer trainable parameters compared to previous visual-prompting captioning approaches.
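The abstract does not say how salient keywords are selected from the captions of similar scenes. Purely as an illustration, the sketch below assumes a simple frequency heuristic with a small stopword list; the function `scene_keywords`, the `STOPWORDS` set, and the example captions are hypothetical and should not be read as the authors' procedure.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "the", "on", "in", "of", "with", "and", "is", "are", "at", "to"}

def scene_keywords(captions, top_n=3):
    """Return the top_n most frequent non-stopword tokens across the captions."""
    tokens = []
    for caption in captions:
        words = re.findall(r"[a-z]+", caption.lower())
        tokens.extend(w for w in words if w not in STOPWORDS)
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]

captions = [
    "a brown dog chasing a ball on a sandy beach",
    "a dog runs along the beach",
    "a dog with a ball in the surf",
]
keywords = scene_keywords(captions)
```

Words that recur across several retrieved captions ("dog", "ball", "beach" here) are the ones most likely to describe objects shared by the similar scenes, which is the intuition behind a frequency-based choice.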

Paper Structure

This paper contains 14 sections, 6 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: (a) DualCap shows the best efficiency among lightweight captioning models. (b) Comparison of total model parameters: DualCap requires only 11M parameters, demonstrating its lightweight design.
  • Figure 2: (a) DualCap generates detailed captions by feeding a GPT-2 decoder two parallel prompts: a text prompt X built from retrieved captions (I2T path) and an enhanced visual representation V'. The latter is created by using the SFN to generate a visual prompt $Z_{kp}$ from keywords of similar scenes (I2I path), which is then added to the original image features V. (b) Architecture of the SFN: a cross-attention mechanism in which the image patch features V act as the Query and attend to the scene-keyword embeddings ($E_{kp}$), which serve as Key and Value.
  • Figure 3: By using the dual retrieval mechanism, DualCap outperforms baselines such as SmallCap (Ramos et al., 2022) by capturing finer-grained detail.
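The fusion step described in the Figure 2 caption can be sketched as follows. This is a minimal single-head version, assuming scaled dot-product attention with no learned projections; the actual SFN is trainable and presumably applies projection layers, which are omitted here. The function names and the toy 2-dimensional values are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(V, E_kp):
    """Image patches V are the queries; keyword embeddings E_kp are keys and values."""
    d = len(V[0])
    Z = []
    for q in V:
        # Scaled dot-product score of this patch against every keyword.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in E_kp]
        weights = softmax(scores)
        # Weighted sum of keyword embeddings gives this patch's visual prompt row.
        Z.append([sum(w * k[j] for w, k in zip(weights, E_kp)) for j in range(d)])
    return Z

def residual_update(V, Z_kp):
    """Residual fusion V' = V + Z_kp, applied element-wise."""
    return [[v + z for v, z in zip(v_row, z_row)] for v_row, z_row in zip(V, Z_kp)]

# Toy inputs: three image patches and two keyword embeddings, dimension 2.
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
E_kp = [[1.0, 0.0], [0.0, 1.0]]

Z_kp = cross_attention(V, E_kp)
V_prime = residual_update(V, Z_kp)
```

Because the update is residual, V' has the same shape as V, so the decoder's cross-attention can consume it without any change to its interface; a patch that matches a keyword closely simply receives a larger share of that keyword's embedding.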