
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding

Hao Wu, Zhihang Zhong, Xiao Sun

TL;DR

DIR tackles the challenge of out-of-domain generalization in image captioning by introducing diffusion-guided retrieval and a high-quality retrieval database that together produce a more comprehensive understanding of images. It trains the image encoder with diffusion guidance via a denoising objective $\mathcal{L}_{\text{denoise}}$ alongside the caption objective $\mathcal{L}_{\text{caption}}$, yielding $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{caption}} + \lambda \mathcal{L}_{\text{denoise}}$ and enabling richer, more transferable image features. The retrieval database captures objects, actions, and environments and is aligned with EVA-CLIP features, while a Text Q-Former fuses retrieved text with image features to guide captioning; experiments show strong out-of-domain gains on Flickr30k and NoCaps with competitive in-domain results and no added inference cost. The work advances practical retrieval-augmented captioning by combining diffusion-based representation learning with a richer textual memory, improving caption richness and robustness to domain shifts.
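The combined objective above can be illustrated with a minimal sketch. It assumes the denoising term is a mean-squared error between the true and predicted noise, which is the common diffusion formulation but is not spelled out in this summary. The numeric values, the weight `lam`, and the helper names `mse` and `total_loss` are hypothetical and chosen only for illustration.

```python
def mse(pred, target):
    """Mean-squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(caption_loss, denoise_loss, lam):
    """L_total = L_caption + lambda * L_denoise, as stated in the TL;DR."""
    return caption_loss + lam * denoise_loss


# Toy values (assumptions, not taken from the paper).
true_noise = [0.5, -1.0, 0.25, 0.0]       # noise added to the image
predicted_noise = [0.5, -0.5, 0.25, 0.5]  # noise predicted by the diffusion model
caption_loss = 2.0                        # e.g. a cross-entropy value for the caption
lam = 0.1                                 # hypothetical weighting factor

denoise_loss = mse(predicted_noise, true_noise)
loss = total_loss(caption_loss, denoise_loss, lam)
print(denoise_loss, loss)
```

In this toy setting the denoising term only nudges the total, so the weight controls how strongly the diffusion guidance shapes the image features relative to the caption objective.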

Abstract

Image captioning models often suffer from performance degradation when applied to novel datasets, as they are typically trained on domain-specific data. To enhance generalization in out-of-domain scenarios, retrieval-augmented approaches have garnered increasing attention. However, current methods face two key challenges: (1) image features used for retrieval are often optimized based on ground-truth (GT) captions, which represent the image from a specific perspective and are influenced by annotator biases, and (2) they underutilize the full potential of retrieved text, typically relying on raw captions or parsed objects, which fail to capture the full semantic richness of the data. In this paper, we propose Dive Into Retrieval (DIR), a method designed to enhance both the image-to-text retrieval process and the utilization of retrieved text to achieve a more comprehensive understanding of the visual content. Our approach introduces two key innovations: (1) diffusion-guided retrieval enhancement, where a pretrained diffusion model guides image feature learning by reconstructing noisy images, allowing the model to capture more comprehensive and fine-grained visual information beyond standard annotated captions; and (2) a high-quality retrieval database, which provides comprehensive semantic information to enhance caption generation, especially in out-of-domain scenarios. Extensive experiments demonstrate that DIR not only maintains competitive in-domain performance but also significantly improves out-of-domain generalization, all without increasing inference costs.
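The image-to-text retrieval step can be sketched as a top-$n$ nearest-neighbor lookup. This is a simplified illustration, assuming cosine similarity between an image feature and precomputed text features. The three-dimensional vectors, the database entries, and the function names are hypothetical; the actual system uses EVA-CLIP-aligned features and a Q-Former, which are not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_n(query, database, n):
    """Return the n database texts whose features are most similar to the query."""
    ranked = sorted(database, key=lambda entry: cosine(query, entry[1]), reverse=True)
    return [text for text, _ in ranked[:n]]


# Hypothetical retrieval database covering objects, actions, and environments.
database = [
    ("dog", [1.0, 0.0, 0.0]),       # object
    ("running", [0.8, 0.6, 0.0]),   # action
    ("beach", [0.0, 1.0, 0.0]),     # environment
    ("kitchen", [0.0, 0.0, 1.0]),   # environment
]

image_feature = [0.9, 0.4, 0.0]     # made-up image feature used as the query
print(retrieve_top_n(image_feature, database, 2))
```

The retrieved texts would then be fused with the image features before caption generation; the value of `n` is a tunable choice.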

Paper Structure

This paper contains 25 sections, 5 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Comparison of our method with previous retrieval-augmented image captioning approaches. (a) Diffusion-Guided Retrieval Enhancement: An image can be described from multiple valid perspectives. Previous methods optimize retrieval features to predict only GT captions, ignoring other perspectives (highlighted in the left purple dotted box). In contrast, our approach leverages additional diffusion guidance to ensure that image features capture both GT captions and the inherent content of the image, enabling the inclusion of alternative descriptive perspectives. (b) High-Quality Retrieval Database: Previous methods often rely on raw captions or parsed objects, which are either verbose or overly simplistic. Our approach uses a diverse retrieval text database that captures a broader range of image aspects, including objects, actions, and environments, leading to more contextually rich and accurate captions. Please see Figure 3 and Figure 4 for examples illustrating the results with and without these methods.
  • Figure 2: The architecture of our retrieval-augmented image captioning framework employs an image encoder and Q-Former from BLIP2, guided by a pretrained text-to-image diffusion model, to extract comprehensive image features for retrieval. The retrieved text features are fused with the image features through a Text Q-Former, and the combined features are then passed to an LLM for final caption generation.
  • Figure 3: Qualitative comparison of model performance with and without diffusion guidance. The relevant retrieval results are highlighted in bold and red.
  • Figure 4: Qualitative comparison of model predictions using EVCap's retrieval database and our proposed retrieval database. The relevant retrieval results are highlighted in bold and red.
  • Figure 5: Comparison of model performance with varying top-$n$ values on the CIDEr metric across COCO, Flickr30k, and the out-of-domain subset of NoCaps.
  • ...and 2 more figures