Towards Retrieval-Augmented Architectures for Image Captioning

Sara Sarto; Marcella Cornia; Lorenzo Baraldi; Alessandro Nicolosi; Rita Cucchiara

TL;DR

This work tackles image captioning by decoupling memorization from generation through an external knowledge memory. It introduces two fully-attentive, retrieval-augmented Transformer architectures, RA-T$^\mathcal{S}$ and RA-T$^\mathcal{X}$, that condition caption generation on text retrieved via kNN search over a CLIP-based visual embedding space. The paper demonstrates that larger, higher-quality retrieval corpora (e.g., CC3M with BLIP-generated captions) substantially boost CIDEr scores on COCO and nocaps, with gains sustained after CIDEr-based reinforcement learning. Practically, the results suggest scalable improvements in caption quality by integrating external memory without backpropagating through the memory, enabling richer semantic grounding and better generalization to novel objects.

Abstract

The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have leveraged deep learning-based models and made advances in the extraction of visual features and the design of multimodal connections to tackle this task. This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process. Specifically, we propose two model variants that incorporate a knowledge retriever component based on visual similarities, a differentiable encoder to represent input images, and a kNN-augmented language model to predict tokens based on contextual cues and text retrieved from the external memory. We experimentally validate our approach on the COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions, especially with a larger retrieval corpus. This work provides valuable insights into retrieval-augmented captioning models and opens up new avenues for improving image captioning at a larger scale.
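The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes the external memory is a list of (image embedding, caption) pairs and that visual similarity is cosine similarity between embeddings. The function names, the toy three-dimensional vectors, and the brute-force search are illustrative assumptions, not the authors' implementation, which relies on CLIP features and a much larger corpus.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def retrieve_captions(
    query: Sequence[float],
    memory: Sequence[Tuple[Sequence[float], str]],
    k: int = 5,
) -> List[Tuple[float, str]]:
    """Return (similarity, caption) for the k memory images closest to the query.

    Hypothetical helper: brute-force cosine-similarity kNN over image embeddings.
    """
    q = l2_normalize(query)
    scored = []
    for embedding, caption in memory:
        m = l2_normalize(embedding)
        similarity = sum(a * b for a, b in zip(q, m))
        scored.append((similarity, caption))
    # Highest similarity first; keep only the k nearest neighbours.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy memory: stand-ins for image embeddings paired with their captions.
memory = [
    ([1.0, 0.0, 0.0], "a dog running on the grass"),
    ([0.9, 0.1, 0.0], "a puppy plays in a park"),
    ([0.0, 1.0, 0.0], "a plate of pasta on a table"),
    ([0.0, 0.0, 1.0], "a train at a station"),
]
query = [1.0, 0.05, 0.0]  # stand-in for the embedding of the input image
top_two = retrieve_captions(query, memory, k=2)
```

Because the search only reads the memory, no gradient flows through it, which matches the summary's point that the memory is used without backpropagation. At the scale of a corpus such as CC3M, an approximate nearest-neighbour index would normally replace the linear scan.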

Paper Structure (18 sections, 10 equations, 5 figures, 5 tables)

Introduction
Related Work
Proposed Method
Preliminaries
External Memory and Knowledge Retrieval
Designing Retrieval-Augmented Language Models
Visual Encoder
Textual Decoder
Retrieval-Augmented Generation
RA-T$^\mathcal{S}$
RA-T$^\mathcal{X}$
Training Protocol
Experiments
Experimental Setup
Quality of Nearest Neighbor Captions
...and 3 more sections

Figures (5)

Figure 1: Comparison between a standard captioner (left) and the proposed retrieval-augmented captioning schema (right), in which an external memory is employed to condition the generation process.
Figure 2: Schema of the proposed knowledge retriever component (see Fig. 3 for architectural details of the language models). Given an input image, visual features are extracted using a CLIP-based image encoder. These features are then used to retrieve a set of similar textual sentences, starting from the corresponding image representations, that are employed as additional knowledge during the generation of the caption.
Figure 3: Architectural schema of the RA-T$^\mathcal{S}$ (self-attention-based) and RA-T$^\mathcal{X}$ (cross-attention-based) language models. In RA-T$^\mathcal{S}$, retrieved captions are employed as a prefix of the decoder textual sequence, after removing stop words and duplicate words. In RA-T$^\mathcal{X}$, instead, retrieved captions are first passed through a Transformer encoder and then used in a $k$NN cross-attention layer inside the captioner decoder. The contribution of retrieved captions is regulated by a learnable gating mechanism that combines the output of the $k$NN cross-attention layer with that of the standard self-attention over the input sequence.
Figure 4: Generated captions on sample images from the COCO dataset, along with five retrieved captions.
Figure 5: Qualitative results on sample images from nocaps, comparing captions generated by our model with those generated by a standard Transformer without retrieval.
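
The prefix construction mentioned in the Figure 3 caption for RA-T$^\mathcal{S}$ (retrieved captions with stop words and duplicate words removed) might look like the sketch below. The stop-word list, the lowercasing, the whitespace tokenization, and the order-preserving deduplication are all assumptions made for illustration; the caption does not specify these details, and `build_prefix` is a hypothetical name.

```python
from typing import Iterable, List

# Assumed, deliberately tiny stop-word list; the original list is not given.
STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "on", "in", "at", "with", "and", "is", "are", "to"}
)


def build_prefix(captions: Iterable[str]) -> List[str]:
    """Merge retrieved captions into one word sequence for the decoder prefix.

    Stop words are dropped and each remaining word is kept only the first
    time it appears, preserving the order of the retrieved captions.
    """
    seen = set()
    prefix = []
    for caption in captions:
        for word in caption.lower().split():
            if word in STOP_WORDS or word in seen:
                continue
            seen.add(word)
            prefix.append(word)
    return prefix


retrieved = ["A dog running on the grass", "A dog plays in the park"]
prefix = build_prefix(retrieved)
print(prefix)  # ['dog', 'running', 'grass', 'plays', 'park']
```

In a full model these words would be tokenized with the captioner's vocabulary and placed before the caption tokens, so that the decoder's self-attention can attend to them while generating.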
