Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning

Wenyan Li; Jiaang Li; Rita Ramos; Raphael Tang; Desmond Elliott

TL;DR

This paper investigates how robust the retrieval-augmented image captioning model SmallCap is to its retrieved content. It analyzes how the order of retrieved captions and the relevance of their content affect generation, and introduces a majority-token perspective to explain why the model copies frequently occurring tokens from retrieved captions. The authors show that SmallCap is order-robust but content-sensitive, with a strong tendency to copy majority tokens into generated captions, and they support this with input attribution and attention analyses. To mitigate this bias, they propose sampling retrieved captions from larger candidate lists during training (sample-$k$ and c-sample-$k$), which improves in-domain and cross-domain performance, including on NoCaps and VizWiz, and reduces reliance on the top-$k$ captions. The work highlights practical implications for robustness in retrieval-augmented captioning and suggests future directions such as token dropping and prefix-tuning to further improve resilience to retrieval noise.
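The contrast between always using the top-$k$ captions and sampling $k$ captions from a larger candidate list can be illustrated with a minimal sketch. It assumes uniform sampling without replacement from a ranked list. The function names, the example captions, and the list size are hypothetical, and the exact sampling scheme used in the paper may differ. c-sample-$k$ is not sketched because its definition is not given here.

```python
import random
from typing import List, Optional


def top_k(candidates: List[str], k: int) -> List[str]:
    """Baseline: always take the k highest-ranked retrieved captions."""
    return candidates[:k]


def sample_k(candidates: List[str], k: int,
             rng: Optional[random.Random] = None) -> List[str]:
    """Assumed sample-k variant: draw k captions uniformly, without
    replacement, from the whole ranked candidate list."""
    if k > len(candidates):
        raise ValueError("k cannot exceed the number of candidates")
    rng = rng or random.Random()
    return rng.sample(candidates, k)


# Hypothetical ranked retrieval results for one training image.
candidates = [
    "an elephant standing in tall grass",
    "a large elephant walks near a river",
    "two zebras grazing beside an elephant",
    "a giraffe eating leaves from a tree",
    "a herd of animals on a dry plain",
    "a zebra drinking at a watering hole",
    "a safari truck parked near some trees",
]

fixed = top_k(candidates, 4)                       # same 4 captions every epoch
sampled = sample_k(candidates, 4, random.Random(0))  # varies with the RNG state
```

Under this assumption, the prompt seen during training changes from step to step, so a token shared by the top-ranked captions is no longer guaranteed to be present in every training prompt.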

Abstract

Recent advances in retrieval-augmented models for image captioning highlight the benefit of retrieving related captions for efficient, lightweight models with strong domain-transfer capabilities. While these models demonstrate the success of retrieval augmentation, retrieval models are still far from perfect in practice: the retrieved information can sometimes mislead the model, resulting in incorrect generation and worse performance. In this paper, we analyze the robustness of a retrieval-augmented captioning model, SmallCap. Our analysis shows that the model is sensitive to tokens that appear in the majority of the retrieved captions, and the input attribution shows that those tokens are likely to be copied into the generated output. Given these findings, we propose to train the model by sampling retrieved captions from more diverse sets. This decreases the chance that the model learns to copy majority tokens, and improves both in-domain and cross-domain performance.
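The notion of tokens that appear in the majority of the retrieved captions can be made concrete with a short sketch. It assumes lowercased whitespace tokenization and a strict majority threshold (more than half of the captions). Both choices, the function name, and the example captions are illustrative assumptions, not details confirmed by the text above.

```python
from collections import Counter
from typing import List, Set


def majority_tokens(captions: List[str]) -> Set[str]:
    """Return tokens that occur in more than half of the captions.

    Each caption contributes a token at most once, so the count is the
    number of captions containing the token, not its total frequency.
    """
    caption_counts = Counter()
    for caption in captions:
        # Assumed tokenization: lowercase, split on whitespace.
        caption_counts.update(set(caption.lower().split()))
    n = len(captions)
    return {tok for tok, count in caption_counts.items() if count * 2 > n}


# Hypothetical retrieved captions: "elephant" appears in 3 of 4.
retrieved = [
    "an elephant standing in tall grass",
    "a large elephant walks near a river",
    "two zebras grazing beside an elephant",
    "a giraffe eating leaves from a tree",
]

print(majority_tokens(retrieved))  # {'elephant'}
```

With this simple tokenization, function words such as "a" would also count as majority tokens once they appear in more than half of the captions, so a practical analysis would likely filter stop words first.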

Paper Structure (30 sections, 2 equations, 12 figures, 10 tables)

This paper contains 30 sections, 2 equations, 12 figures, 10 tables.

Introduction
Related Work
Retrieval-augmented image captioning.
Robustness of Retrieval-Augmented Image Captioning
Robustness Evaluation
Experimental Setup
Order Robust but Content Sensitive
Order robust.
Content sensitive.
Majority Tokens Explain Behavior
Majority Tokens
Experimental Setup
Results.
Input Attribution with Integrated Gradients
Attention and Model Behavior
...and 15 more sections

Figures (12)

Figure 1: Comparison of generated image captions that are predicted without retrieval, misled by retrieval, and predicted with a more retrieval-robust model. The retrieval-augmented model generates the token "elephant", which appears in 3/4 of the retrieved captions.
Figure 2: CIDEr evaluation of SmallCap on the COCO validation set using the top-$k$, low(er)-ranked, randomly retrieved captions, against a baseline without retrieval augmentation. Performance drops by up to 50% when using randomly retrieved captions compared to baseline, suggesting that the model is not robust.
Figure 3: Input attribution for each generated token (y-axis). The brighter the color, the greater the attribution from the input token. We observe high attribution scores to "umbrella", "boy", "cattle", and "over".
Figure 4: Pairwise average attribution score between retrieved and generated tokens in the 2B1G setup. MT: majority tokens in the retrieved captions. OT: all other tokens. The larger pairwise attribution values show that the majority tokens have a larger impact during generation than the other tokens in the retrieved captions.
Figure 5: Distribution of maximum attention scores across layers and heads for self-attention and cross-attention. $XA$ denotes cross-attention and $SA$ denotes self-attention. $img$ is the distribution of maximum attention scores over image patches, and $text$ is the distribution of maximum attention scores over text tokens.
...and 7 more figures
