Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning
Wenyan Li, Jiaang Li, Rita Ramos, Raphael Tang, Desmond Elliott
TL;DR
This paper investigates the robustness of the SmallCap retrieval-augmented image captioning model to retrieved content. It analyzes how the order of retrieved captions and the relevance of their content affect generation, and introduces a majority-token perspective to explain why models copy frequently occurring tokens from retrieved captions. The authors demonstrate that SmallCap is order-robust but content-sensitive, with a strong tendency to copy majority tokens into generated captions, and they validate this via input attribution and attention analyses. To mitigate this bias, they propose sampling retrieved captions from larger candidate lists during training (sample-$k$ and c-sample-$k$), which improves in-domain and cross-domain performance, including on NoCaps and VizWiz, and reduces reliance on the top-$k$ captions. The work highlights practical implications for robustness in retrieval-augmented captioning and suggests future directions such as token-dropping and prefix-tuning to further enhance resilience to retrieval noise.
Abstract
Recent advances in retrieval-augmented models for image captioning highlight the benefit of retrieving related captions for efficient, lightweight models with strong domain-transfer capabilities. While these models demonstrate the success of retrieval augmentation, retrieval models are still far from perfect in practice: the retrieved information can sometimes mislead the model, resulting in incorrect generation and worse performance. In this paper, we analyze the robustness of the retrieval-augmented captioning model SmallCap. Our analysis shows that the model is sensitive to tokens that appear in the majority of the retrieved captions, and input attribution shows that those tokens are likely to be copied into the generated output. Given these findings, we propose training the model on retrieved captions sampled from more diverse sets. This decreases the chance that the model learns to copy majority tokens, and improves both in-domain and cross-domain performance.
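
To make the two central ideas concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a "majority token" is one that occurs in more than half of the retrieved captions, that captions are tokenized by lowercased whitespace splitting, and that sampling draws $k$ captions uniformly at random from the top of a ranked retrieval list. The function names `majority_tokens` and `sample_k` and the `pool_size` parameter are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def majority_tokens(captions):
    """Return tokens that occur in more than half of the captions.

    Assumption: lowercased whitespace tokenization; a real system
    would use the captioning model's own tokenizer.
    """
    doc_freq = Counter()
    for caption in captions:
        # Count each token at most once per caption.
        doc_freq.update(set(caption.lower().split()))
    threshold = len(captions) / 2
    return {tok for tok, n in doc_freq.items() if n > threshold}


def sample_k(ranked_captions, k, pool_size, rng=None):
    """Sample k captions uniformly from the top `pool_size` retrieved ones.

    Assumption: uniform sampling without replacement from a larger
    ranked candidate list, instead of always taking the top k.
    """
    rng = rng or random.Random()
    pool = ranked_captions[:pool_size]
    return rng.sample(pool, min(k, len(pool)))


if __name__ == "__main__":
    retrieved = [
        "a dog runs on grass",
        "a dog sits",
        "a cat sleeps",
        "two birds fly",
    ]
    print(majority_tokens(retrieved))
    print(sample_k(retrieved, k=2, pool_size=3, rng=random.Random(0)))
```

Under these assumptions, drawing from a larger pool changes which captions the model sees for a given training image, so tokens shared by the highest-ranked captions appear less consistently in the prompt.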
