UniRAG: Universal Retrieval Augmentation for Large Vision Language Models
Sahel Sharifymoghaddam, Shivani Upadhyay, Wenhu Chen, Jimmy Lin
TL;DR
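UniRAG proposes a model-agnostic retrieval augmentation method that improves the generation quality of large vision language models by adding relevant cross-modal retrieved samples to the prompt as few-shot examples at inference time. It uses UniIR vision-language retrievers to retrieve cross-modal image-text pairs and injects these pairs as prompt examples during generation, improving performance on image captioning and text-to-image generation. Experiments on the MSCOCO and Fashion200k datasets, covering a range of LVLMs (both open-source and proprietary), show significant gains on several metrics (such as SPICE, FID, and CLIP-score), demonstrating that inference driven by retrieved examples is broadly applicable to multimodal tasks. The study also discusses limitations around licensing, language coverage, latency, and safety, and notes that future work could extend to non-English data and to more accountability-oriented evaluation.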
Abstract
Recently, Large Vision Language Models (LVLMs) have unlocked many complex use cases that require Multi-Modal (MM) understanding (e.g., image captioning or visual question answering) and MM generation (e.g., text-guided image generation or editing) capabilities. To further improve the output fidelity of LVLMs, we introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference. Contrary to the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation results on the MSCOCO dataset with common entities show that both proprietary models like GPT-4o and Gemini-Pro and smaller open-source models like LLaVA, LaVIT, and Emu2 significantly enhance their generation quality when their input prompts are augmented with relevant information retrieved by Vision-Language (VL) retrievers like UniIR models. All the necessary code to reproduce our results is available at https://github.com/castorini/UniRAG.
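
To make the idea of adding retrieved pairs to a prompt as few-shot examples concrete, the following is a minimal sketch for the image captioning direction. It is not the UniRAG implementation: the `Candidate` type, the `top_k` and `build_caption_prompt` helpers, the `<image_id>` placeholder syntax, and the instruction wording are all hypothetical, and the similarity scores stand in for whatever a VL retriever such as a UniIR model would return.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    """A retrieved image-caption pair (hypothetical structure)."""
    image_id: str   # placeholder identifier for a retrieved image
    caption: str    # caption paired with that image
    score: float    # assumed retriever similarity score (higher is better)


def top_k(candidates: Sequence[Candidate], k: int) -> List[Candidate]:
    """Keep the k highest-scoring candidates, best first."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:k]


def build_caption_prompt(query_image_id: str, examples: Sequence[Candidate]) -> str:
    """Assemble a captioning prompt with retrieved pairs as in-context examples.

    With no examples this reduces to a zero-shot prompt.
    """
    lines = ["Write a one-sentence caption for the final image."]
    for ex in examples:
        lines.append(f"Image: <{ex.image_id}>")
        lines.append(f"Caption: {ex.caption}")
    # The query image comes last, with its caption left for the model to fill in.
    lines.append(f"Image: <{query_image_id}>")
    lines.append("Caption:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy candidate pool; in practice these would come from a retriever.
    pool = [
        Candidate("img_a", "A dog runs along a beach.", 0.91),
        Candidate("img_b", "A bowl of fruit on a table.", 0.22),
        Candidate("img_c", "Two dogs play in the sand.", 0.78),
    ]
    print(build_caption_prompt("query_img", top_k(pool, k=2)))
```

Because the augmentation only changes the prompt, the same assembled input could in principle be passed to any LVLM that accepts interleaved images and text, which is what makes the approach plug-and-play. How each specific model expects images to be embedded in its prompt is model-dependent and is not covered by this sketch.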
