UniRAG: Universal Retrieval Augmentation for Large Vision Language Models

Sahel Sharifymoghaddam; Shivani Upadhyay; Wenhu Chen; Jimmy Lin

TL;DR

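UniRAG proposes a model-agnostic retrieval augmentation method that improves the generation quality of large vision language models by adding relevant retrieved cross-modal samples to the prompt as few-shot examples at inference time. It uses UniIR vision-language retrievers to retrieve image-text pairs and injects these pairs into the prompt as examples during generation, improving both image captioning and text-to-image generation. Experiments on the MSCOCO and Fashion200k datasets, covering several LVLMs (both open-source and proprietary), show significant gains on multiple metrics (such as SPICE, FID, and CLIP-score), indicating that inference driven by retrieved examples is broadly applicable to multimodal tasks. The work also discusses limitations related to licensing, language coverage, latency, and safety, and notes that future work could extend to non-English data and to more responsible evaluation.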

Abstract

Recently, Large Vision Language Models (LVLMs) have unlocked many complex use cases that require Multi-Modal (MM) understanding (e.g., image captioning or visual question answering) and MM generation (e.g., text-guided image generation or editing) capabilities. To further improve the output fidelity of LVLMs, we introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference. Contrary to the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation results on the MSCOCO dataset with common entities show that both proprietary models like GPT-4o and Gemini-Pro and smaller open-source models like LLaVA, LaVIT, and Emu2 significantly enhance their generation quality when their input prompts are augmented with relevant information retrieved by Vision-Language (VL) retrievers like UniIR models. All the necessary code to reproduce our results is available at https://github.com/castorini/UniRAG.
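The retrieve-then-prompt idea described in the abstract can be illustrated with a minimal sketch for the captioning direction. This is not the code from the UniRAG repository and does not call UniIR or any LVLM: the `Candidate` class, the `retrieve` and `build_caption_prompt` functions, the toy three-dimensional embeddings, and the prompt wording are all hypothetical stand-ins chosen for illustration. A plain cosine-similarity ranking over hand-written vectors takes the place of a real vision-language retriever.

```python
from dataclasses import dataclass
import math


@dataclass
class Candidate:
    """A hypothetical image-text pair in the retrieval pool."""
    image_id: str
    caption: str
    embedding: list  # toy vector standing in for a retriever embedding


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_embedding, pool, k):
    """Return the k pool entries most similar to the query (assumed retriever)."""
    ranked = sorted(
        pool,
        key=lambda c: cosine(query_embedding, c.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_caption_prompt(query_image_id, examples):
    """Place retrieved pairs before the query image as few-shot examples."""
    lines = ["Write a short caption for the last image."]
    for ex in examples:
        lines.append(f"Image: <{ex.image_id}>")
        lines.append(f"Caption: {ex.caption}")
    lines.append(f"Image: <{query_image_id}>")
    lines.append("Caption:")
    return "\n".join(lines)


# Toy pool; in practice the candidates would come from a dataset index.
pool = [
    Candidate("img_dog", "A dog runs across a grassy field.", [1.0, 0.0, 0.1]),
    Candidate("img_cat", "A cat sleeps on a sofa.", [0.0, 1.0, 0.0]),
    Candidate("img_puppy", "A puppy plays with a ball.", [0.9, 0.1, 0.2]),
]
query_embedding = [1.0, 0.05, 0.1]  # assumed embedding of the query image

examples = retrieve(query_embedding, pool, k=2)
prompt = build_caption_prompt("img_query", examples)
print(prompt)
```

With k set to 0 the same function yields a zero-shot prompt, so the augmented and baseline settings differ only in the examples placed ahead of the query.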

Paper Structure (22 sections, 6 figures, 11 tables)

This paper contains 22 sections, 6 figures, 11 tables.

Introduction
Related Work
Retrieval Augmentation with Generative Models
Vision-Language Models and Retrieval Augmentation
Methodology
Retrieval
Generation
Experimental Setup
Selected Models
Datasets
Configuration Details
Evaluation Results and Analysis
Caption Generation
Image Generation
Effect of Sampling
...and 7 more sections

Figures (6)

Figure 1: An Overview of the UniRAG technique with image captioning (blue) and image generation (green) tasks. UniRAG retrieves relevant image-text pairs and adds them as few-shot examples to the LVLM's input prompt.
Figure 2: Zero-shot prompting for caption generation.
Figure 3: Few-shot prompting for caption generation with LLaVA.
Figure 4: Few-shot prompting for caption generation with Gemini-Pro.
Figure 5: Few-shot prompting for caption generation with GPT-4o.
...and 1 more figure
