Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes

Antonio Carlos Rivera, Anthony Moore, Steven Robinson

TL;DR

The paper addresses object-aware reasoning in vision-language tasks by introducing VRAP, a purely LLM-driven framework that augments prompts with retrieval-augmented object tags derived from visual encoders and scene graph parsers. VRAP replaces runtime multimodal retrieval with enriched textual prompts prepared offline, which enables detailed reasoning about objects, attributes, and relationships while reducing latency. Trained with multi-task generative, contrastive, and auxiliary losses, VRAP achieves state-of-the-art performance on VQAv2, GQA, VizWiz, and COCO, and ablation analyses highlight the crucial role of retrieval-augmented tags and tag relevance learning. Human evaluations confirm improved accuracy, relevance, and detail. The framework is also robust to unseen objects, scales favorably with larger datasets, and reduces inference time by 40%, which supports efficient, interpretable multimodal reasoning in practice.

Abstract

Object-aware reasoning in vision-language tasks poses significant challenges for current models, particularly in handling unseen objects, reducing hallucinations, and capturing fine-grained relationships in complex visual scenes. To address these limitations, we propose the Vision-Aware Retrieval-Augmented Prompting (VRAP) framework, a generative approach that enhances Large Vision-Language Models (LVLMs) by integrating retrieval-augmented object tags into their prompts. VRAP introduces a novel pipeline where structured tags, including objects, attributes, and relationships, are extracted using pretrained visual encoders and scene graph parsers. These tags are enriched with external knowledge and incorporated into the LLM's input, enabling detailed and accurate reasoning. We evaluate VRAP across multiple vision-language benchmarks, including VQAv2, GQA, VizWiz, and COCO, achieving state-of-the-art performance in fine-grained reasoning and multimodal understanding. Additionally, our ablation studies highlight the importance of retrieval-augmented tags and contrastive learning, while human evaluations confirm VRAP's ability to generate accurate, detailed, and contextually relevant responses. Notably, VRAP achieves a 40% reduction in inference latency by eliminating runtime retrieval. These results demonstrate that VRAP is a robust and efficient framework for advancing object-aware multimodal reasoning.
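
As a rough illustration of the prompting idea described in the abstract, the sketch below turns structured tags (objects, attributes, relationships) into an enriched text prompt. It is a minimal sketch under stated assumptions, not the authors' implementation: the `ObjectTag` schema, the `enrich` and `build_prompt` helpers, the prompt layout, and the in-memory `knowledge` dictionary standing in for offline-retrieved external knowledge are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectTag:
    """One detected object with attributes and relationships (hypothetical schema)."""
    name: str
    attributes: List[str] = field(default_factory=list)
    # Each relation is (predicate, other object), e.g. ("on", "sofa").
    relations: List[Tuple[str, str]] = field(default_factory=list)


def enrich(tag: ObjectTag, knowledge: Dict[str, str]) -> str:
    """Render a tag as one line of text, appending a pre-retrieved fact if one exists."""
    parts = [" ".join(tag.attributes + [tag.name])]
    parts += [f"{predicate} {other}" for predicate, other in tag.relations]
    line = ", ".join(parts)
    fact = knowledge.get(tag.name)  # offline lookup; no retrieval at inference time
    return f"{line} ({fact})" if fact else line


def build_prompt(question: str, tags: List[ObjectTag], knowledge: Dict[str, str]) -> str:
    """Assemble the tag-augmented prompt (the layout is an assumption)."""
    tag_lines = "\n".join(f"- {enrich(tag, knowledge)}" for tag in tags)
    return f"Scene tags:\n{tag_lines}\nQuestion: {question}\nAnswer:"


# Toy inputs; in the paper's pipeline the tags would come from a visual
# encoder and a scene graph parser rather than being written by hand.
tags = [
    ObjectTag("dog", ["brown"], [("on", "sofa")]),
    ObjectTag("sofa", ["red"]),
]
knowledge = {"dog": "domesticated mammal"}
prompt = build_prompt("What is the dog sitting on?", tags, knowledge)
print(prompt)
```

Because the enrichment is a plain dictionary lookup over text prepared ahead of time, nothing is retrieved while the model answers, which is the property the abstract credits for the latency reduction.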

Paper Structure

This paper contains 30 sections, 9 equations, 7 tables.