Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering
Junxiao Xue, Quan Deng, Fei Yu, Yanhao Wang, Jun Wang, Yuehua Li
TL;DR
This paper tackles the difficulty of accurate object counting, localization, and spatial reasoning in VQA by introducing a multimodal RAG-LLM framework that constructs structured scene graphs from images, stores them as semantic chunks in a vector database, and retrieves relevant context to inform a semantic-enhanced prompt. An LLM-based VQA module (Qwen-2-72B-Instruct) then generates answers grounded in both the image-derived structure and retrieved data, reducing hallucination and improving precision. The approach yields substantial gains over strong baselines on VG-150 and AUG, particularly in per-category counts, absolute locations, and relationships, across both first-person and aerial imagery. These results underscore the method’s practical potential for complex multimodal tasks in robotics, remote sensing, and IoT where fine-grained visual reasoning is essential.
Abstract
Multimodal large language models (MLLMs), such as GPT-4o, Gemini, LLaVA, and Flamingo, have made significant progress in integrating visual and textual modalities, excelling in tasks like visual question answering (VQA), image captioning, and content retrieval. They can generate coherent and contextually relevant descriptions of images. However, they still struggle to accurately identify and count objects and to determine their spatial locations, particularly in complex scenes with overlapping or small objects. To address these limitations, we propose a novel framework based on multimodal retrieval-augmented generation (RAG), which introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images. Our framework improves the MLLM's capacity to handle tasks requiring precise visual descriptions, especially in scenarios with challenging perspectives, such as aerial views or scenes with dense object arrangements. We conduct extensive experiments on the VG-150 dataset, which focuses on first-person visual understanding, and the AUG dataset, which consists of aerial imagery. The results show that our approach consistently outperforms existing MLLMs on VQA tasks, particularly in recognizing, localizing, and counting objects across different spatial contexts, and that it produces more accurate visual descriptions.
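The pipeline summarized in the TL;DR (scene graph, semantic chunks, retrieval, enhanced prompt) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (SceneObject, scene_graph_to_chunks, retrieve, build_prompt) is hypothetical, the bounding-box format is assumed, and a toy bag-of-words similarity stands in for a real embedding model and vector database. The scene-graph generator and the call to the LLM are omitted; the scene graph is written by hand.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    box: tuple  # assumed format: (x_min, y_min, x_max, y_max) in pixels


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    # Toy bag-of-words vector; a stand-in for a learned embedding model.
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def scene_graph_to_chunks(objects, relations):
    """Turn a scene graph into short text chunks: counts, locations, relations."""
    chunks = []
    counts = Counter(o.name for o in objects)
    for name, n in sorted(counts.items()):
        chunks.append(f"count: there are {n} objects of category {name}")
    for i, o in enumerate(objects):
        chunks.append(f"location: {o.name} #{i} has bounding box {o.box}")
    for subj, predicate, obj in relations:
        chunks.append(
            f"relation: {objects[subj].name} #{subj} {predicate} "
            f"{objects[obj].name} #{obj}"
        )
    return chunks


def retrieve(question, chunks, k=3):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context):
    """Assemble a prompt that grounds the question in the retrieved chunks."""
    facts = "\n".join(f"- {c}" for c in context)
    return (
        f"Scene facts:\n{facts}\n\n"
        f"Question: {question}\n"
        "Answer using only the scene facts."
    )


# Hand-written scene graph: two cars and one person, one relation.
objects = [
    SceneObject("car", (10, 20, 110, 90)),
    SceneObject("car", (200, 40, 300, 120)),
    SceneObject("person", (150, 30, 180, 110)),
]
relations = [(2, "next to", 0)]  # (subject index, predicate, object index)

chunks = scene_graph_to_chunks(objects, relations)
question = "How many car objects are there?"
context = retrieve(question, chunks, k=2)
prompt = build_prompt(question, context)
print(prompt)
```

In this sketch the count, location, and relation chunks mirror the three answer types the TL;DR highlights (per-category counts, absolute locations, and relationships). The resulting prompt string would then be passed to the LLM-based VQA module, which the sketch leaves out.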
