ResNetVLLM-2: Addressing ResNetVLLM's Multi-Modal Hallucinations

Ahmad Khalil; Mahmoud Khalil; Alioune Ngom

TL;DR

The paper tackles multi-modal hallucination in video-language models, focusing on ResNetVLLM, with a two-step approach: a faithfulness detection module that adapts Lynx to video and assesses the semantic alignment of generated captions with ground-truth captions, and a hallucination mitigation module that uses Retrieval-Augmented Generation (RAG) with an ad-hoc knowledge base built from ResNetVLLM projections. Together, the two modules cross-check generated captions against external, contextually relevant evidence and revise outputs to improve factual grounding. On ActivityNet-QA, accuracy rises from 54.8% to 65.3%, and faithfulness scores on ActivityNet Captions rise from 34.2% to 97.9%, indicating a substantial reduction in multi-modal hallucinations. The framework offers a scalable, generalizable path to more reliable video-language systems, with potential to extend RAG grounding to multi-turn video dialogues and to improve temporal alignment.

Abstract

Large Language Models (LLMs) have transformed natural language processing (NLP) tasks, but they suffer from hallucination, generating plausible yet factually incorrect content. This issue extends to Video-Language Models (VideoLLMs), where textual descriptions may inaccurately represent visual content, resulting in multi-modal hallucinations. In this paper, we address hallucination in ResNetVLLM, a video-language model combining ResNet visual encoders with LLMs. We introduce a two-step protocol: (1) a faithfulness detection strategy that uses a modified Lynx model to assess semantic alignment between generated captions and ground-truth video references, and (2) a hallucination mitigation strategy using Retrieval-Augmented Generation (RAG) with an ad-hoc knowledge base dynamically constructed during inference. Our enhanced model, ResNetVLLM-2, reduces multi-modal hallucinations by cross-verifying generated content against external knowledge, improving factual consistency. Evaluation on the ActivityNet-QA benchmark demonstrates a substantial accuracy increase from 54.8% to 65.3%, highlighting the effectiveness of our hallucination detection and mitigation strategies in enhancing video-language model reliability.
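The two-step protocol described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, token overlap stands in for the modified Lynx scorer, a list of strings stands in for the ad-hoc knowledge base, and the `revise` callback stands in for the language model that rewrites a caption. It is not the authors' implementation.

```python
from typing import Callable, List


def faithfulness_score(caption: str, reference: str) -> float:
    """Toy stand-in for a faithfulness scorer (NOT Lynx):
    Jaccard overlap between the lowercase token sets of the two strings."""
    cap_tokens = set(caption.lower().split())
    ref_tokens = set(reference.lower().split())
    union = cap_tokens | ref_tokens
    return len(cap_tokens & ref_tokens) / len(union) if union else 0.0


def retrieve(query: str, knowledge_base: List[str], k: int = 1) -> List[str]:
    """Return the k knowledge-base entries most similar to the query,
    using the same toy overlap measure as the retrieval score."""
    ranked = sorted(
        knowledge_base,
        key=lambda entry: faithfulness_score(query, entry),
        reverse=True,
    )
    return ranked[:k]


def mitigate(
    caption: str,
    knowledge_base: List[str],
    revise: Callable[[str, List[str]], str],
) -> str:
    """Step 2 (hypothetical): retrieve evidence for the caption and let a
    caller-supplied function revise the caption in light of that evidence."""
    evidence = retrieve(caption, knowledge_base)
    return revise(caption, evidence)


# Illustrative data only; none of these strings come from the paper.
caption = "a man rides a horse on the beach"
reference = "a woman rides a bicycle on the beach"
knowledge_base = [
    "a woman rides a bicycle on the beach",
    "a dog runs in a park",
]

# Simplest possible reviser: adopt the top retrieved evidence if any exists.
revised = mitigate(
    caption,
    knowledge_base,
    lambda cap, evidence: evidence[0] if evidence else cap,
)

print(faithfulness_score(caption, reference))  # step 1, before mitigation
print(faithfulness_score(revised, reference))  # step 1, after mitigation
```

Keeping scoring and mitigation as separate functions mirrors the abstract's split between detection and mitigation: the score is computed against a ground-truth reference, while the revision step consults only the retrieved evidence.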
