Feedback-Enhanced Hallucination-Resistant Vision-Language Model for Real-Time Scene Understanding
Zahir Alsulaimawi
TL;DR
This work tackles hallucination in real-time vision-language systems by introducing a feedback-enhanced framework that couples YOLOv5 object detection with VILA1.5-3B language generation, guided by dynamic confidence threshold updates $\tau_{t+1} = \tau_t + \lambda (h_t - h_{target})$. The authors provide a formal problem definition, stability analysis, and practical design, achieving a 37% reduction in hallucinations while maintaining detection performance and real-time throughput (≈18 FPS). Empirical results across COCO, PASCAL VOC, and real-time video demonstrate improved scene coherence and robust grounding, supported by ablation studies showing synergistic benefits from adaptive thresholds and structured prompts. The approach offers a scalable pathway to safer, real-time multimodal perception in robotics, security, and assistive technologies.
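The threshold update above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes that $h_t$ is a hallucination rate measured over a recent window, that $h_{target}$ is the desired rate, and that $\lambda$ is a positive step size. The function name `update_threshold` and the clamping of $\tau$ to $[0, 1]$ are hypothetical additions for the example and are not stated in the source.

```python
def update_threshold(tau, h_t, h_target, lam, lo=0.0, hi=1.0):
    """Apply one step of tau_{t+1} = tau_t + lam * (h_t - h_target).

    Assumptions (not from the source): h_t and h_target are rates in [0, 1],
    lam is a positive step size, and the result is clamped to [lo, hi] so it
    remains a valid confidence threshold.
    """
    tau_next = tau + lam * (h_t - h_target)
    # Clamping is an illustrative safeguard, not part of the stated rule.
    return min(hi, max(lo, tau_next))


if __name__ == "__main__":
    tau = 0.5
    # Hypothetical observed hallucination rates over successive windows.
    for h in (0.30, 0.20, 0.10):
        tau = update_threshold(tau, h, h_target=0.10, lam=0.5)
        print(f"h_t={h:.2f} -> tau={tau:.3f}")
```

Under these assumptions, the threshold rises while the observed rate exceeds the target, which makes the system stricter, and it stops changing once the observed rate matches the target.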
Abstract
Real-time scene comprehension is a key capability in artificial intelligence, supporting robotics, surveillance, and assistive tools. However, hallucination remains a challenge: AI systems often misinterpret visual inputs, detecting nonexistent objects or describing events that never happened. These errors are not minor; they threaten reliability in critical areas such as security and autonomous navigation, where accuracy is essential. Our approach tackles this by having the system monitor its own outputs. Instead of trusting initial outputs, our framework continuously assesses them in real time and adjusts confidence thresholds dynamically. When the confidence behind a claim falls below the current threshold, the framework suppresses that claim as unreliable. By combining YOLOv5's object detection with VILA1.5-3B's controlled language generation, we tie descriptions to confirmed visual data. Strengths include dynamic threshold tuning for better accuracy, evidence-based text to reduce hallucination, and real-time performance at 18 frames per second. This feedback-driven design cuts hallucination by 37 percent relative to traditional methods. Fast, flexible, and reliable, it suits applications from robotic navigation to security monitoring, aligning AI perception with reality.
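The idea of suppressing low-confidence claims and tying the description to confirmed detections can be illustrated with a small sketch. It is not the paper's pipeline. The `Detection` type, the helper names `filter_detections` and `build_grounded_prompt`, and the prompt wording are all hypothetical, and the detector and language model are left out so the example stays self-contained.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical container for one detector output."""
    label: str
    confidence: float


def filter_detections(detections, tau):
    """Keep only detections whose confidence is at least the threshold tau."""
    return [d for d in detections if d.confidence >= tau]


def build_grounded_prompt(detections):
    """Build a prompt that restricts the description to confirmed objects.

    The wording is an illustrative assumption, not the authors' prompt.
    """
    if not detections:
        return ("No objects were confidently detected. "
                "State that the scene content is uncertain.")
    labels = ", ".join(sorted({d.label for d in detections}))
    return (f"Describe the scene using only these detected objects: {labels}. "
            "Do not mention any other objects.")


if __name__ == "__main__":
    raw = [Detection("person", 0.91), Detection("dog", 0.32)]
    kept = filter_detections(raw, tau=0.5)
    print(build_grounded_prompt(kept))
```

In a full system, the prompt would be passed to the language model together with the frame, and the threshold `tau` would be supplied by the feedback loop rather than fixed.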
