Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions
Zhijie Tan, Yuzhi Li, Shengwei Meng, Xiang Yuan, Weiping Li, Tong Mo, Bingce Wang, Xu Chu
TL;DR
This work addresses Hallucinations on Object Attributes (HoOA) in Large Vision-Language Models (LVLMs) by introducing a dedicated benchmark and a mitigation pipeline that uses multiview visual prompts sampled from 3D reconstructions of a single image. The proposed Multiview Image Augmented VLM (MIAVLM) uses a Multiview Attributes Perceiver (MAP) to aggregate multiple visual prompts, neutralize input-order effects, and align visual cues with a frozen LLM, while negative instructions curb the bias towards "Yes" answers. Experiments on the HoOA benchmark show more attribute-consistent responses than the baselines and indicate that separate multiview inputs matter more than single-image or 9-in-1 concatenated inputs. The approach offers practical improvements for robust fine-grained attribute reasoning in LVLMs and highlights the role of multiview visual prompting and instruction design in reducing hallucinations.
Abstract
Current popular Large Vision-Language Models (LVLMs) suffer from Hallucinations on Object Attributes (HoOA), which lead them to misjudge fine-grained attributes in input images. Leveraging recent advances in 3D generation from a single image, this paper proposes a novel method to mitigate HoOA in LVLMs. The method uses multiview images sampled from the generated 3D representations as visual prompts for LVLMs, thereby providing additional visual information from other viewpoints. Furthermore, we observe that the input order of the multiview images significantly affects the performance of LVLMs. Consequently, we devise the Multiview Image Augmented VLM (MIAVLM), which incorporates a Multiview Attributes Perceiver (MAP) submodule that simultaneously eliminates the influence of input image order and aligns visual information from multiview images with Large Language Models (LLMs). In addition, we design and employ negative instructions to mitigate LVLMs' bias towards "Yes" responses. Comprehensive experiments demonstrate the effectiveness of our method.
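
The abstract does not spell out how MAP removes the dependence on input order, so the following is only an illustrative sketch of one common way to get order-invariant aggregation: a fixed set of query vectors cross-attends over the tokens of all views, and no view-index positional encoding is added. The function name `aggregate_views`, the tensor shapes, the choice of 9 views, and the random "features" are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_views(view_feats, queries):
    """Hypothetical order-invariant aggregator (not the paper's MAP).

    view_feats: (num_views, tokens_per_view, dim) features, one slab per view.
    queries:    (num_queries, dim) query vectors (learned in a real model).
    Returns:    (num_queries, dim) visual tokens for the language model.
    """
    num_views, tokens, dim = view_feats.shape
    # Flatten all views into one unordered bag of tokens; no view index is encoded.
    keys = view_feats.reshape(num_views * tokens, dim)
    scores = queries @ keys.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)
    # A weighted sum over the bag does not depend on the order of its elements.
    return weights @ keys

rng = np.random.default_rng(0)
views = rng.normal(size=(9, 4, 16))    # assumed: 9 views, 4 tokens each, dim 16
queries = rng.normal(size=(8, 16))     # assumed: 8 query vectors

out = aggregate_views(views, queries)
shuffled = aggregate_views(views[rng.permutation(9)], queries)
print(out.shape, np.allclose(out, shuffled))
```

Because the attention weights and the weighted sum are computed over an unordered set of tokens, shuffling the views changes the result only by floating-point rounding. A model that concatenates views with positional information would not have this property, which is the kind of order sensitivity the abstract reports for existing LVLMs.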
