Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions

Zhijie Tan; Yuzhi Li; Shengwei Meng; Xiang Yuan; Weiping Li; Tong Mo; Bingce Wang; Xu Chu

TL;DR

This work addresses Hallucinations on Object Attributes (HoOA) in Large Vision-Language Models (LVLMs) by introducing a dedicated benchmark and a mitigation pipeline that uses multiview visual prompts sampled from 3D reconstructions of a single image. The proposed MIAVLM architecture uses a Multiview Attributes Perceiver (MAP) to aggregate multiple visual prompts, neutralize input-order effects, and align visual cues with a frozen LLM, and it employs negative instructions to curb the bias towards "Yes" answers. Experiments on the HoOA benchmark show more attribute-consistent responses than the baselines and indicate that feeding the views as separate inputs matters, compared with a single image or a 9-in-1 concatenation. The approach offers a practical improvement for fine-grained attribute reasoning in LVLMs and highlights the role of multiview visual prompting and instruction design in reducing hallucinations.

Abstract

Current popular Large Vision-Language Models (LVLMs) suffer from Hallucinations on Object Attributes (HoOA), which lead them to misjudge fine-grained attributes in the input images. Leveraging significant advancements in 3D generation from a single image, this paper proposes a novel method to mitigate HoOA in LVLMs. The method uses multiview images sampled from generated 3D representations as visual prompts for LVLMs, thereby providing additional visual information from other viewpoints. Furthermore, we observe that the input order of the multiview images significantly affects the performance of LVLMs. Consequently, we devise the Multiview Image Augmented VLM (MIAVLM), which incorporates a Multiview Attributes Perceiver (MAP) submodule that simultaneously eliminates the influence of input image order and aligns visual information from the multiview images with Large Language Models (LLMs). In addition, we design and employ negative instructions to mitigate LVLMs' bias towards "Yes" responses. Comprehensive experiments demonstrate the effectiveness of our method.
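
The abstract states that MAP removes the dependence on the order in which the multiview images are fed in. The sketch below is not the authors' MAP; it only illustrates one generic way such order independence can arise: a fixed set of query vectors attends over the view features with no positional information, so shuffling the views leaves the pooled output unchanged up to floating-point rounding. The function name attention_pool, the feature dimension, the number of queries, and the choice of nine views are all assumptions made for this illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(queries, view_features):
    """Hypothetical pooling step: each query attends over the set of views.

    The view features serve as both keys and values, and no positional
    encoding is added, so the result cannot depend on the view order.
    """
    dim = len(view_features[0])
    pooled = []
    for q in queries:
        # Scaled dot-product score between this query and every view.
        scores = [
            sum(qi * vi for qi, vi in zip(q, v)) / math.sqrt(dim)
            for v in view_features
        ]
        weights = softmax(scores)
        # Weighted sum of the view features.
        pooled.append(
            [sum(w * v[d] for w, v in zip(weights, view_features)) for d in range(dim)]
        )
    return pooled


random.seed(0)
DIM, NUM_VIEWS, NUM_QUERIES = 4, 9, 2  # illustrative sizes, not from the paper

# Stand-ins for per-view visual features and for learned queries.
views = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_VIEWS)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

shuffled = views[:]
random.shuffle(shuffled)

out_a = attention_pool(queries, views)
out_b = attention_pool(queries, shuffled)

max_diff = max(
    abs(a - b) for row_a, row_b in zip(out_a, out_b) for a, b in zip(row_a, row_b)
)
print(f"max difference after shuffling the views: {max_diff:.2e}")
```

By contrast, a model that concatenates the views into one token sequence with positional encodings would generally produce different outputs for different orderings, which is consistent with the order sensitivity the abstract reports for existing LVLMs.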

Paper Structure (11 sections, 4 equations, 5 figures, 2 tables)

This paper contains 11 sections, 4 equations, 5 figures, 2 tables.

Introduction
Method
Model Architecture
Visual Extractor
Multihead Sampler
Multiview Attributes Perceiver
Experiments
Benchmark Settings and Implementation Details
The Performance of LVLMs on HoOA Benchmark
The Influence of Multiview Images Input Order on LVLMs
Conclusion

Figures (5)

Figure 1: Illustration of the HoOA problem.
Figure 2: An overview of the MIAVLM model. Frozen parts are blue and marked with a snowflake while trainable parts are red and marked with a flame.
Figure 3: An overview of the Multihead Sampler.
Figure 4: The structure of Multiview Attributes Perceiver.
Figure 5: The influence of multiview image input order on OpenFlamingo (Awadalla et al., 2023) and MIAVLM (ours). Separate markers: outliers. Yellow line: median. OF: OpenFlamingo.
