Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions

Zhijie Tan, Yuzhi Li, Shengwei Meng, Xiang Yuan, Weiping Li, Tong Mo, Bingce Wang, Xu Chu

TL;DR

This work addresses hallucinations on object attributes (HoOA) in large vision-language models (LVLMs) by introducing a dedicated benchmark and a mitigation pipeline that uses multiview prompts generated from single-image 3D reconstructions. The proposed MIAVLM architecture, with its Multiview Attributes Perceiver (MAP), aggregates multiple visual prompts, neutralizes input-order effects, and aligns visual cues with a frozen LLM, while negative instructions curb the bias towards "Yes" answers. Experiments on the HoOA benchmark show more attribute-consistent responses than the baselines and indicate that separate multiview inputs work better than a single image or a 9-in-1 concatenated image. The approach offers practical improvements for robust fine-grained attribute reasoning in LVLMs and highlights the role of multiview visual prompting and instruction design in reducing hallucinations.

Abstract

Current popular Large Vision-Language Models (LVLMs) suffer from Hallucinations on Object Attributes (HoOA), which lead them to misjudge fine-grained attributes of objects in the input images. Leveraging significant advances in 3D generation from a single image, this paper proposes a novel method to mitigate HoOA in LVLMs. The method uses multiview images sampled from generated 3D representations as visual prompts for LVLMs, thereby providing additional visual information from other viewpoints. Furthermore, we observe that the input order of the multiview images significantly affects the performance of LVLMs. Consequently, we devise the Multiview Image Augmented VLM (MIAVLM), which incorporates a Multiview Attributes Perceiver (MAP) submodule that simultaneously eliminates the influence of input image order and aligns visual information from multiview images with Large Language Models (LLMs). In addition, we design and employ negative instructions to mitigate LVLMs' bias towards "Yes" responses. Comprehensive experiments demonstrate the effectiveness of our method.
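The abstract does not spell out what a negative instruction looks like. One plausible reading, assumed here, is a question whose correct answer is "No", paired with a positive question so that a model answering "Yes" to everything gets half of them wrong. The sketch below illustrates that idea only; the function name `build_instructions`, the question template, and the attribute vocabulary are hypothetical and are not taken from the paper.

```python
import random

def build_instructions(true_attr, candidate_attrs, obj="object", seed=0):
    """Return [(question, answer), ...]: one positive and one negative question.

    Assumption: a "negative instruction" asks about an attribute the object
    does NOT have, so its ground-truth answer is "No".
    """
    rng = random.Random(seed)
    # Any candidate attribute other than the true one can serve as a distractor.
    distractors = [a for a in candidate_attrs if a != true_attr]
    wrong_attr = rng.choice(distractors)
    template = "Is the {obj} in the image {attr}? Answer Yes or No."
    return [
        (template.format(obj=obj, attr=true_attr), "Yes"),   # positive instruction
        (template.format(obj=obj, attr=wrong_attr), "No"),   # negative instruction
    ]

pairs = build_instructions("red", ["red", "blue", "green"], obj="mug")
for question, answer in pairs:
    print(answer, "<-", question)
```

Balancing the two answer classes in this way is one simple means of exposing a "Yes" bias; the paper's actual instruction design may differ.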

Paper Structure (11 sections, 4 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Illustration of the HoOA problem.
  • Figure 2: An overview of the MIAVLM model. Frozen parts are blue and marked with a snowflake while trainable parts are red and marked with a flame.
  • Figure 3: An overview of the Multihead Sampler.
  • Figure 4: The structure of Multiview Attributes Perceiver.
  • Figure 5: The influence of multiview image input order on OpenFlamingo (Awadalla et al., 2023) and MIAVLM (ours). Isolated markers: outliers. Yellow line: median. OF: OpenFlamingo.
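The order sensitivity examined in Figure 5 can be contrasted with an aggregator that is order-invariant by construction. The following is a minimal NumPy sketch, not the paper's MAP module: it assumes each view is already encoded as one feature vector and pools the views with cross-attention from a fixed set of query vectors, without any per-view positional encoding. The sizes (9 views, 16 dimensions, 4 queries) and the name `attention_pool` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(view_feats, queries):
    """Cross-attend query vectors over view features.

    view_feats: (n_views, d), queries: (n_queries, d) -> (n_queries, d).
    No positional encoding is added to the views (an assumption of this sketch),
    so permuting the rows of view_feats cannot change the result.
    """
    d = view_feats.shape[-1]
    scores = queries @ view_feats.T / np.sqrt(d)   # (n_queries, n_views)
    weights = softmax(scores, axis=-1)             # attention over views
    return weights @ view_feats                    # weighted sum of views

rng = np.random.default_rng(0)
views = rng.normal(size=(9, 16))      # stand-in features for nine views
queries = rng.normal(size=(4, 16))    # stand-in for learned queries
reversed_views = views[::-1]          # same views, reversed input order

pooled = attention_pool(views, queries)
pooled_reversed = attention_pool(reversed_views, queries)

# A plain concatenation of the views, by contrast, depends on their order.
concat = views.reshape(-1)
concat_reversed = reversed_views.reshape(-1)

print("attention pooling order-invariant:", np.allclose(pooled, pooled_reversed))
print("concatenation order-invariant:", np.allclose(concat, concat_reversed))
```

The weighted sum over views is a symmetric function of the view set, which is why reordering the inputs leaves the pooled output unchanged, while the concatenated sequence changes with the order.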