Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models

Zehao Wang, Xinpeng Liu, Yudonglin Zhang, Xiaoqian Wu, Zhou Fang, Yifan Fang, Junfu Pu, Cewu Lu, Yong-Lu Li

TL;DR

Verb Mirage identifies verb-level hallucination as a critical but neglected failure mode in multimodal LLMs and introduces a verb-centered benchmarking framework using HICO and CharadesEgo. The study reveals that existing object-focused mitigation methods fail to address verbs, and models rely heavily on object cues with miscalibrated verb tokens. A rich verb-knowledge-based fine-tuning approach, leveraging Pangea verb semantics and LoRA, significantly reduces verb hallucination but still leaves substantial gaps. The work highlights the need for verb-aware data and methods to improve action understanding in vision-language systems.

Abstract

Multimodal Large Language Models (MLLMs) have garnered significant attention recently and demonstrate outstanding capabilities in various tasks such as OCR, VQA, captioning, etc. However, hallucination remains a persistent issue. While numerous methods have been proposed to mitigate hallucinations, achieving notable improvements, these methods primarily focus on mitigating hallucinations about object/noun-related concepts. Verb concepts, crucial for understanding human actions, have been largely overlooked. In this paper, to the best of our knowledge, we are the first to investigate the verb hallucination phenomenon of MLLMs from various perspectives. Our findings reveal that most state-of-the-art MLLMs suffer from severe verb hallucination. To assess the effectiveness of existing mitigation methods for object concept hallucination on verb hallucination, we evaluated these methods and found that they do not effectively address verb hallucination. To address this issue, we propose a novel rich verb knowledge-based tuning method to mitigate verb hallucination. The experiment results demonstrate that our method significantly reduces hallucinations related to verbs.

Paper Structure

This paper contains 38 sections, 4 equations, 19 figures, 8 tables.

Figures (19)

  • Figure 1: Besides the well-discussed object hallucination, in this paper, we unveil the severe verb hallucination of state-of-the-art MLLMs with our designed benchmarks. All models show low object hallucination (on POPE) but severe verb hallucination. Gemini-1.5-Flash and GPT-4-Turbo are tested with 100 randomly sampled questions.
  • Figure 1: Results on YN and MC questions w/ and w/o object reference. Red: high recall. Blue: low precision. Bold: higher MC acc w/ object reference than w/o object reference.
  • Figure 2: We probe MLLM verb hallucination from various perspectives, e.g., question formats, the existence of object correlation, different fields of view, image qualities, verb semantics, and image semantics.
  • Figure 3: Comparison of YN questions with correct answer No on rare and common subsets.
  • Figure 4: Comparison between objects and verbs.
  • ...and 14 more figures
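
The captions above report recall and precision on yes/no (YN) questions. As a minimal sketch of how such metrics are conventionally computed, the snippet below treats "yes" as the positive class. The function name `yn_metrics`, the `yes_ratio` field, and the sample answers are hypothetical and are not taken from the paper, whose exact evaluation protocol may differ.

```python
from typing import Dict, Sequence


def yn_metrics(predictions: Sequence[str], labels: Sequence[str]) -> Dict[str, float]:
    """Score yes/no answers, treating "yes" as the positive class.

    Hypothetical helper for illustration; assumes every label is "yes" or "no".
    Any prediction other than "yes" is counted as a "no" answer.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")

    # Normalize case and whitespace before comparing.
    pairs = [(p.strip().lower(), l.strip().lower()) for p, l in zip(predictions, labels)]
    total = len(pairs)

    tp = sum(1 for p, l in pairs if p == "yes" and l == "yes")  # correct "yes"
    fp = sum(1 for p, l in pairs if p == "yes" and l == "no")   # hallucinated "yes"
    fn = sum(1 for p, l in pairs if p != "yes" and l == "yes")  # missed "yes"
    tn = total - tp - fp - fn                                   # correct "no"

    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "yes_ratio": (tp + fp) / total if total else 0.0,
    }


if __name__ == "__main__":
    # Made-up answers: the model says "yes" to three of four questions,
    # but only one question actually has the answer "yes".
    preds = ["yes", "yes", "yes", "no"]
    gold = ["yes", "no", "no", "no"]
    print(yn_metrics(preds, gold))
```

In this made-up example the model over-answers "yes", so recall is perfect while precision is low, which is the high-recall, low-precision pattern the caption's color coding points to.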