LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models

Han Qiu, Jiaxing Huang, Peng Gao, Qin Qi, Xiaoqin Zhang, Ling Shao, Shijian Lu

TL;DR

LongHalQA is an LLM-free hallucination benchmark comprising 6K long and complex hallucination texts. It introduces two new tasks, hallucination discrimination and hallucination completion, which unify discriminative and generative evaluations in a single multiple-choice-question form and yield more reliable and efficient evaluation without the need for LLM evaluators.

Abstract

Hallucination, a phenomenon where multimodal large language models (MLLMs) tend to generate textual responses that are plausible but unaligned with the image, has become a major hurdle in various MLLM-related applications. Several benchmarks have been created to gauge the hallucination levels of MLLMs, by either raising discriminative questions about the existence of objects or introducing LLM evaluators to score the text generated by MLLMs. However, the discriminative data largely involve simple questions that are not aligned with real-world text, while the generative data involve LLM evaluators that are computationally intensive and unstable due to their inherent randomness. We propose LongHalQA, an LLM-free hallucination benchmark that comprises 6K long and complex hallucination texts. LongHalQA features GPT4V-generated hallucinatory data that are well aligned with real-world scenarios, including object-level descriptions, image-level descriptions, and multi-round conversations averaging 14, 130, and 189 words, respectively. It introduces two new tasks, hallucination discrimination and hallucination completion, which unify discriminative and generative evaluations in a single multiple-choice-question form and yield more reliable and efficient evaluations without the need for LLM evaluators. Further, we propose an advanced pipeline that greatly facilitates the construction of future hallucination benchmarks with long and complex questions and descriptions. Extensive experiments over multiple recent MLLMs reveal various new challenges that arise when they handle hallucinations in long and complex textual data. Dataset and evaluation code are available at https://github.com/hanqiu-hq/LongHalQA.
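
To illustrate why a multiple-choice-question form removes the need for an LLM evaluator, here is a minimal sketch of how such responses could be scored by plain string matching. It is not the authors' evaluation code (see the repository linked above for that); the function names `extract_choice` and `mcq_accuracy`, the four-option `"ABCD"` layout, and the letter-extraction rule are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence


def extract_choice(response: str, options: str = "ABCD") -> Optional[str]:
    """Return the first standalone option letter in a model response, or None.

    Hypothetical helper: the response is upper-cased and searched for a
    single letter from `options` surrounded by word boundaries.
    """
    match = re.search(rf"\b([{options}])\b", response.strip().upper())
    return match.group(1) if match else None


def mcq_accuracy(
    responses: Sequence[str], answers: Sequence[str], options: str = "ABCD"
) -> float:
    """Fraction of responses whose extracted letter equals the gold answer."""
    if len(responses) != len(answers):
        raise ValueError("responses and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        extract_choice(resp, options) == gold.upper()
        for resp, gold in zip(responses, answers)
    )
    return correct / len(answers)
```

Because scoring reduces to comparing an option letter against the gold answer, the result is deterministic for a fixed set of responses. A simple rule like this one can misread free-form answers (for example, the English article "a" would be taken as option A), so a real harness would likely constrain the output format more tightly.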

Paper Structure

This paper contains 19 sections, 10 figures, 6 tables.

Figures (10)

  • Figure 1: LongHalQA features two novel tasks, namely Hallucination Discrimination and Hallucination Completion, which unify discriminative and generative evaluations into the same multiple-choice-question form without requiring costly LLM evaluations. It comprises three types of long-context data: Object-level Description, Image-level Description, and Multi-round Conversation. Compared with the short and simple questions in existing benchmarks, such as "Is there an {object} in the image?", the three types of data are more open-ended, richer in contextual information, and closer to real-world data. The white circle in the image highlights the hallucination of passengers.
  • Figure 2: LongHalQA includes complex hallucination annotations involving logic and textual consistency, which are closer to hallucinations in real-world MLLM application scenarios.
  • Figure 3: Comparison of evaluation times under different settings and MLLMs. Our proposed multiple-choice hallucination completion task is significantly faster than other (existing) setups, especially for large models. We measure the time taken to evaluate 1,000 image-text pairs under three different evaluation settings. Only the time that MLLMs take to generate text is measured; the time for evaluations by other LLM evaluators is not included. All MLLMs are tested on one A100, except LLaVA 1.6-34B and Qwen2-VL-72B, which are tested on an H100.
  • Figure 4: Visualizations of hallucination types from H1 to H6.
  • Figure 5: Visualizations of hallucination types from H7 to H12.
  • ...and 5 more figures
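
The timing protocol described for Figure 3 (wall-clock time of the MLLM's text generation only, over a fixed set of image-text pairs) could be reproduced with a harness along the following lines. This is only a sketch under stated assumptions: `time_generation` and the `generate(image, prompt)` callable are hypothetical stand-ins for whatever model interface is actually used, and nothing here comes from the paper's code.

```python
import time
from typing import Any, Callable, Iterable, List, Tuple


def time_generation(
    generate: Callable[[Any, str], str],
    pairs: Iterable[Tuple[Any, str]],
) -> Tuple[float, List[str]]:
    """Run `generate` on every (image, prompt) pair and time only that loop.

    `generate` is a hypothetical wrapper around an MLLM's generation call.
    Any later scoring step is deliberately left outside the timed region.
    """
    outputs: List[str] = []
    start = time.perf_counter()
    for image, prompt in pairs:
        outputs.append(generate(image, prompt))
    elapsed = time.perf_counter() - start
    return elapsed, outputs
```

With a multiple-choice prompt the model only needs to emit a short option label, whereas free-form generation produces a full description, which is one plausible reason the timed region is shorter in the multiple-choice setting.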