PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset

Jiazhen Liu; Yuhan Fu; Ruobing Xie; Runquan Xie; Xingwu Sun; Fengzong Lian; Zhanhui Kang; Xirong Li

TL;DR

This work introduces PhD, a large-scale visual hallucination evaluation dataset for MLLMs, designed to quantify and dissect hallucinations across four evaluation modes and five visual recognition tasks. Built with a ChatGPT-assisted pipeline, PhD combines image-specific hallucinatory item (hitem) selection, hitem-embedded question generation, specious / incorrect context generation, and AI-generated counter-common-sense (CCS) images to produce 102k VQA triplets. Evaluation of open-source and proprietary MLLMs, as well as hallucination mitigation methods, reveals mode- and task-dependent weaknesses and shows how visual ambiguity, multi-modal inconsistency, and counter-common-sense content drive hallucinations. The dataset provides a practical, scalable tool for diagnosing hallucination sources and guiding targeted model improvements.

Abstract

Multimodal Large Language Models (MLLMs) hallucinate, which has made visual hallucination evaluation (VHE) an emerging topic. This paper contributes a ChatGPT-Prompted visual hallucination evaluation Dataset (PhD) for objective VHE at a large scale. The essence of VHE is to ask an MLLM questions about specific images to assess its susceptibility to hallucination. Depending on what to ask (objects, attributes, sentiment, etc.) and how the questions are asked, we structure PhD along two dimensions, i.e. task and mode. Five visual recognition tasks, ranging from low-level (object / attribute recognition) to middle-level (sentiment / position recognition and counting), are considered. Besides a normal visual QA mode, which we term PhD-base, PhD also asks questions with specious context (PhD-sec), with incorrect context (PhD-icc), or about AI-generated counter-common-sense images (PhD-ccs). We construct PhD by a ChatGPT-assisted semi-automated pipeline, encompassing four pivotal modules: task-specific hallucinatory item (hitem) selection, hitem-embedded question generation, specious / incorrect context generation, and counter-common-sense (CCS) image generation. With over 14k daily images, 750 CCS images and 102k VQA triplets in total, PhD reveals considerable variability in MLLMs' performance across various modes and tasks, offering valuable insights into the nature of hallucination. As such, PhD stands as a potent tool for VHE and may also play a significant role in the refinement of MLLMs.
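
The two dimensions described in the abstract (four modes, five tasks) can be pictured as a small grid. The following is a minimal sketch, not the authors' code: the task labels, the `build_prompt` helper and the way a context is joined to a question are all assumptions made for illustration. The only facts taken from the text are the mode names, the five tasks, and that PhD-sec and PhD-icc place a context before the question.

```python
from itertools import product

# Mode names as given in the abstract; task labels are shorthand chosen here.
MODES = ["PhD-base", "PhD-sec", "PhD-icc", "PhD-ccs"]
TASKS = ["object", "attribute", "sentiment", "position", "counting"]

# Every mode-task combination that an evaluation would report on.
GRID = list(product(MODES, TASKS))


def build_prompt(mode: str, question: str, context: str = "") -> str:
    """Compose the text input for one sample (hypothetical helper).

    Assumption: in PhD-sec / PhD-icc the context is simply prepended to the
    question, separated by a space. The real formatting may differ.
    """
    if mode in ("PhD-sec", "PhD-icc"):
        if not context:
            raise ValueError(f"{mode} requires a context string")
        return f"{context} {question}"
    if mode in ("PhD-base", "PhD-ccs"):
        # These modes differ in the image used (daily vs. CCS), not in the text.
        return question
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumptions, `len(GRID)` is 20, and a call such as `build_prompt("PhD-icc", "Is there a dog in the image?", "A dog is sleeping on the sofa.")` returns the context followed by the question.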

Paper Structure (15 sections, 6 figures, 6 tables)

Introduction
Related Work
Our Roadmap to PhD
Task-specific Hitem Selection
Hitem-embedded Question Generation
Specious (Incorrect) Context Generation
CCS Image Generation
Dataset Overview and PhD Index
Evaluating MLLMs on PhD
Common Setup
Using PhD for Overall VHE
Using PhD for Mode-Oriented VHE
Using PhD for Task-Oriented VHE
Analysis of MLLM Answer Tendency
Summary and Conclusions

Figures (6)

Figure 1: Illustrations of three major causes of an MLLM's visual hallucination and its evaluation. This paper contributes PhD, a binary VQA-based VHE benchmark, much larger and more challenging than its predecessors. In particular, it has four evaluation modes that explicitly measure an MLLM's performance w.r.t. the three causes, i.e. PhD-base for cause I, PhD-sec and PhD-icc for cause II, and PhD-ccs for cause III.
Figure 2: Proposed semi-automatic pipeline for PhD construction. We use ChatGPT (GPT-4o mini) to generate hitem-embedded questions / contexts for daily images, and Doubao and DALL-E3 for generating CCS images. Depending on what image (daily or CCS) is used and whether a specific context precedes a question, PhD supports four evaluation modes: PhD-base, i.e. questions about daily images w/o context; PhD-sec, i.e. PhD-base plus specious context; PhD-icc, i.e. PhD-base plus incorrect context; and PhD-ccs, i.e. questions about CCS images. By adapting TDIUC annotations, PhD supports binary VQA w.r.t. five visual recognition tasks including object / attribute / sentiment / positional recognition and counting. With 20 mode-task combinations in total, PhD enables a comprehensive VHE.
Figure 3: Qualitative results showing how an MLLM answers visual questions from PhD. The correctness of an answer is automatically determined by matching its first word, either Yes or No, with the ground truth (GT).
Figure 4: PhD-based VHE analytics. Models that require paid services, shown with gray markers, are tested on random-2k.
Figure 5: Impact of LLM size (7B vs 13B) on LLaVA-1.5, LLaVA-1.6 and InstructBLIP. MLLMs using a 13B LLM tend to be better than their counterparts using a 7B LLM on PhD-sec and PhD-icc, yet worse on PhD-base and PhD-ccs.
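
The Figure 3 caption states that an answer is judged by matching its first word, either Yes or No, with the ground truth. A minimal sketch of that rule is given here. The function names are hypothetical, and two details are assumptions rather than anything stated in the source: matching is case-insensitive, and an answer whose first word is neither Yes nor No is counted as wrong. The plain accuracy helper is only illustrative and is not claimed to be the metric used in the paper.

```python
import re
from typing import Optional, Sequence


def first_word_label(answer: str) -> Optional[str]:
    """Return 'yes' or 'no' if the answer starts with that word, else None."""
    match = re.match(r"\s*([A-Za-z]+)", answer)
    if match is None:
        return None
    word = match.group(1).lower()
    return word if word in ("yes", "no") else None


def is_correct(answer: str, ground_truth: str) -> bool:
    """Compare the first word of the model answer with the Yes/No ground truth.

    Assumption: answers that begin with neither 'Yes' nor 'No' are incorrect.
    """
    return first_word_label(answer) == ground_truth.strip().lower()


def accuracy(answers: Sequence[str], ground_truths: Sequence[str]) -> float:
    """Fraction of answers whose first word matches the ground truth."""
    if len(answers) != len(ground_truths):
        raise ValueError("answers and ground_truths must have the same length")
    if not answers:
        return 0.0
    hits = sum(is_correct(a, g) for a, g in zip(answers, ground_truths))
    return hits / len(answers)
```

For example, `is_correct("Yes, there is a dog on the sofa.", "Yes")` is true under this rule, while an answer beginning "The image shows..." is scored as incorrect regardless of what follows.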
