Quantity Matters: Towards Assessing and Mitigating Number Hallucination in Large Vision-Language Models
Huixuan Zhang, Junzhe Zhang, Xiaojun Wan
TL;DR
This work identifies number hallucination as a distinct and critical form of error in large vision-language models (LVLMs), focusing on the accurate counting of objects in images. It formalizes the task, constructs a 20k counting dataset derived from MSCOCO, evaluates models with macro-F1, weighted-F1, and MAE, and analyzes inner and outer inconsistencies through binary classification and comparison tasks. A consistency-based training paradigm that combines counting with these related tasks yields gains of up to about 8% over direct finetuning across LVLMs; it is model-agnostic and does not require fine-tuning the LLM. The study also reports that GPT-4V performs better but remains imperfect, underscoring the practical importance of addressing number hallucination for safer and more reliable multimodal reasoning.
Abstract
Large vision-language models have demonstrated impressive skill in handling tasks that involve both vision and language. Nevertheless, these models frequently generate inaccurate information, a problem known as hallucination. In this study, we concentrate on a specific type of hallucination: number hallucination, in which a model incorrectly identifies the number of certain objects in a picture. We perform quantitative evaluations of number hallucination, showing that it is critical in major open-source large vision-language models. Furthermore, we use two related tasks to conduct an in-depth analysis of number hallucination, revealing severe inner and outer inconsistency across all tasks. Based on this analysis, we devise a training approach aimed at improving consistency to reduce number hallucination, which leads to an 8% improvement in performance over direct finetuning methods. Our code and dataset will be released to the community.
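
To make the evaluation metrics named in the TL;DR concrete, the following is a minimal sketch of how macro-F1, weighted-F1, and MAE could be computed for counting predictions. It is not the authors' evaluation code. It assumes each count is treated as a class label for the F1 scores and that the class set is taken from the ground-truth counts only; the function name `counting_metrics` and the toy data are hypothetical.

```python
from collections import Counter


def counting_metrics(y_true, y_pred):
    """Return macro-F1, weighted-F1, and MAE for integer count predictions.

    Assumption: each distinct ground-truth count is one class, and classes
    that appear only in the predictions are not scored.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    support = Counter(y_true)  # number of ground-truth examples per count
    f1_per_class = {}
    for c in support:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = support[c] - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is >= 1 because support[c] >= 1.
        f1_per_class[c] = 2 * tp / (2 * tp + fp + fn)

    n = len(y_true)
    # Macro-F1: unweighted mean over classes.
    macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)
    # Weighted-F1: mean over classes weighted by ground-truth support.
    weighted_f1 = sum(f1_per_class[c] * support[c] for c in support) / n
    # MAE: mean absolute difference between predicted and true counts.
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"macro_f1": macro_f1, "weighted_f1": weighted_f1, "mae": mae}


# Toy example (hypothetical data): four images, one count is off by one.
metrics = counting_metrics([1, 2, 2, 3], [1, 2, 3, 3])
print(metrics)
```

The two F1 variants answer different questions: macro-F1 treats rare counts as equally important as common ones, while weighted-F1 reflects the count distribution of the data. MAE complements both by measuring how far off a wrong count is, which the F1 scores ignore.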
