VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models

Haoyi Qiu, Wenbo Hu, Zi-Yi Dou, Nanyun Peng

TL;DR

The paper proposes a large language model (LLM)-based two-stage evaluation framework that generalizes the popular CHAIR metric and incorporates both faithfulness and coverage into the evaluation. It highlights the critical balance between faithfulness and coverage of model outputs and encourages future work to address hallucinations in LVLMs while keeping their outputs informative.

Abstract

Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs, undermining their reliability. A comprehensive quantitative evaluation is necessary to identify and understand the extent of hallucinations in these models. However, existing benchmarks are often limited in scope, focusing mainly on object hallucinations. Furthermore, current evaluation methods struggle to effectively address the subtle semantic distinctions between model outputs and reference data, as well as the balance between hallucination and informativeness. To address these issues, we introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases. Moreover, we propose a large language model (LLM)-based two-stage evaluation framework that generalizes the popular CHAIR metric and incorporates both faithfulness and coverage into the evaluation. Experiments on 10 established LVLMs demonstrate that our evaluation metric is more comprehensive and better correlated with humans than existing work when evaluating on our challenging human-annotated benchmark dataset. Our work also highlights the critical balance between faithfulness and coverage of model outputs, and encourages future works to address hallucinations in LVLMs while keeping their outputs informative.
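
As a rough illustration of what the two quantities measure, here is a minimal Python sketch. It assumes that faithfulness is the fraction of generated features that can be matched to a ground-truth feature (the complement of a CHAIR-style hallucination rate), and that coverage is the fraction of ground-truth features the output mentions. The function name, the `matches` dictionary, and the example features are hypothetical. The paper's exact formulas and its LLM-based matching step are not reproduced here.

```python
def faithfulness_and_coverage(generated, ground_truth, matches):
    """Score one caption (illustrative definitions, not the paper's exact formulas).

    generated:    features extracted from the model output
    ground_truth: human-annotated features for the image
    matches:      maps a generated feature to the ground-truth feature it was
                  aligned with (assumed here to come from an LLM matching step)
    """
    generated = set(generated)
    ground_truth = set(ground_truth)

    # Generated features that were aligned to some ground-truth feature.
    matched_generated = {g for g in generated if matches.get(g) in ground_truth}
    # Ground-truth features that at least one generated feature refers to.
    covered_truth = {matches[g] for g in matched_generated}

    faithfulness = len(matched_generated) / len(generated) if generated else 0.0
    coverage = len(covered_truth) / len(ground_truth) if ground_truth else 0.0
    return faithfulness, coverage


# Hypothetical example: "puppy" is matched to "dog"; "bench" has no match
# and therefore counts as a hallucination.
faithfulness, coverage = faithfulness_and_coverage(
    generated=["puppy", "frisbee", "bench"],
    ground_truth=["dog", "frisbee", "grass", "tree"],
    matches={"puppy": "dog", "frisbee": "frisbee"},
)
print(f"faithfulness={faithfulness:.2f}, coverage={coverage:.2f}")
```

In this example, 2 of the 3 generated features are matched and 2 of the 4 ground-truth features are covered. A model can raise faithfulness by saying less, which lowers coverage, so the two scores are best read together.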

Paper Structure

This paper contains 31 sections, 3 equations, 6 figures, and 17 tables.

Figures (6)

  • Figure 1: Example of hallucination in the open-vocabulary generation task of LVLMs. Our proposed framework can identify objects, attributes, and relations from the generated captions and provide a comprehensive evaluation of faithfulness and coverage. We highlight hallucinated features and uncovered features.
  • Figure 2: Overview of our proposed benchmark VALOR-Bench collection procedure: (1) Image collection (\ref{sec:images_collection}): (a) Co-occurrence statistics calculation (\ref{sec:dependecies_calculation}): We employ two statistical measures to determine co-occurring features: frequencies and conditional probabilities; (b) Image extraction (\ref{sec:extraction_steps}): Next, we leverage the identified co-occurrence statistics to systematically extract images from existing datasets; (2) Human Annotations (\ref{sec:annotation}): Finally, we manually annotate each image within the distinct feature subsets, adhering to the definition in \ref{sec:definition}. Here, we provide an example of how we use the co-occurrence statistics to select images for object subsets and add human annotations for later evaluation.
  • Figure 3: Overview of the VALOR-Eval evaluation framework: (1) First, LVLMs generate captions for VALOR-Bench benchmark images. (2) LLMs are then employed to extract the pivotal features from the generated descriptions. (3) Subsequently, these features are aligned with a pre-defined list of ground-truth features using LLMs, producing two essential outputs: a dictionary of matched features and a more extensive dictionary encompassing broader conceptual matches. (4) Finally, we calculate two key metrics: faithfulness and coverage. These metrics measure the LVLMs' comprehension by evaluating how well the generated captions encapsulate the salient features of the images and the breadth of concepts they cover, respectively.
  • Figure 4: Object existence evaluation example from three representative models on our benchmark VALOR-Bench. Text in red indicates the models' hallucinations.
  • Figure 5: Positional relation evaluation example from three representative models on our benchmark VALOR-Bench. Text in red indicates the models' hallucinations.
  • ...and 1 more figures
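
The two co-occurrence statistics named in the Figure 2 caption, frequencies and conditional probabilities, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the annotation format (one list of feature names per image), the function name, and the toy data are hypothetical, and the paper's actual image-selection rules built on top of these statistics are not shown.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_stats(annotations):
    """Return pair frequencies and conditional probabilities P(b | a).

    annotations: one iterable of feature names per image (assumed format).
    """
    single = Counter()  # number of images containing each feature
    pair = Counter()    # number of images containing both a and b, keyed (a, b)
    for features in annotations:
        unique = set(features)  # count each feature at most once per image
        single.update(unique)
        pair.update(permutations(sorted(unique), 2))

    # P(b | a) = count(a and b) / count(a), stored under the key (a, b).
    cond_prob = {(a, b): n / single[a] for (a, b), n in pair.items()}
    return pair, cond_prob


# Hypothetical toy annotations for four images.
annotations = [
    ["table", "chair"],
    ["table", "chair", "cup"],
    ["table"],
    ["chair"],
]
pair_freq, cond_prob = cooccurrence_stats(annotations)
print(pair_freq[("table", "chair")])  # images containing both features
print(cond_prob[("table", "chair")])  # P(chair | table)
print(cond_prob[("cup", "table")])    # P(table | cup)
```

Pairs with a high conditional probability are the kind of strong association a model might over-rely on, which is presumably why such statistics are useful for picking challenging images.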