Trust but Verify: Programmatic VLM Evaluation in the Wild

Viraj Prabhu, Senthil Purushwalkam, An Yan, Caiming Xiong, Ran Xu

TL;DR

This work builds a benchmark by giving a large language model a high-fidelity scene-graph representation constructed from a hyper-detailed image caption, and proposes a programmatic evaluation strategy that measures both the helpfulness and truthfulness of a response within a unified scene-graph-based framework.

Abstract

Vision-Language Models (VLMs) often generate plausible but incorrect responses to visual queries. However, reliably quantifying the effect of such hallucinations in free-form responses to open-ended queries is challenging, as it requires visually verifying each claim within the response. We propose Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm for evaluating VLM responses to open-ended queries. To construct PROVE, we provide a large language model (LLM) with a high-fidelity scene-graph representation constructed from a hyper-detailed image caption, and prompt it to generate diverse question-answer (QA) pairs, as well as programs that can be executed over the scene-graph object to verify each QA pair. We thus construct a benchmark of 10.5k challenging but visually grounded QA pairs. Next, to evaluate free-form model responses to queries in PROVE, we propose a programmatic evaluation strategy that measures both the helpfulness and truthfulness of a response within a unified scene-graph-based framework. We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in fact able to achieve a good balance between the two. Project page: https://prove-explorer.netlify.app/
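
To make the idea of a program "executed over the scene graph object to verify each QA pair" concrete, here is a minimal sketch in Python. The `SceneGraph` class, its methods, and the example QA pair are all hypothetical and chosen only for illustration; they are not the PROVE API, and the actual scene-graph representation and generated programs may look quite different.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): entities with attributes, plus
    (subject, relation, object) edges between entities."""
    attributes: Dict[str, Set[str]] = field(default_factory=dict)
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)

    def add_entity(self, name: str, *attrs: str) -> None:
        self.attributes.setdefault(name, set()).update(attrs)

    def add_relation(self, subj: str, rel: str, obj: str) -> None:
        self.relations.add((subj, rel, obj))

    def entities_with(self, attr: str) -> List[str]:
        """All entities that carry the given attribute."""
        return sorted(e for e, a in self.attributes.items() if attr in a)

    def related(self, rel: str, obj: str) -> List[str]:
        """All subjects s such that (s, rel, obj) is an edge."""
        return sorted(s for s, r, o in self.relations if r == rel and o == obj)


def build_example_graph() -> SceneGraph:
    # Stand-in for a graph parsed from a detailed caption such as
    # "A small brown dog sits next to a wooden bench; a red ball lies under it."
    g = SceneGraph()
    g.add_entity("dog", "brown", "small")
    g.add_entity("ball", "red")
    g.add_entity("bench", "wooden")
    g.add_relation("dog", "next to", "bench")
    g.add_relation("ball", "under", "bench")
    return g


# Hypothetical generated QA pair and its verification program.
QUESTION = "What is under the wooden bench?"
ANSWER = "ball"


def verification_program(graph: SceneGraph) -> List[str]:
    """Answer QUESTION using only facts stored in the graph."""
    found: List[str] = []
    for bench in graph.entities_with("wooden"):
        found.extend(graph.related("under", bench))
    return found


def is_verified(graph: SceneGraph, answer: str) -> bool:
    """Keep the QA pair only if the program reproduces the answer."""
    return verification_program(graph) == [answer]


if __name__ == "__main__":
    print(QUESTION, "->", ANSWER, "| verified:", is_verified(build_example_graph(), ANSWER))
```

In this toy setting, a QA pair whose answer the program cannot reproduce from the graph would be discarded, which mirrors the idea of retaining only programmatically verifiable pairs.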

Paper Structure

This paper contains 13 sections, 2 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Top. Existing VLM benchmarks either limit query types to easy-to-evaluate but restrictive binary questions, or use external LLMs to generate open-ended questions (without verifying their validity) and score answers (often without complete image context or a clear scoring rubric). Bottom. We propose PROVE, a new benchmark that constructs high-fidelity scene-graph representations from hyper-detailed image captions, which are queried via an LLM-generated program to verify a free-form generated question-answer pair. At test time, we perform an interpretable programmatic evaluation of the helpfulness and truthfulness of free-form VLM responses by comparing scene graphs.
  • Figure 2: Top. Existing VLM hallucination evaluation benchmarks either measure VLM performance on object existence queries ("discriminative" li2023evaluating) or object precision/recall in generated image captions ("generative, templated" rohrbach2018object), neither of which realistically simulates in-the-wild usage. Some recent benchmarks contain open-ended queries ("generative, free-form" sun2023aligning), which are more realistic but also harder both to generate (e.g. see the unnatural QA pair from GAVIE liu2023mitigating, first from right) and to evaluate with an LLM-as-judge (e.g. see GPT-4 penalizing a correct response that includes details absent from the ground truth in MMHal-Bench sun2023aligning, second from right). Bottom. We propose PROVE, a benchmark of challenging but verifiable open-ended questions that we use to jointly evaluate both the truthfulness and helpfulness of free-form model responses.
  • Figure 3: The PROVE dataset. For each image-caption pair, we generate a high-fidelity scene graph representation with which we prompt an LLM to generate challenging QA pairs and their verification programs. We only retain QA pairs that we can programmatically verify, ensuring diverse but reliable evaluation data that is grounded by design.
  • Figure 4: We plot $\mathsf{hscore}$ and $\mathsf{tscore}$ for VLMs on PROVE. As seen, models with higher helpfulness tend to lag behind on truthfulness, with very few striking a good trade-off between the two. Averaged across models, we observe a weak linear correlation of 0.03 between $\mathsf{hscore}$ and $\mathsf{tscore}$.
  • Figure 5: Example responses from two VLMs that achieve high $\mathsf{hscore}$ (GPT-4o) and high $\mathsf{tscore}$ (LLaVA-1.5 (7B)), respectively. While both models struggle with sub-tasks such as OCR, counting, and reading an analog clock, GPT-4o's errors tend to be less egregious, which leads to a higher $\mathsf{hscore}$.
  • ...and 4 more figures
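
The helpfulness-truthfulness trade-off referred to in the captions above can be illustrated with a small sketch. This excerpt does not define $\mathsf{hscore}$ or $\mathsf{tscore}$, so the code below makes a loud assumption: helpfulness is treated as recall of the ground-truth answer's tuples in the response, and truthfulness as precision of the response's tuples against the scene graph, both with exact tuple matching. The function names and example tuples are hypothetical, and the paper's actual formulation may differ (for example, it may use soft matching rather than exact matching).

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def hscore_sketch(response: Set[Triple], answer: Set[Triple]) -> float:
    """ASSUMED recall-style helpfulness: share of answer tuples the response covers."""
    if not answer:
        return 0.0
    return len(answer & response) / len(answer)


def tscore_sketch(response: Set[Triple], scene: Set[Triple]) -> float:
    """ASSUMED precision-style truthfulness: share of response tuples the scene supports."""
    if not response:
        return 0.0
    return len(response & scene) / len(response)


# Hypothetical scene graph and ground-truth answer, written as triples.
scene = {("dog", "is", "brown"), ("ball", "under", "bench"), ("ball", "is", "red")}
answer = {("ball", "under", "bench")}

# A detailed response that answers the question but adds an unsupported claim.
detailed = {("ball", "under", "bench"), ("ball", "is", "blue")}
# A cautious response that states only a supported fact but misses the answer.
cautious = {("ball", "is", "red")}

for name, resp in [("detailed", detailed), ("cautious", cautious)]:
    print(name, hscore_sketch(resp, answer), tscore_sketch(resp, scene))
```

Under these assumptions the detailed response is fully helpful but only half truthful, while the cautious response is fully truthful but not helpful, which is the kind of tension Figure 4 describes.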