Measuring Faithful and Plausible Visual Grounding in VQA

Daniel Reich, Felix Putze, Tanja Schultz

TL;DR

This work introduces Faithful & Plausible Visual Grounding (FPVG), a quantitative metric for evaluating whether a VQA model bases its answers on question-relevant image objects in a faithful and human-plausible way. FPVG compares a model's answers under three input conditions (all objects, only relevant objects, and only irrelevant objects): the answer should stay the same when the model sees only the relevant objects, and it should change when the model sees only the irrelevant ones. Evaluations of several models on the GQA dataset show that grounding and accuracy can diverge, and that grounding quality plays an important role in out-of-distribution (OOD) generalization. The work also contrasts FPVG with sufficiency and comprehensiveness, demonstrates its faithfulness via object-importance analyses, and discusses limitations related to annotations and detector dependencies. Overall, FPVG provides a practical, interpretable tool to diagnose and improve the grounding behavior of VG-enabled VQA systems and to study grounding's impact on OOD performance.

Abstract

Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems primarily aim to measure a system's reliance on relevant parts of the image when inferring an answer to the given question. Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest in over-reliance on irrelevant image parts or a disregard for the visual modality entirely. Although the inference capabilities of VQA models are often illustrated with a few qualitative examples, most systems are not quantitatively assessed for their VG properties. We believe that an easily calculated criterion for meaningfully measuring a system's VG can help remedy this shortcoming, as well as add another valuable dimension to model evaluation and analysis. To this end, we propose a new VG metric that captures whether a model a) identifies question-relevant objects in the scene, and b) actually relies on the information contained in the relevant objects when producing its answer, i.e., whether its visual grounding is both "faithful" and "plausible". Our metric, called "Faithful and Plausible Visual Grounding" (FPVG), is straightforward to determine for most VQA model designs. We give a detailed description of FPVG and evaluate several reference systems spanning various VQA architectures. Code to support the metric calculations on the GQA data set is available on GitHub.
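
As a rough illustration of the criterion described above, the per-question check can be sketched in a few lines of Python. This is a minimal sketch, not the authors' released code: the names `FPVGSample`, `is_fpvg_grounded`, and `fpvg_plus_rate` are hypothetical, and it assumes the model's three answers per question (all objects, relevant objects only, irrelevant objects only) have already been collected as strings.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FPVGSample:
    """Hypothetical container for one question's three model answers."""
    a_all: str    # answer when the model sees all objects
    a_rel: str    # answer when the model sees only relevant objects
    a_irrel: str  # answer when the model sees only irrelevant objects


def is_fpvg_grounded(sample: FPVGSample) -> bool:
    """True if the answer is unchanged with only relevant objects
    and changes with only irrelevant objects."""
    return sample.a_all == sample.a_rel and sample.a_all != sample.a_irrel


def fpvg_plus_rate(samples: Iterable[FPVGSample]) -> float:
    """Percentage of questions that satisfy the grounding check."""
    items = list(samples)
    if not items:
        return 0.0
    grounded = sum(is_fpvg_grounded(s) for s in items)
    return 100.0 * grounded / len(items)


# Toy answers (made up for illustration, not taken from GQA).
examples = [
    FPVGSample("dog", "dog", "cat"),   # grounded
    FPVGSample("dog", "dog", "dog"),   # answer ignores which objects are shown
    FPVGSample("dog", "cat", "cat"),   # answer changes without irrelevant objects
    FPVGSample("red", "red", "blue"),  # grounded
]
rate = fpvg_plus_rate(examples)
```

Note that the check compares answers with each other, not with the ground truth, so a question can count as grounded even when the model's answer is wrong.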

Paper Structure (45 sections, 12 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Faithful & Plausible Visual Grounding: The VQA model's answer given all objects in the image ($A_{all}$) should equal its answer when given only relevant objects w.r.t. the question ($A_{rel}$), and should differ when given only irrelevant objects ($A_{irrel}$). The figure shows a model's behavior for a question deemed faithfully and plausibly grounded.
  • Figure 2: Examples for the four FPVG sub-categories defined in § \ref{sec:FVG_metric}. Each sub-category encapsulates specific answering behavior for a given question in FPVG's three test cases ($A_{all}$, $A_{rel}$, $A_{irrel}$). Categorization depends on grounding status ("FPVG") and answer correctness ("Acc"). E.g., questions that return a correct answer in $A_{all}$ and $A_{rel}$ and an incorrect answer in $A_{irrel}$ are categorized as (a). The model's behavior in cases (a) and (b) satisfies the criteria for the question to be categorized as faithfully & plausibly visually grounded.
  • Figure 3: Left: Percentage of samples with best (worst) $suff$ & $comp$ scores (medium scores not pictured). Many samples with the $suff$ property lack $comp$ and vice-versa (gray). Right: LOO-based ranking match percentages for samples in $suff$, $comp$ and FPVG (higher is better). Model: UpDn.
  • Figure 4: Sample distribution and answer class flip percentages depending on metric categorization. X-axis: VG quality categories based on $suff$ & $comp$ (left) and FPVG (right). Y-axis: percentage of flipped answers in each category. Note that in this figure, FPVG's formulation is interpreted in terms of $suff$ (Eq. \ref{eq:fvg1}, right side, left term) and $comp$ (right term). Model: UpDn.
  • Figure 5: Correct to incorrect (c2i) answer ratios for questions categorized as $FPVG_{\{+,-\}}$. Data set: GQA-101k.
  • ...and 3 more figures