Uncovering the Full Potential of Visual Grounding Methods in VQA

Daniel Reich, Tanja Schultz

TL;DR

This paper argues that VG-methods in VQA are hampered by flawed evaluation practices that assume access to complete, question-relevant visual content. It introduces True Visual Grounding (TVG) testing and Infusion (INF) training to ensure content-guided learning, and uses both spatial and semantic relevance matching to map ground-truth relevance to input features, including symbolic object representations. Across the GQA and VQA-HAT datasets, the authors show that under corrected conditions VG-methods yield substantial accuracy and FPVG improvements, particularly in OOD settings, with semantic matching and INF training providing the strongest gains. The work highlights the importance of evaluation design for VG in VQA and provides a practical framework and code to enable more accurate assessment of VG-methods, with implications for improving model reliability and out-of-distribution robustness.

Abstract

Visual Grounding (VG) methods in Visual Question Answering (VQA) attempt to improve VQA performance by strengthening a model's reliance on question-relevant visual information. The presence of such relevant information in the visual input is typically assumed in training and testing. This assumption, however, is inherently flawed when dealing with imperfect image representations common in large-scale VQA, where the information carried by visual features frequently deviates from expected ground-truth contents. As a result, training and testing of VG-methods is performed with largely inaccurate data, which obstructs proper assessment of their potential benefits. In this study, we demonstrate that current evaluation schemes for VG-methods are problematic due to the flawed assumption of availability of relevant visual information. Our experiments show that these methods can be much more effective when evaluation conditions are corrected. Code is provided on GitHub.

Paper Structure (41 sections, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Example of Flawed VG that VG-methods in VQA teach based on the unverified assumption of presence of relevant visual information (left). Correct content cues are a prerequisite for teaching True VG (right).
  • Figure 2: Symbolic features.
  • Figure 3: Accuracy improvements from VG-methods compared to respective UpDn baselines. Training (y-axis): DET features with spatial matching ("Flawed", top row), and INF features with semantic matching ("True", bottom row). Striped bars: INF features with spatial matching. Testing (x-axis): Full test ("Flawed") or TVG subset ("True").
  • Figure 4: $FPVG_+$ measured on TVG subsets (ID/OOD) for UpDn. Left: Absolute $FPVG_+$ measurements. Right: $FPVG_+$ improvements compared to respective UpDn baselines. Columns categorize the matching method used for FPVG (see Sec. \ref{sec:vg_measurements}). Striped bars show results for INF-based models trained with spatial matching.
  • Figure 5: VQA-HAT-CP: Accuracy and $FPVG_+$ measurements (all values based on averages over five differently seeded UpDn models). See captions of Fig. \ref{fig:updn_vgmethods} and Fig. \ref{fig:fpvg_updn_gqacp} for a description of these histograms.
  • ...and 2 more figures