NLVR2 Visual Bias Analysis
Alane Suhr, Yoav Artzi
TL;DR
NLVR2 aimed to be robust to language bias, but this work investigates potential visual bias arising from the data collection process. By analyzing how image pairs are reused and labeled, it reveals detectable bias and quantifies a worst-case bias bound, then introduces bias-robust evaluation subsets (balanced and unbalanced) to test model reliance on visual cues. Evaluations of VisualBERT and LXMERT on these subsets show only modest performance changes, suggesting current models do not heavily exploit latent visual bias. The authors propose adding these bias-resistant subsets to the NLVR2 release and advocate evaluating the original NLVR corpus as a cross-check. Overall, the paper provides a practical framework to diagnose and mitigate visual bias in multimodal reasoning benchmarks.
Abstract
NLVR2 (Suhr et al., 2019) was designed to be robust to language bias through a data collection process that resulted in each natural language sentence appearing with both true and false labels. The process did not provide a similar measure of control for visual bias. This technical report analyzes the potential for visual bias in NLVR2. We show that some amount of visual bias likely exists. We then identify a subset of the test data that allows testing model performance in a way that is robust to such potential biases. We show that the performance of existing models (Li et al., 2019; Tan and Bansal, 2019) is relatively robust to this potential bias. We propose to add evaluation on this subset of the data to the NLVR2 evaluation protocol, and to update the official release to include it. A notebook with the code needed to replicate this analysis is available at http://nlvr.ai/NLVR2BiasAnalysis.html.
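As a rough illustration of the kind of check involved, the sketch below groups examples by image pair and separates pairs seen with both labels from pairs seen with only one. It is a minimal sketch under stated assumptions, not the code from the notebook: the field names `pair_id`, `sentence`, and `label`, the function `split_by_pair_labels`, and the toy data are all hypothetical, and the selection rule (keep examples whose image pair appears with both labels) is one plausible reading of a bias-robust subset rather than the report's exact definition.

```python
from collections import defaultdict


def split_by_pair_labels(examples):
    """Split examples by whether their image pair occurs with both labels.

    Each example is a dict with hypothetical keys "pair_id" and "label".
    Returns (both_labels, single_label), each a list of examples.
    """
    # Collect the set of labels observed for every image pair.
    labels_by_pair = defaultdict(set)
    for ex in examples:
        labels_by_pair[ex["pair_id"]].add(ex["label"])

    both_labels, single_label = [], []
    for ex in examples:
        # A pair seen with both True and False cannot be solved
        # from the images alone, so the label is not predictable
        # from the pair identity.
        if len(labels_by_pair[ex["pair_id"]]) == 2:
            both_labels.append(ex)
        else:
            single_label.append(ex)
    return both_labels, single_label


# Toy data (invented for illustration only).
examples = [
    {"pair_id": "p1", "sentence": "s1", "label": True},
    {"pair_id": "p1", "sentence": "s2", "label": False},
    {"pair_id": "p2", "sentence": "s3", "label": True},
    {"pair_id": "p2", "sentence": "s4", "label": True},
]

both, single = split_by_pair_labels(examples)
print(len(both), len(single))
```

On the toy data, the two examples for pair `p1` land in the first list and the two for `p2` in the second. A model that ignored the sentence and keyed only on the images could still score well on the second group, which is why evaluating on the first group is informative.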
