Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR

Zhenyang Li, Yangyang Guo, Kejie Wang, Xiaolin Chen, Liqiang Nie, Mohan Kankanhalli

TL;DR

The paper investigates whether Vision-Language Transformers truly exhibit visual commonsense in Visual Commonsense Reasoning (VCR). Through an empirical study of four representative VL Transformers, it reveals that pre-training offers limited transfer to VCR, language bias dominates predictions, the two sub-tasks (Q→A and QA→R) are not effectively coordinated, and tag–object correlations are underutilized. These findings suggest that current high VCR scores may reflect recognition and bias exploitation rather than genuine visual reasoning. The work advocates dataset improvements, reasoning-focused evaluation, knowledge-enhanced pre-training, debiasing, and architecture designs that better integrate visual grounding and reasoning to advance visual commonsense understanding.

Abstract

Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes. To achieve this goal, a model is required to provide an acceptable rationale as the reason for its predicted answer. Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers). These models are first pre-trained on generic large-scale vision-text datasets, and the learned representations are then transferred to the downstream VCR task. Despite their attractive performance, this paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR. In particular, our empirical results pinpoint several shortcomings of existing VL Transformers: small gains from pre-training, unexpected language bias, limited model architecture for the two inseparable sub-tasks, and neglect of the important object-tag correlation. With these findings, we tentatively suggest some future directions in terms of datasets, evaluation metrics, and training tricks. We believe this work could prompt researchers to revisit the intuition and goals of VCR, and thus help tackle the remaining challenges in visual reasoning.

Paper Structure (15 sections, 6 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: An example of VCR. The task is composed of two sub-tasks, Q$\rightarrow$A and QA$\rightarrow$R, where the challenge mainly lies in the cross-modal reasoning required by the latter.
  • Figure 2: Pipeline of Vision-Language Transformers for VCR. Q$\rightarrow$A and QA$\rightarrow$R share the same pipeline where only the input query ($QY$) and response ($RS$) are slightly different.
  • Figure 3: Failure cases from VILLA. Following the default setting, the input to QA$\rightarrow$R uses the correct answer (blue, from Q$\rightarrow$A) rather than the predicted answer (red, from Q$\rightarrow$A). The model makes mistakes on cases that call for fine-grained reasoning.
  • Figure 4: Convergence analysis of three VL Transformers with and without pre-training.
  • Figure 5: Attention distribution from the [CLS] token. We empirically selected the even-numbered layers for demonstration.
  • ...and 5 more figures
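
The captions of Figures 2 and 3 describe how the two sub-tasks share one pipeline and differ only in the query and response. A minimal Python sketch of that input construction follows. It assumes the standard VCR layout of four candidate answers and four candidate rationales per question; the class `VCRExample`, its field names, and the helper `build_pairs` are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VCRExample:
    """Hypothetical container for one VCR item (names are illustrative)."""
    question: str
    answers: List[str]      # candidate answers
    rationales: List[str]   # candidate rationales
    answer_label: int       # index of the correct answer
    rationale_label: int    # index of the correct rationale


def build_pairs(ex: VCRExample, task: str) -> List[Tuple[str, str]]:
    """Return one (query, response) text pair per candidate for a sub-task."""
    if task == "Q->A":
        # Query is the question; responses are the candidate answers.
        return [(ex.question, a) for a in ex.answers]
    if task == "QA->R":
        # Default setting: the query is the question plus the CORRECT answer,
        # not the answer predicted in Q->A; responses are the rationales.
        query = ex.question + " " + ex.answers[ex.answer_label]
        return [(query, r) for r in ex.rationales]
    raise ValueError(f"unknown task: {task}")
```

Each pair would then be combined with the image regions and scored by the model, with the highest-scoring candidate taken as the prediction. Because QA→R is fed the ground-truth answer, a model can pick the right rationale even when its own Q→A prediction is wrong, which is the situation shown in Figure 3.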