Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning

Da Yin, Liunian Harold Li, Ziniu Hu, Nanyun Peng, Kai-Wei Chang

TL;DR

GD-VCR introduces a geo-diverse visual commonsense reasoning benchmark to probe geo-location biases in vision-language models. By collecting images and QA pairs across Western, East Asian, South Asian, and African contexts and using a three-stage annotation pipeline, the authors show Western-trained V&L models struggle with non-Western scenarios and high-order reasoning. The study reveals larger disparities for geo-specific and high-level questions, with humans generalizing far better than models. The work highlights the importance of geo-aware data and reasoning in multimodal understanding and provides resources to spur further research on inclusive, region-sensitive commonsense reasoning.

Abstract

Commonsense is defined as the knowledge that is shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic location and are shared only locally. For example, the scenarios of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art Vision-and-Language models, VisualBERT and ViLBERT, trained on VCR, a standard multimodal commonsense benchmark with images primarily from Western regions. We then evaluate how well the trained models generalize to answering the questions in GD-VCR. We find that the performance of both models on non-Western regions, including East Asia, South Asia, and Africa, is significantly lower than on the Western region. We analyze the reasons behind the performance disparity and find that the performance gap is larger on QA pairs that: 1) are concerned with culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition. Dataset and code are released at https://github.com/WadeYin9712/GD-VCR.
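The region-level comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the record layout (region, predicted choice, gold choice), the function names `accuracy_by_region` and `gap_to_reference`, and the toy records are all assumptions made for illustration, and the toy values are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_region(records):
    """Compute multiple-choice accuracy per region.

    `records` is an iterable of (region, predicted_choice, gold_choice)
    tuples; this layout is a hypothetical one chosen for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for region, predicted, gold in records:
        total[region] += 1
        if predicted == gold:
            correct[region] += 1
    return {region: correct[region] / total[region] for region in total}


def gap_to_reference(accuracy, reference="West"):
    """Return reference accuracy minus each other region's accuracy."""
    ref = accuracy[reference]
    return {r: ref - a for r, a in accuracy.items() if r != reference}


# Made-up toy predictions, NOT data or results from GD-VCR.
toy_records = [
    ("West", 0, 0), ("West", 1, 1), ("West", 2, 2), ("West", 3, 0),
    ("East Asia", 0, 0), ("East Asia", 1, 2),
    ("East Asia", 3, 3), ("East Asia", 2, 0),
]

toy_accuracy = accuracy_by_region(toy_records)
toy_gaps = gap_to_reference(toy_accuracy)
```

A positive gap means the model answers questions from the reference region more accurately than questions from the compared region. The same grouping could be applied per scenario keyword instead of per region.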

Paper Structure

This paper contains 48 sections, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Examples in GD-VCR. The three images are all about weddings but from different regions (left-to-right order: Western, East Asian, South Asian). Current Vision-and-Language models perform well on answering questions about Western weddings but often make mistakes when encountering wedding scenarios in other regions.
  • Figure 2: Overall annotation pipeline. It is divided into three stages: Stage 1 is to collect images with regional characteristics; Stage 2 is to design QA pairs based on the detected objects; Stage 3 is to generate answer candidates to complete the dataset with the help of the answer choices in the original VCR development set.
  • Figure 3: Visualization of the performance gap on images of the same scenarios in Western and non-Western regions. The larger the characters, the larger the performance gap over the scenarios. The red and blue words are the scenarios whose performance gap is above and below 8%, respectively. Detailed accuracy on these scenarios is shown in the appendix.
  • Figure 4: Statistics of keyword occurrences. "Others" denotes the average occurrences of the keywords appearing fewer than 20 times.
  • Figure 5: Examples of the images regarding "customer". Left-to-right order: Western, South Asian, East Asian. We visualize the predictions of the VisualBERT model fine-tuned on the original VCR training set. The blue blocks denote the correct answer choices. A red block means that VisualBERT predicted the wrong answer. The rightmost value indicates the probability of the corresponding choice being selected by VisualBERT.
  • ...and 1 more figure