VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks

Shailaja Keyur Sampat, Mutsumi Nakamura, Shankar Kailas, Kartik Aggarwal, Mandy Zhou, Yezhou Yang, Chitta Baral

TL;DR

This paper proposes VL-GLUE, a multitask benchmark for visuo-linguistic reasoning that is quite challenging for existing large-scale vision-language models, and it encourages the development of systems with robust visuo-linguistic reasoning capabilities.

Abstract

Drawing inferences from heterogeneous inputs (such as images, text, and audio) is an important skill that humans rely on to perform day-to-day tasks. A similar ability is desirable for the development of advanced Artificial Intelligence (AI) systems. While state-of-the-art models are rapidly closing the gap with human-level performance on diverse computer vision and NLP tasks separately, they struggle to solve tasks that require joint reasoning over visual and textual modalities. Inspired by GLUE (Wang et al., 2018), a multitask benchmark for natural language understanding, we propose VL-GLUE in this paper. VL-GLUE consists of over 100k samples spanning seven different tasks, which at their core require visuo-linguistic reasoning. Moreover, our benchmark comprises diverse image types (from synthetically rendered figures and day-to-day scenes to charts and complex diagrams) and includes a broad variety of domain-specific text (from cooking, politics, and sports to high-school curricula), demonstrating the need for multi-modal understanding in the real world. We show that this benchmark is quite challenging for existing large-scale vision-language models and encourage the development of systems that possess robust visuo-linguistic reasoning capabilities.

Paper Structure

This paper contains 28 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Examples from Task 1: (top) BlocksWorld dataset [gokhale2019cooking] repurposed to create VL reasoning format; (bottom) example directly incorporated from CLEVR_HYP [sampat2021clevr_hyp]
  • Figure 2: Examples from Task 2: (top & mid) binary VL classification questions based on the NLVR [suhr2019corpus] and COCO [chen2015microsoft] datasets, respectively; (bottom) image-selection VL problem based on PIQA [bisk2020piqa]
  • Figure 3: Examples from Task 3: (top) bar chart showing GDP% for healthcare expenditure of different countries; (bottom) line chart showing Puerto Rico's GDP% over the years. Both charts are generated from tabular data crawled from the CIA factbook [cia2019factbook] and paired with hand-crafted questions that require VL reasoning
  • Figure 4: Examples from Task 4, adapted from the PISA [oecd] test, which involve freeform figures
  • Figure 5: Example from Task 5, which is a subset of MultimodalQA [talmor2020multimodalqa] involving image+text context: the given question cannot be answered without correctly recognizing the person in the image (Tiger Woods) and the corresponding information provided in the passage (the years when Tiger Woods was a top-ranked golf player)
  • ...and 3 more figures