What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models

Letian Zhang; Xiaotong Zhai; Zhongkai Zhao; Yongshuo Zong; Xin Wen; Bingchen Zhao

TL;DR

Presents C-VQA, a novel benchmark to evaluate counterfactual reasoning in multi-modal language models by augmenting VQAv2-based questions with counterfactual presuppositions and including a synthetic dataset. The authors find that state-of-the-art end-to-end and neuro-symbolic models, including GPT-4V, fail to consistently handle counterfactual queries, with large performance drops especially for indirect and boolean questions. The work also reveals gender-related biases in model responses and demonstrates limited generalization to synthetic data. The dataset and code are released to facilitate future research toward human-level vision-language reasoning.

Abstract

Counterfactual reasoning, a fundamental aspect of human cognition, involves contemplating alternatives to established facts or past events, significantly enhancing our abilities in planning and decision-making. In light of the advancements in current multi-modal large language models, we explore their effectiveness in counterfactual reasoning. To facilitate this investigation, we introduce a novel dataset, C-VQA, specifically designed to test the counterfactual reasoning capabilities of modern multi-modal large language models. This dataset is constructed by infusing original questions with counterfactual presuppositions, spanning various types such as numerical and boolean queries. It encompasses a mix of real and synthetic data, representing a wide range of difficulty levels. Our thorough evaluations of contemporary vision-language models using this dataset have revealed substantial performance drops, with some models showing up to a 40% decrease, highlighting a significant gap between current models and human-like vision reasoning capabilities. We hope our dataset will serve as a vital benchmark for evaluating the counterfactual reasoning capabilities of models. Code and dataset are publicly available at https://bzhao.me/C-VQA/.
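The construction and evaluation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the helper names (`add_presupposition`, `accuracy`, `performance_drop`), the exact-match scoring rule, and the reading of "performance drop" as an absolute accuracy difference are all assumptions made here for clarity.

```python
from typing import Sequence


def add_presupposition(presupposition: str, question: str) -> str:
    """Prefix a question with a counterfactual presupposition (hypothetical helper)."""
    q = question.strip()
    # Lower-case the first letter so the joined sentence reads naturally.
    if q:
        q = q[0].lower() + q[1:]
    return f"{presupposition.strip().rstrip(',')}, {q}"


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Exact-match accuracy after trimming and lower-casing (assumed scoring rule)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    hits = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return hits / len(answers)


def performance_drop(original_acc: float, counterfactual_acc: float) -> float:
    """Absolute accuracy drop; positive means worse on counterfactual questions."""
    return original_acc - counterfactual_acc
```

For example, `add_presupposition("If the TV was off", "What is on the screen?")` yields a single counterfactual question, and `performance_drop` compares a model's accuracy on the original and counterfactual versions of the same question set.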

Paper Structure (35 sections, 9 figures, 7 tables)

Introduction
Related Works
Visual Question Answering.
Evaluation of Reasoning Abilities.
Multi-Modal LLM Benchmarks
Dataset
Annotation
Counterfactual presupposition type of C-VQA-Real.
Question and answer annotation of C-VQA-Real.
Verification
(i) Whether the new question is image-related?
(ii) Whether the new answer is reasonable?
Implementation of C-VQA-Synthetic
Flower-Counting Puzzles.
Dot-Counting Puzzles.
...and 20 more sections

Figures (9)

Figure 1: Examples of C-VQA (top), and performance comparison of LLaVA-1.5 (Liu et al., 2023) with and without counterfactuality (bottom). C-VQA is constructed by adding counterfactual presuppositions to the questions. We observe that state-of-the-art models all exhibit significant performance drops on the counterfactual questions.
Figure 2: Our annotation flow for C-VQA-Real. We select images and questions from the VQAv2 dataset (Antol et al., 2015), and then use ChatGPT to add counterfactual presuppositions to the questions and obtain the corresponding answers. All questions and answers are carefully inspected by human annotators.
Figure 3: Breakdown of answers in the numerical groups of C-VQA-Real. We show the percentage of each answer in the numerical direct group and the numerical indirect group. The shares of answers 0, 1, and 2 are higher in the indirect group, while the shares of the other answers are lower.
Figure 4: Qualitative example of biases in MLLMs. Given similar questions, InstructBLIP (Vicuna-7B) provides correct answers for the male instance but incorrect answers for the female instance.
Figure 5: Performance difference between original and counterfactual questions on the male and female subgroups of C-VQA-Real. End-to-end models are often biased toward the male subgroup, while neuro-symbolic models are biased toward the female subgroup. The larger the gap between the two performance differences, the larger the bias.
...and 4 more figures
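
The bias measure described in the Figure 5 caption (the gap between the per-subgroup performance differences) can be sketched as below. The function names and the sign convention are assumptions made for illustration, and the numbers in the usage note are invented, not results from the paper.

```python
from typing import Tuple

# (accuracy on original questions, accuracy on counterfactual questions)
AccPair = Tuple[float, float]


def subgroup_drop(acc: AccPair) -> float:
    """Accuracy lost when moving from original to counterfactual questions."""
    original_acc, counterfactual_acc = acc
    return original_acc - counterfactual_acc


def bias_gap(male: AccPair, female: AccPair) -> float:
    """Female-subgroup drop minus male-subgroup drop (assumed sign convention).

    A positive value means the model degrades more on the female subgroup;
    a negative value means it degrades more on the male subgroup.
    """
    return subgroup_drop(female) - subgroup_drop(male)
```

With made-up accuracies, `bias_gap((0.7, 0.6), (0.7, 0.5))` is about 0.1: the model loses roughly 10 more accuracy points on the female subgroup than on the male subgroup. A value near zero would indicate similar degradation on both subgroups.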
