CELLO: Causal Evaluation of Large Vision-Language Models

Meiqi Chen, Bo Peng, Yan Zhang, Chaochao Lu

TL;DR

The paper addresses the limited causal reasoning capabilities of large vision-language models (LVLMs) by proposing a fine-grained causal framework for interactions among humans and objects, and by introducing CELLO, a dataset of 14,094 causal questions with explicit causal graphs that spans four levels of causality: discovery, association, intervention, and counterfactual. It pairs CELLO with CELLO-CoT, a causally inspired prompting strategy that decomposes each problem into structured reasoning steps to elicit chain-of-thought solutions from LVLMs. The authors describe the dataset construction pipeline, evaluate ten LVLMs, and report ablations, robustness tests, and an error analysis, showing that current models struggle with causal tasks but benefit from CELLO-CoT. The work offers a formal benchmark and a practical prompting technique for evaluating and developing causal reasoning in vision-language systems, with implications for embodied AI and autonomous systems.

Abstract

Causal reasoning is fundamental to human intelligence and crucial for effective decision-making in real-world environments. Despite recent advancements in large vision-language models (LVLMs), their ability to comprehend causality remains unclear. Previous work typically focuses on commonsense causality between events and/or actions, which is insufficient for applications like embodied agents and lacks the explicitly defined causal graphs required for formal causal reasoning. To overcome these limitations, we introduce a fine-grained and unified definition of causality involving interactions between humans and/or objects. Building on the definition, we construct a novel dataset, CELLO, consisting of 14,094 causal questions across all four levels of causality: discovery, association, intervention, and counterfactual. This dataset surpasses traditional commonsense causality by including explicit causal graphs that detail the interactions between humans and objects. Extensive experiments on CELLO reveal that current LVLMs still struggle with causal reasoning tasks, but they can benefit significantly from our proposed CELLO-CoT, a causally inspired chain-of-thought prompting strategy. Both quantitative and qualitative analyses from this study provide valuable insights for future research. Our project page is at https://github.com/OpenCausaLab/CELLO.

Paper Structure (55 sections, 20 figures, 5 tables)


Figures (20)

  • Figure 1: An example of causal reasoning in the vision-language context. LVLMs (e.g., GPT-4o) might generate inappropriate responses due to a limited understanding of causal relationships.
  • Figure 2: Three different causal relationships considered in the vision-language context: object-object, human-object, and human-human causal relationships.
  • Figure 3: Dataset construction pipeline of CELLO (using confounder identification task as an example). First, we extract causal graphs from scene graphs that include relationships and regions within an image. Then, we select corresponding causal tasks based on the ladder of causation. Finally, causal questions are constructed by employing templates with an LLM. We consider four types of causal graphs and twelve different causal tasks in total.
  • Figure 4: Question quality of CELLO compared to other vision-language datasets in terms of lexical diversity and fluency.
  • Figure 5: Illustration of our CELLO-CoT strategy.
  • ...and 15 more figures
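
The Figure 3 caption uses confounder identification as its running example of a task defined on an explicit causal graph. A minimal sketch of that idea follows, assuming a simplified definition in which a confounder is any common ancestor of the two queried variables. The edge list, the variable names, and the functions `ancestors` and `find_confounders` are hypothetical illustrations, not taken from the CELLO dataset or its code.

```python
from collections import defaultdict


def ancestors(edges, node):
    """Return every variable with a directed path into `node`."""
    parents = defaultdict(set)
    for cause, effect in edges:
        parents[effect].add(cause)

    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def find_confounders(edges, treatment, outcome):
    """Simplified assumption: a confounder is a common ancestor of both variables."""
    common = ancestors(edges, treatment) & ancestors(edges, outcome)
    return common - {treatment, outcome}


# Hypothetical toy causal graph, written as (cause, effect) pairs.
toy_edges = [
    ("wind", "umbrella"),
    ("wind", "hat"),
    ("umbrella", "shade"),
]

confounders = find_confounders(toy_edges, "umbrella", "hat")
print(sorted(confounders))
```

In this toy graph, "wind" influences both "umbrella" and "hat", so it is returned as the confounder for that pair. A question template could then ask which variable is the common cause of the two.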