BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind

Yuanyuan Mao, Xin Lin, Qin Ni, Liang He

TL;DR

BDIQA introduces a Theory of Mind (ToM) driven benchmark for VideoQA that evaluates Belief, Desire, and Intention reasoning at two developmental levels. The dataset is synthesized from VirtualHome videos and comprises 3,527 videos and 19,932 QA pairs, with diversified question types and two difficulty tiers to probe cognitive inferences. Across zero-shot, few-shot, and supervised settings, existing pre-trained VideoQA models struggle with cognitive reasoning tasks, while end-to-end models with memory modules and richer visual backbones achieve incremental gains. The study provides two guidelines for enhancing cognitive reasoning in VideoQA: treating perception as a foundation and integrating human-like multi-step reasoning. Both point to the need for ToM-aware architectures in future work.

Abstract

As a foundational component of cognitive intelligence, theory of mind (ToM) can make AI more closely resemble human thought processes, thereby enhancing its interaction and collaboration with humans. In particular, it can significantly improve a model's comprehension of videos in complex scenes. However, current video question answering (VideoQA) datasets focus on causal reasoning within events, and few of them genuinely incorporate human ToM. Consequently, ToM reasoning tasks remain underdeveloped in VideoQA. This paper presents BDIQA, the first benchmark to explore the cognitive reasoning capabilities of VideoQA models in the context of ToM. BDIQA is inspired by the cognitive development of children's ToM and addresses the current deficiencies in machine ToM within datasets and tasks. Specifically, it offers tasks at two difficulty levels, assessing Belief, Desire and Intention (BDI) reasoning in both simple and complex scenarios. We evaluate several mainstream VideoQA methods and diagnose their capabilities under zero-shot, few-shot and supervised learning. We find that the performance of pre-trained models on cognitive reasoning tasks remains unsatisfactory. To address this challenge, we undertake thorough analysis and experimentation, ultimately presenting two guidelines, derived from ablation analysis, for enhancing cognitive reasoning.

Paper Structure (21 sections, 4 figures, 7 tables)

Figures (4)

  • Figure 1: An example of how ToM is involved in explaining human actions. Job takes the only food in the fridge away and leaves the kitchen while Alice is in the living room. Alice’s desire is to have a meal (desire). In the last picture, she is planning to fetch food (intention). She is walking to the empty fridge because she mistakenly thinks that the food is in the fridge (belief), so she holds a false belief about the food.
  • Figure 2: The definitions of perception, desire, belief and intention, and the relationships among them in the human cognitive process.
  • Figure 3: An example of true belief and unsatisfied desire for Alice. Alice fails to have a meal because of Job. During that time they never leave the kitchen, and they hold a true belief about the food that is consistent with the real world. "Fetch food" is a required sub-task. "Switch off TV" is an optional sub-task because it is not a necessary step for "have a meal".
  • Figure 4: Data statistics.