MMToM-QA: Multimodal Theory of Mind Question Answering

Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua B. Tenenbaum, Tianmin Shu

TL;DR

The paper introduces MMToM-QA, the first benchmark for multimodal Theory of Mind (ToM) reasoning, which pairs videos of household activity with text descriptions to assess belief and goal inference. It also presents BIP-ALM, a method that extracts unified symbolic representations from video and text, then combines Bayesian inverse planning with language-model-based action likelihoods to infer mental states across modalities. In the experiments, humans outperform all models, while BIP-ALM significantly outperforms strong baselines such as GPT-4 on multimodal ToM tasks and generalizes to unseen environments. The work highlights the gap in current large models' ToM capabilities and shows the promise of integrating model-based reasoning with language models for multimodal social intelligence.

Abstract

Theory of Mind (ToM), the ability to understand people's mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models.

Paper Structure (40 sections, 2 equations, 11 figures, 7 tables)

Figures (11)

  • Figure 1: Sketch of the MMToM-QA benchmark. Each question is associated with a video stream (representative frames highlighting key moments are shown above for illustration) and text input (illustrative text above is shortened for brevity). In the example video, Emily can see the wine glass on one of the kitchen tables (1st frame) and passes by it without picking it up (2nd frame). At the end of the clip (3rd frame), it appears that she could be walking towards the cabinets on the left side of the room; or she might want to check if a goal object is inside the microwave. The text indicates that there are no cupcakes in the cabinets, but there is a cupcake inside the microwave. To confidently choose the correct answer, a model must fuse relevant information from both the video and the text.
  • Figure 2: Question types in MMToM-QA, with examples. Questions fall into two broad categories, Belief and Goal, with several different question types in each category that span a range of mental reasoning. Each example shows only a few frames and snippets. The options in green italic font are the correct answers. Note that we simplify the text in the examples for brevity. We provide the full text and the video links in Appendix \ref{sec:app_examples}.
  • Figure 3: Overview of our model, BIP-ALM. For visual, linguistic, and fused information, we show examples of the symbolic representations of states ($s^{1:t}$), actions ($a^{1:t}$), and the two hypotheses about the person's goal ($g_1$ and $g_2$) and belief ($b_1^t$ and $b_2^t$) for a question asked at time step $t$.
  • Figure 4: Overall human and model performance in the three conditions. The dashed line shows the chance level.
  • Figure 5: Examples of how BIP-ALM evaluates the likelihood of different hypotheses via action likelihood estimation from the language model. The results here are based on BIP-ALM with finetuned LLaMA 2. The green option in each example is the correct answer, and BIP-ALM selects the correct answer in both cases. The blue panels show the likelihood ratio estimated by the language model at a certain step for each example, explaining how BIP-ALM can reach the correct conclusion by conducting inverse planning via a language model. (A) Elizabeth is more likely to open the microwave if she believes that there is a water glass inside it and she wants to get a water glass, even though there is no water glass inside the microwave (i.e., she has a false belief). (B) The likelihood of a hypothesis changes as the model observes more actions.
  • ...and 6 more figures
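
The captions for Figures 3 and 5 describe scoring competing goal and belief hypotheses by how likely the observed actions are under each one. The Python sketch below illustrates that idea only. It is not the authors' implementation. Everything named here is a hypothetical stand-in: the function `hypothesis_posterior`, the table `TOY_ACTION_LIKELIHOOD`, and the hypothesis labels are invented for illustration, the probabilities are made-up numbers rather than results from the paper, and the language model is replaced by a lookup table. State transitions and belief updates over time are omitted.

```python
import math
from typing import Callable, Dict, Hashable, Optional, Sequence


def hypothesis_posterior(
    hypotheses: Sequence[Hashable],
    actions: Sequence[str],
    action_likelihood: Callable[[str, Hashable], float],
    prior: Optional[Dict[Hashable, float]] = None,
) -> Dict[Hashable, float]:
    """Return P(h | actions), proportional to P(h) * prod_a pi(a | h).

    `action_likelihood(a, h)` plays the role of the policy term that a
    language model would estimate; here it can be any callable.
    """
    log_scores = {}
    for h in hypotheses:
        # Uniform prior unless one is supplied.
        p0 = prior[h] if prior is not None else 1.0 / len(hypotheses)
        log_score = math.log(p0)
        for a in actions:
            # Clamp to avoid log(0) for actions judged impossible.
            log_score += math.log(max(action_likelihood(a, h), 1e-12))
        log_scores[h] = log_score

    # Normalize in log space (log-sum-exp) for numerical stability.
    m = max(log_scores.values())
    z = sum(math.exp(v - m) for v in log_scores.values())
    return {h: math.exp(v - m) / z for h, v in log_scores.items()}


# Hypothetical labels and made-up probabilities (NOT from the paper).
H_GLASS = "believes a water glass is in the microwave"
H_NO_GLASS = "believes no water glass is in the microwave"

TOY_ACTION_LIKELIHOOD = {
    ("walk to kitchen", H_GLASS): 0.5,
    ("walk to kitchen", H_NO_GLASS): 0.5,
    ("open microwave", H_GLASS): 0.6,
    ("open microwave", H_NO_GLASS): 0.1,
}


def toy_policy(action: str, hypothesis: Hashable) -> float:
    """Stand-in for a language model's action likelihood estimate."""
    return TOY_ACTION_LIKELIHOOD[(action, hypothesis)]


if __name__ == "__main__":
    hyps = [H_GLASS, H_NO_GLASS]
    # The posterior shifts once the informative action is observed.
    print(hypothesis_posterior(hyps, ["walk to kitchen"], toy_policy))
    print(hypothesis_posterior(
        hyps, ["walk to kitchen", "open microwave"], toy_policy))
```

With these toy numbers, the first action does not separate the two hypotheses, while opening the microwave moves most of the probability mass to the hypothesis under which that action makes sense. This mirrors the caption's point that the likelihood of a hypothesis changes as more actions are observed.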