MuMA-ToM: Multi-modal Multi-Agent Theory of Mind

Haojun Shi, Suyu Ye, Xinyu Fang, Chuanyang Jin, Leyla Isik, Yen-Ling Kuo, Tianmin Shu

TL;DR

MuMA-ToM introduces the first multi-modal, multi-agent Theory of Mind benchmark, combining video and text to probe beliefs, social goals, and beliefs about others' goals in embodied household interactions. It jointly evaluates humans and a spectrum of baselines, revealing a substantial gap between current large multimodal systems and human ToM capabilities. The authors propose LIMP, a language model-based inverse multi-agent planning approach that uses natural language representations and multimodal fusion to perform multi-agent ToM reasoning, achieving a substantial performance lead over baselines. While promising, the work also acknowledges remaining gaps and outlines future extensions to more complex, real-world, and multi-agent scenarios.

Abstract

Understanding people's social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal -- we can watch people's actions, hear their conversations, and/or read about their past behaviors. For AI systems to successfully and safely interact with people in real-world environments, they also need to understand people's mental states as well as their inferences about each other's mental states based on multi-modal information about their interactions. For this, we introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. MuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates mental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide video and text descriptions of people's multi-modal behavior in realistic household environments. Based on the context, we then ask questions about people's goals, beliefs, and beliefs about others' goals. We validated MuMA-ToM in a human experiment and provided a human baseline. We also proposed a novel multi-modal, multi-agent ToM model, LIMP (Language model-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini-1.5 Pro) and a recent multi-modal ToM model, BIP-ALM.

Paper Structure (44 sections, 1 equation, 11 figures, 3 tables)

Figures (11)

  • Figure 1: Example questions for each question type. We provide keyframes for the video in each example. The conversations in the chat bubbles are provided as subtitles and shown as part of the multi-modal inputs when viewing the video. Note that the captions on the bottom of the frames are for illustrative purposes only and are not shown in the videos. The checkmarks indicate the correct answers. We provide the videos and text for the examples in the supplementary material.
  • Figure 2: Overview of LIMP. LIMP has three components: (1) the multi-modal information fusion module extracts and fuses information from vision and text; (2) the hypothesis parsing module generates hypothetical values for the three mental variables given the question and the fused information; and (3) the inverse multi-agent planning module assesses the probabilities of each option given the hypothetical mental variables and the multi-modal agent behavior described in the fused information.
  • Figure 3: Illustration of the multi-modal information fusion in LIMP. It fills in missing information based on the context and recovers the initial state from agents' actions.
  • Figure 4: Illustration for inverse multi-agent planning. We estimate the action and utterance likelihood of agent $i$ at each step $t$ given the past actions and utterances of both agents from step 0 to step $t-1$, the initial state $s^0$, and the hypothesis $H$. LL in the figure stands for likelihood.
  • Figure 5: Human and model performance on MuMA-ToM.
  • ...and 6 more figures
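
The Figure 4 caption describes scoring a hypothesis $H$ by the likelihood of each observed action and utterance at every step. A minimal Python sketch of how such per-step likelihoods could be turned into probabilities over answer options follows. It assumes the per-step log-likelihoods are already available (in LIMP they would be estimated by a language model; here they are placeholder numbers), and it assumes they are combined by summing log-likelihoods under a uniform prior, which the captions do not confirm. The function name `posterior_over_hypotheses` and the labels `"A"` and `"B"` are hypothetical.

```python
import math


def posterior_over_hypotheses(step_log_likelihoods, log_priors=None):
    """Normalize summed per-step log-likelihoods into a distribution.

    step_log_likelihoods: dict mapping a hypothesis label to a list of
        log-likelihoods, one per observed action or utterance.
    log_priors: optional dict of log prior per hypothesis
        (assumed uniform when omitted).
    """
    labels = list(step_log_likelihoods)
    if log_priors is None:
        log_priors = {h: 0.0 for h in labels}

    # Unnormalized log score: log prior plus the sum over steps.
    scores = {h: log_priors[h] + sum(step_log_likelihoods[h]) for h in labels}

    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    total = sum(math.exp(s - top) for s in scores.values())
    return {h: math.exp(scores[h] - top) / total for h in labels}


# Placeholder likelihoods for two answer options over two steps
# (illustrative numbers only, not taken from the paper).
example = {
    "A": [math.log(0.8), math.log(0.5)],
    "B": [math.log(0.2), math.log(0.5)],
}
posterior = posterior_over_hypotheses(example)
print(posterior)  # roughly {'A': 0.8, 'B': 0.2}
```

Working in log space keeps long interaction sequences from underflowing, and the option with the highest resulting probability would be chosen as the answer.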