Explicit Modelling of Theory of Mind for Belief Prediction in Nonverbal Social Interactions

Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling

TL;DR

MToMnet addresses the challenge of predicting beliefs and their dynamics from multimodal nonverbal cues by explicitly modelling Theory of Mind (ToM). It builds two parallel MindNets that separately encode each agent's cues while sharing contextual feature extractors, enabling triadic person–context reasoning; three ToM variants (DB-MToMnet, IC-MToMnet, CG-MToMnet) explore different ways of integrating mind representations. Across the BOSS and TBD datasets, CG-MToMnet achieves state-of-the-art belief and belief-dynamics prediction with substantially fewer parameters than prior methods, and explicit ToM modelling improves results by large margins (up to ~60% on TBD). The work demonstrates that ToM-inspired architectural choices can yield robust, efficient prediction of human beliefs from nonverbal cues and can be extended to multi-agent interactions. It also points to future benchmark improvements and raises ethical considerations.

Abstract

We propose MToMnet, a Theory of Mind (ToM) neural network for predicting beliefs and their dynamics during human social interactions from multimodal input. ToM is key for effective nonverbal human communication and collaboration, yet existing methods for belief modelling have not included explicit ToM modelling or have typically been limited to one or two modalities. MToMnet encodes contextual cues (scene videos and object locations) and integrates them with person-specific cues (human gaze and body language) in a separate MindNet for each person. Inspired by prior research on social cognition and computational ToM, we propose three different MToMnet variants: two involving fusion of latent representations and one involving re-ranking of classification scores. We evaluate our approach on two challenging real-world datasets, one focusing on belief prediction and the other on belief dynamics prediction. Our results demonstrate that MToMnet surpasses existing methods by a large margin while requiring significantly fewer parameters. Taken together, our method opens up a highly promising direction for future work on artificially intelligent systems that can robustly predict human beliefs from their nonverbal behaviour and, as such, collaborate more effectively with humans.
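
The data flow described in the abstract can be illustrated with a toy sketch: a shared context encoder, one MindNet per person, and two ways of letting the two minds inform each other (latent fusion and score re-ranking). This is not the authors' implementation. The class and function names, the layer sizes, the averaging fusion, and the convex-mix re-ranking rule are all assumptions made for illustration, and the weights are random rather than trained.

```python
import math
import random


def linear(x, w):
    """Multiply vector x (length n) by weight matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def tanh_vec(x):
    return [math.tanh(v) for v in x]


def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rng, n, m):
    return [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(n)]


class MindNet:
    """Hypothetical per-person module: (context, person cues) -> latent h -> belief scores."""

    def __init__(self, rng, ctx_dim, cue_dim, hid_dim, n_beliefs):
        self.w_in = rand_matrix(rng, ctx_dim + cue_dim, hid_dim)
        self.w_out = rand_matrix(rng, hid_dim, n_beliefs)

    def encode(self, ctx, cues):
        # Concatenate shared context features with this person's cues.
        return tanh_vec(linear(ctx + cues, self.w_in))

    def classify(self, h):
        return softmax(linear(h, self.w_out))


def fuse_latents(h_self, h_other):
    # Assumed fusion rule: element-wise average of the two latent vectors.
    return [(a + b) / 2 for a, b in zip(h_self, h_other)]


def rerank_scores(p_self, p_other, alpha=0.5):
    # Assumed re-ranking rule: convex mix of the two score vectors.
    return [(1 - alpha) * a + alpha * b for a, b in zip(p_self, p_other)]


def forward(scene, cues_1, cues_2, mode="latent", seed=0):
    rng = random.Random(seed)
    ctx_dim, hid_dim, n_beliefs = 4, 6, 3  # arbitrary toy sizes
    # Shared contextual encoder, used by both MindNets.
    ctx = tanh_vec(linear(scene, rand_matrix(rng, len(scene), ctx_dim)))
    mind_1 = MindNet(rng, ctx_dim, len(cues_1), hid_dim, n_beliefs)
    mind_2 = MindNet(rng, ctx_dim, len(cues_2), hid_dim, n_beliefs)
    h1 = mind_1.encode(ctx, cues_1)
    h2 = mind_2.encode(ctx, cues_2)
    if mode == "latent":
        # Exchange information at the latent level, then classify.
        p1 = mind_1.classify(fuse_latents(h1, h2))
        p2 = mind_2.classify(fuse_latents(h2, h1))
    else:
        # Classify independently, then re-rank each person's scores.
        s1, s2 = mind_1.classify(h1), mind_2.classify(h2)
        p1, p2 = rerank_scores(s1, s2), rerank_scores(s2, s1)
    return p1, p2


if __name__ == "__main__":
    scene = [0.1, 0.2, 0.3]           # stand-in for scene/object features
    gaze_pose_1 = [0.5, -0.5]         # stand-in for person 1 gaze and pose
    gaze_pose_2 = [0.0, 1.0]          # stand-in for person 2 gaze and pose
    print(forward(scene, gaze_pose_1, gaze_pose_2, mode="latent"))
    print(forward(scene, gaze_pose_1, gaze_pose_2, mode="rerank"))
```

In both modes each person gets a probability distribution over belief classes. The difference lies only in where the other person's information enters: before classification (latent fusion) or after it (re-ranking).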

Paper Structure

This paper contains 34 sections, 9 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Examples from BOSS [duan2022boss] and TBD [fan2021learning]. BOSS includes third-person video frames, bounding boxes (top left), 3D gaze (top centre) and body pose (top right). TBD includes third-person (bottom left) and first-person (bottom centre) video frames, 2D gaze (bottom centre, pink dot) and body pose (bottom right).
  • Figure 2: Accuracy for belief prediction on BOSS for our MToMnet models and baselines [duan2022boss] using all input modalities. Scores significantly different from CG$\parallel$-MToMnet according to a paired t-test ($p<0.05$) are marked with a *.
  • Figure 3: Modality ablation study for BOSS.
  • Figure 4: Modality ablation study for TBD.
  • Figure 5: Example PCA results for $\bm{h}_1$ and $\bm{h}_2$ from CG$\parallel$-MToMnet, taken from the BOSS and TBD test sets.
  • ...and 3 more figures