Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models

Siqi Liu, Xinyang Li, Bochao Zou, Junbao Zhuo, Huimin Ma, Jiansheng Chen

Abstract

As large language models (LLMs) continue to advance, there is increasing interest in their ability to infer human mental states and demonstrate a human-like Theory of Mind (ToM). Most existing ToM evaluations, however, are centered on text-based inputs, while scenarios relying solely on visual information receive far less attention. This leaves a gap, since real-world human-AI interaction typically requires multimodal understanding. In addition, many current methods treat the model as a black box and rarely probe how its internal attention behaves in multiple-choice question answering (QA). The impact of LLM hallucinations on such tasks is also underexplored from an interpretability perspective. To address these issues, we introduce VisionToM, a vision-oriented intervention framework designed to strengthen task-aware reasoning. The core idea is to compute intervention vectors that align visual representations with the correct semantic targets, thereby steering the model's attention across different layers of visual features. This guidance reduces the model's reliance on spurious linguistic priors, leading to more reliable multimodal large language model (MLLM) outputs and better QA performance. Experiments on the EgoToM benchmark, an egocentric, real-world video dataset for ToM with three multiple-choice QA settings, demonstrate that our method substantially improves the ToM abilities of MLLMs. Furthermore, results on an additional open-ended generation task show that VisionToM enables MLLMs to produce free-form explanations that more accurately capture agents' mental states, pushing human-machine collaboration toward closer alignment.
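
The abstract describes the mechanism only at a high level: probe-derived intervention vectors are added to task-sensitive attention heads at inference time so that visual evidence, rather than linguistic priors, drives the answer. As a rough illustration, here is a minimal sketch of head-level activation steering in PyTorch; the function name, module path, and tensor shapes are assumptions for illustration, not the authors' implementation.

```python
import torch

def make_head_steering_hook(head_idx, direction, alpha, num_heads):
    """Forward pre-hook for an attention output projection (e.g. o_proj):
    shifts one attention head's output along a probe-derived direction
    before the heads are mixed back into the residual stream.
    All shapes here are assumptions for illustration."""
    direction = direction / direction.norm()  # unit vector, shape (head_dim,)

    def hook(module, args):
        (x,) = args                # x: (batch, seq, hidden), head outputs concatenated
        b, s, h = x.shape
        head_dim = h // num_heads
        heads = x.view(b, s, num_heads, head_dim).clone()
        heads[:, :, head_idx, :] += alpha * direction.to(x.dtype)  # steer one head
        return (heads.view(b, s, h),)

    return hook

# Usage sketch (hypothetical module path for a LLaMA-style decoder):
# layer = model.model.layers[14].self_attn.o_proj
# handle = layer.register_forward_pre_hook(
#     make_head_steering_hook(head_idx=7, direction=probe_dir, alpha=5.0, num_heads=32))
# ... run generation with the intervention active ...
# handle.remove()
```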

Paper Structure

This paper contains 35 sections, 6 equations, 13 figures, and 5 tables.

Figures (13)

  • Figure 1: (A) ToM causal model [li2025egotom]. (B) An overview of our method: MLLMs' visual reasoning with VisionToM intervention on the EgoToM benchmark. Given an egocentric video and a ToM question (e.g., "What is C's future goal?"), an MLLM may produce an incorrect answer based on its default attention. VisionToM extracts representations from the MLLM for visual attention and ToM reasoning, identifies attention heads sensitive to visual input and task-specific reasoning, and performs targeted interventions on these heads. This process guides the model toward accurate, goal-consistent inferences aligned with ToM reasoning.
  • Figure 2: An overview of our method: we extract internal MLLM representations along both visual and textual dimensions and identify attention heads that are sensitive to visual inputs and task reasoning. During inference, we then apply targeted interventions to these sensitive attention heads to enhance the MLLMs' truthfulness.
  • Figure 3: (A) Linear-probing accuracy for every head and layer of LLaVA-Next-Video in the visual-attention stage, incorporating internal representations from all three tasks. Darker green indicates higher accuracy, with 50% marked as the chance baseline. (B) Linear-probing validation accuracy for every head and layer of LLaVA-Next-Video in the ToM-reasoning stage, incorporating internal representations from all three tasks. (C) Kernel density estimate (KDE) of LLaVA-Next-Video's visual-attention activations, projected onto the first two "true" directions, showing the distributions for true (green) and false (orange) sample pairs. Marginal distributions are plotted along the top and right axes. (D) Principal component analysis (PCA) plot of LLaVA-Next-Video's internal representations in the ToM-reasoning stage. (A sketch of how such per-head probing maps can be computed follows this list.)
  • Figure 5: Probe validation accuracies for the three EgoToM tasks, based on activations from each attention head across all layers of LLaVA-Next-Video-7B. Subfigures (A)–(C) correspond to the ToM reasoning stage, showing accuracies for the (A) goal prediction, (B) belief inference, and (C) action inference tasks, respectively. Subfigures (D)–(F) correspond to the visual attention stage, showing the same tasks in the order: (D) goal prediction, (E) belief inference, and (F) action inference. Darker shades indicate higher probing accuracy, suggesting stronger task-relevant signals in specific heads and layers.
  • Figure 6: Probe validation accuracies for the three EgoToM tasks, based on activations from each attention head across all layers of Qwen2.5-VL-7B. Subfigures (A)–(C) correspond to the ToM reasoning stage, showing accuracies for the (A) goal prediction, (B) belief inference, and (C) action inference tasks, respectively. Subfigures (D)–(F) correspond to the visual attention stage, showing the same tasks in the order: (D) goal prediction, (E) belief inference, and (F) action inference. Darker shades indicate higher probing accuracy, suggesting stronger task-relevant signals in specific heads and layers.
  • ...and 8 more figures
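
Figures 3, 5, and 6 all report per-head linear-probing accuracy maps. For readers unfamiliar with the technique, the sketch below shows how such a layer-by-head accuracy grid can be computed; the data, shapes, and variable names are synthetic placeholders, not the paper's actual activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Per-(layer, head) linear probing: fit a classifier that separates
# activations of true vs. false answer pairs and record its validation
# accuracy. Random data here, so accuracies hover near the 50% baseline.
rng = np.random.default_rng(0)
n_layers, n_heads, head_dim, n_samples = 4, 8, 16, 200
acts = rng.normal(size=(n_layers, n_heads, n_samples, head_dim))
labels = rng.integers(0, 2, size=n_samples)

acc = np.zeros((n_layers, n_heads))
for l in range(n_layers):
    for h in range(n_heads):
        X_tr, X_va, y_tr, y_va = train_test_split(
            acts[l, h], labels, test_size=0.3, random_state=0)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        acc[l, h] = probe.score(X_va, y_va)  # one cell of the heatmap

print(acc.round(2))  # rendered as the darker-is-better grids in the figures

# Projecting activations of the strongest probe onto its weight vector
# gives a 1-D view like the KDE over "true" directions in Figure 3 (C).
l_best, h_best = np.unravel_index(acc.argmax(), acc.shape)
w = LogisticRegression(max_iter=1000).fit(acts[l_best, h_best], labels).coef_[0]
proj = acts[l_best, h_best] @ w / np.linalg.norm(w)
```

The probe weights double as candidate intervention directions: heads whose probes generalize well are the ones a VisionToM-style intervention would target.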