
HiVAE: Hierarchical Latent Variables for Scalable Theory of Mind

Nigel Doering, Rahath Malladi, Arshia Sangwan, David Danks, Tauhidur Rahman

TL;DR

This paper introduces HiVAE, a hierarchical variational architecture that scales ToM reasoning to realistic spatiotemporal domains, and identifies a critical limitation: although the hierarchical structure improves prediction, the learned latent representations lack explicit grounding in actual mental states.

Abstract

Theory of mind (ToM) enables AI systems to infer agents' hidden goals and mental states, but existing approaches focus mainly on small, human-understandable gridworld environments. We introduce HiVAE, a hierarchical variational architecture that scales ToM reasoning to realistic spatiotemporal domains. Inspired by the belief-desire-intention structure of human cognition, our three-level VAE hierarchy achieves substantial performance improvements on a 3,185-node campus navigation task. However, we identify a critical limitation: although the hierarchical structure improves prediction, the learned latent representations lack explicit grounding in actual mental states. We propose self-supervised alignment strategies and present this work to solicit community feedback on grounding approaches.

Paper Structure (11 sections, 6 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: HiVAE first encodes the agent’s partial trajectory and the environment graph into a unified latent representation. This representation feeds a hierarchical mind-state module, sequentially inferring beliefs, desires, and intentions, which then drives the goal predictor to output a probability distribution over all possible goals.
  • Figure 2: Overall performance of models on goal prediction averaged across all trajectories. Lower is better.
  • Figure 3: Probability assigned to the false goal on the pedestrian dataset at different path-completion percentages leading up to the false goal (Experiment 2). Lower is better.
  • Figure 4: Overall performance of models on goal prediction averaged across all trajectories for both original and new pedestrian datasets. Lower is better.
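
The pipeline described in the Figure 1 caption (a fused encoding, then beliefs, desires, and intentions inferred in sequence, then a goal distribution) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the layer sizes, the random untrained weights, the exact conditioning of each level on the previous one, and every name (`gaussian_level`, `forward`, `params`, `H`, `Z`, `NUM_GOALS`) are assumptions made for illustration only. The vector `h` stands in for the unified representation of the partial trajectory and environment graph.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for this sketch.
H, Z, NUM_GOALS = 16, 4, 10


def gaussian_level(inputs, w_mu, w_logvar, rng):
    """One VAE level: map inputs to a diagonal Gaussian and sample from it."""
    mu = inputs @ w_mu
    logvar = inputs @ w_logvar
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps  # reparameterization trick
    return z, mu, logvar


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def init(shape):
    """Small random weights; a real model would learn these."""
    return 0.1 * rng.standard_normal(shape)


params = {
    "belief": (init((H, Z)), init((H, Z))),
    "desire": (init((H + Z, Z)), init((H + Z, Z))),
    "intention": (init((H + Z, Z)), init((H + Z, Z))),
    "goal": init((3 * Z, NUM_GOALS)),
}


def forward(h, params, rng):
    """Assumed chain: belief -> desire -> intention -> goal distribution."""
    z_b, mu_b, lv_b = gaussian_level(h, *params["belief"], rng)
    z_d, mu_d, lv_d = gaussian_level(
        np.concatenate([h, z_b], axis=-1), *params["desire"], rng
    )
    z_i, mu_i, lv_i = gaussian_level(
        np.concatenate([h, z_d], axis=-1), *params["intention"], rng
    )
    logits = np.concatenate([z_b, z_d, z_i], axis=-1) @ params["goal"]
    kl = (
        kl_to_standard_normal(mu_b, lv_b)
        + kl_to_standard_normal(mu_d, lv_d)
        + kl_to_standard_normal(mu_i, lv_i)
    )
    return softmax(logits), kl


# Stand-in for the fused trajectory/graph encoding of two agents.
h = rng.standard_normal((2, H))
probs, kl = forward(h, params, rng)
```

Here `probs` has one row per agent and one column per candidate goal, and `kl` is the summed regularization term a VAE objective would typically penalize. Nothing in this sketch ties `z_b`, `z_d`, or `z_i` to actual beliefs, desires, or intentions; the names are labels only, which is the grounding limitation the abstract raises.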