Social World Models

Xuhui Zhou, Jiarui Liu, Akhila Yerukola, Hyunwoo Kim, Maarten Sap

TL;DR

This work defines Social World Models (SWMs) and introduces S^3AP, a structured representation for social world states that encodes environment state, agent observations, actions, and mental states. By parsing free-text narratives into S^3AP and inducing SWMs, the approach achieves state-of-the-art performance on static ToM benchmarks and improves interactive social reasoning in SOTOPIA-hard tasks, including clear gains when integrating SWMs with first-person decision making. The key findings show that explicit modeling of hidden mental states and the use of structured representations yield substantial performance gains across diverse, multi-agent scenarios, while revealing that parser quality and action integration are critical for success. The work argues for a foundational, general-purpose representation to enable more socially-aware AI, while acknowledging challenges in parsing faithfulness, long-horizon dynamics, computation, and safety considerations for real-world deployment.

Abstract

Humans intuitively navigate social interactions by simulating unspoken dynamics and reasoning about others' perspectives, even with limited information. In contrast, AI systems struggle to structure and reason about implicit social contexts, as they lack explicit representations for unobserved dynamics such as intentions, beliefs, and evolving social states. In this paper, we introduce the concept of social world models (SWMs) to characterize complex social dynamics. To operationalize SWMs, we introduce a novel structured social world representation formalism (S^3AP), which captures the evolving states, actions, and mental states of agents, addressing the lack of explicit structure in traditional free-text-based inputs. Through comprehensive experiments across five social reasoning benchmarks, we show that S^3AP significantly enhances LLM performance, achieving a +51% improvement on FANToM over OpenAI's o1. Our ablations further reveal that these gains are driven by the explicit modeling of hidden mental states, which proves more effective than a wide range of baseline methods. Finally, we introduce an algorithm for social world models using S^3AP, which enables AI agents to build models of their interlocutors and predict their next actions and mental states. Empirically, S^3AP-enabled social world models yield up to +18% improvement on the SOTOPIA multi-turn social interaction benchmark. Our findings highlight the promise of S^3AP as a powerful, general-purpose representation for social world states, enabling the development of more socially-aware systems that better navigate social interactions.
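
As a rough illustration of what a structured social world state might look like, the sketch below models one timestep with the four components the abstract names: environment state, per-agent observations, per-agent actions, and per-agent mental states. The class name `SocialWorldState`, its field names, the helper `observed_history`, and the Sally-Anne style toy narrative are all assumptions made for illustration; they are not the paper's schema or parser output.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SocialWorldState:
    """One timestep of a hypothetical structured social world record.

    Field names are illustrative assumptions, not the paper's schema.
    """
    state: str  # overall environment state
    observations: Dict[str, str] = field(default_factory=dict)   # agent -> what it perceives
    actions: Dict[str, str] = field(default_factory=dict)        # agent -> what it does
    mental_states: Dict[str, str] = field(default_factory=dict)  # agent -> belief or intention


def observed_history(trajectory: List[SocialWorldState], agent: str) -> List[str]:
    """Return, in order, only the observations that were available to `agent`."""
    return [s.observations[agent] for s in trajectory if agent in s.observations]


# Toy narrative: Sally leaves the room, then Anne moves the ball.
trajectory = [
    SocialWorldState(
        state="The ball is in the basket.",
        observations={
            "Sally": "The ball is in the basket.",
            "Anne": "The ball is in the basket.",
        },
        actions={"Sally": "leaves the room", "Anne": "none"},
        mental_states={
            "Sally": "believes the ball is in the basket",
            "Anne": "believes the ball is in the basket",
        },
    ),
    SocialWorldState(
        state="The ball is in the box.",
        observations={"Anne": "The ball is in the box."},  # Sally is absent
        actions={"Anne": "moves the ball to the box"},
        mental_states={
            "Sally": "believes the ball is in the basket",
            "Anne": "believes the ball is in the box",
        },
    ),
]

sally_view = observed_history(trajectory, "Sally")
anne_view = observed_history(trajectory, "Anne")
```

Keeping observations separate per agent is what lets a downstream model answer belief questions from an agent's own view: here Sally's history never includes the ball being moved, so her recorded belief stays at the basket even though the environment state has changed.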

Paper Structure

This paper contains 51 sections, 2 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: A world model that only tracks the physical state of the world (left) and a social world model that tracks the physical state of the world and the mental states of other agents (right).
  • Figure 2: An example of a free-form narrative parsed into S$^3$AP. The highlighted text is transformed into the S$^3$AP representation, which has a state field tracking the overall environment state, plus the observations and actions of each agent.
  • Figure 3: Performance of different models on ParaToMi using S$^3$AP representations generated by various models. Numbers in parentheses show performance change.
  • Figure 4: Illustrative example of social context parsing failure from error analysis.
  • Figure 5: Number of LLM calls vs. accuracy for ToM methods on ParaToMi with GPT-4o (used as both the parser and the QA model). The baselines include specialized ToM methods (TT, AutoToM) and agentic frameworks (AFlow, LLM-Debate). S$^3$AP achieves the highest accuracy with a low number of LLM calls.
  • ...and 5 more figures