Grounding Social Perception in Intuitive Physics

Lance Ying, Aydan Y. Huang, Aviv Netanyahu, Andrei Barbu, Boris Katz, Joshua B. Tenenbaum, Tianmin Shu

Abstract

People infer rich social information from others' actions. These inferences are often constrained by the physical world: what agents can do, what obstacles permit, and how the physical actions of agents causally change an environment and other agents' mental states and behavior. We propose that such rich social perception is not mere visual pattern matching, but rather a reasoning process grounded in an integration of intuitive psychology with intuitive physics. To test this hypothesis, we introduce PHASE (PHysically grounded Abstract Social Events), a large dataset of procedurally generated animations depicting physically simulated two-agent interactions on a 2D surface. Each animation follows the style of the Heider and Simmel movie, with systematic variation in environment geometry, object dynamics, agent capacities, goals, and relationships (friendly/adversarial/neutral). We then present SIMPLE, a physics-grounded Bayesian inverse planning model that integrates probabilistic planning and physics simulation to infer agents' goals and relations from their trajectories. Our experimental results show that SIMPLE achieves high accuracy and agreement with human judgments across diverse scenarios, while feedforward baseline models -- including strong vision-language models -- and physics-agnostic inverse planning fail to achieve human-level performance and do not align with human judgments. These results suggest that our model provides a computational account for how people understand physically grounded social scenes by inverting a generative model of physics and agents.

Paper Structure

This paper contains 33 sections, 8 equations, 12 figures, 2 tables, 1 algorithm.

Figures (12)

  • Figure 1: (A) Examples of real-life social interactions in physical environments: a basketball player trying to block an opponent from shooting a ball; two people carrying a couch together. (B) The classic Heider-Simmel animation abstracts such real-life interactions in animated displays of simple geometric shapes. (C) PHASE models social interactions as physically grounded abstract events, where animated agents with limited fields of view move in a physics-based environment with objects, landmarks, and obstacles, enabling behaviors such as helping, hindering, or collaborating toward goals.
  • Figure 2: Example PHASE animations for different scenario types. (A) In this collaborative scenario, agents collaborate to move the pink circle to the yellow landmark. The green agent is too large to move past the barrier. Therefore, the red agent first moves the circle to the left of the barrier and then carries it together with the green agent to the yellow landmark. (B) Two agents move towards different landmarks with no interaction. (C) In this competing scenario, the agents want to move the blue circle to different landmarks, but there is only one blue circle. They therefore pull the circle in different directions. (D) In this opposing scenario, the red agent tries to chase the green agent, while the green agent tries to get away from the red agent. The videos for these examples can be found at https://osf.io/fkp5m/.
  • Figure 3: Consistency of human responses: the percentage of videos assigned the same interaction category by at least 50% of the participants who watched them.
  • Figure 4: (A) Pseudo code for the SIMPLE model. (B) Illustration of the key model components of SIMPLE. (C) Illustration of the joint physical and social simulator in SIMPLE.
  • Figure 5: Accuracy results for the goal classification task across all 100 scenarios, grouped by 4 distinct scenario goal types. The number of goal judgments is shown in brackets. Humans and models are evaluated on 100 test videos, each with two goal classification tasks and one relation classification task. Error bars show 95% confidence intervals from 1000 bootstrapped samples.
  • ...and 7 more figures