Modeling Others' Minds as Code

Kunal Jha, Aydan Yuenan Huang, Eric Ye, Natasha Jaques, Max Kleiman-Weiner

TL;DR

This work reframes modeling others' minds as a program synthesis problem, introducing ROTE, which uses LLMs to generate executable Python representations of observed behaviors and Bayesian inference via Sequential Monte Carlo to infer the most plausible scripts. By treating action understanding as a code-generation and inference task, ROTE achieves superior generalization and efficiency, outperforming behavior cloning and LLM-based baselines by up to 50% in gridworld and embodied household scenarios, and achieving human-level accuracy on human data. The approach offers a scalable, interpretable pathway for predicting human and AI behavior in real-world settings, with implications for safer and more adaptable human-AI collaboration.

Abstract

Accurate prediction of human behavior is essential for robust and safe human-AI collaboration. However, existing approaches for modeling people are often data-hungry and brittle because they either make unrealistic assumptions about rationality or are too computationally demanding to adapt rapidly. Our key insight is that many everyday social interactions may follow predictable patterns: efficient "scripts" that minimize cognitive load for actors and observers, e.g., "wait for the green light, then go." We propose modeling these routines as behavioral programs instantiated in computer code rather than as policies conditioned on beliefs and desires. We introduce ROTE, a novel algorithm that leverages both large language models (LLMs) for synthesizing a hypothesis space of behavioral programs and probabilistic inference for reasoning about uncertainty over that space. We test ROTE in a suite of gridworld tasks and a large-scale embodied household simulator. ROTE predicts human and AI behaviors from sparse observations, outperforming competitive baselines (including behavior cloning and LLM-based methods) by as much as 50% in terms of in-sample accuracy and out-of-sample generalization. By treating action understanding as a program synthesis problem, ROTE opens a path for AI systems to efficiently and effectively predict human behavior in the real world.
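
To make the idea of weighting candidate behavioral programs concrete, the following is a minimal sketch, not ROTE's actual implementation. It assumes a fixed, hand-written set of three candidate scripts (standing in for LLM-generated programs), a hypothetical `EPSILON` noise model for deviations from a script, and a toy traffic-light observation format. It performs exact Bayesian reweighting over that fixed set and omits the resampling and program-proposal steps of a full Sequential Monte Carlo procedure.

```python
from typing import Callable, Dict, List, Tuple

# All names below are illustrative assumptions, not taken from the paper.
Observation = Dict[str, str]
Program = Callable[[Observation], str]

ACTIONS = ["wait", "go"]
EPSILON = 0.1  # assumed probability that the agent deviates from its script


def wait_for_green(obs: Observation) -> str:
    """Script: 'wait for the green light, then go.'"""
    return "go" if obs["light"] == "green" else "wait"


def always_go(obs: Observation) -> str:
    return "go"


def always_wait(obs: Observation) -> str:
    return "wait"


PROGRAMS: Dict[str, Program] = {
    "wait_for_green": wait_for_green,
    "always_go": always_go,
    "always_wait": always_wait,
}


def likelihood(program: Program, obs: Observation, action: str) -> float:
    """P(action | program, obs) under an epsilon-noisy execution model."""
    if program(obs) == action:
        return 1.0 - EPSILON
    return EPSILON / (len(ACTIONS) - 1)


def update_weights(weights: Dict[str, float], obs: Observation,
                   action: str) -> Dict[str, float]:
    """One Bayesian update: multiply by the likelihood, then normalize."""
    unnormalized = {
        name: weights[name] * likelihood(program, obs, action)
        for name, program in PROGRAMS.items()
    }
    total = sum(unnormalized.values())
    return {name: w / total for name, w in unnormalized.items()}


def infer(trajectory: List[Tuple[Observation, str]]) -> Dict[str, float]:
    """Start from a uniform prior and fold in each observed step."""
    weights = {name: 1.0 / len(PROGRAMS) for name in PROGRAMS}
    for obs, action in trajectory:
        weights = update_weights(weights, obs, action)
    return weights


def predict_action(weights: Dict[str, float], obs: Observation) -> str:
    """Predict the action with the highest posterior-weighted probability."""
    scores = {a: 0.0 for a in ACTIONS}
    for name, program in PROGRAMS.items():
        for a in ACTIONS:
            scores[a] += weights[name] * likelihood(program, obs, a)
    return max(scores, key=lambda a: scores[a])


TRAJECTORY = [
    ({"light": "red"}, "wait"),
    ({"light": "green"}, "go"),
    ({"light": "red"}, "wait"),
]
POSTERIOR = infer(TRAJECTORY)
```

After three observed steps, most of the posterior mass in this toy setup falls on `wait_for_green`, and `predict_action(POSTERIOR, {"light": "green"})` can then be used to anticipate the next move. Because each hypothesis is an ordinary function, prediction costs one function call per program rather than a planning rollout, which is the efficiency argument the abstract makes for code-based representations.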

Paper Structure

This paper contains 24 sections, 1 equation, 13 figures, 3 tables, 1 algorithm.

Figures (13)

  • Figure 1: Comparison of action prediction methods: Behavior cloning requires large datasets and has limited generalization, while inverse planning is computationally expensive at test time. Our approach, ROTE, uses LLMs to generate efficient and interpretable code representations of observed behavior, providing a superior balance of efficiency and accuracy.
  • Figure 2: Overview of ROTE. ROTE predicts an agent's next action by generating and weighting Python programs that explain its observed behavior. From $t=0$ to $t=7$, ROTE observes a blue robot's trajectory. Initially, at $t=1$, programs related to moving to the dining room are up-weighted. However, at $t=3$, the robot picks up a toy, and ROTE remains uncertain if the goal is to clean up toys in the bedroom or place them on chairs in the living room. After the robot places the toy on a chair at $t=5$, ROTE confidently updates its program weights to reflect the "bringing toys to chairs" script. By $t=7$, ROTE can use this inferred script to rapidly and accurately predict future actions.
  • Figure 3: ROTE outperforms all baselines in both single-step and multi-step action prediction for scripted agents (a) and human agents (b). ROTE's code-based representations, which treat human actions as efficient scripts, enable it to generalize effectively from limited observations. For single-step predictions, ROTE was significantly more accurate than all baselines for both scripted agents ($p<0.05$ for NLLM, $p<0.001$ for BC and AutoToM) and human agents ($p<0.05$ for BC, $p<0.01$ for NLLM, $p<0.001$ for AutoToM). This superior performance was maintained in multi-step predictions for both agent types (scripted: $p<0.001$ for BC, AutoToM, and NLLM; human: $p<0.01$ for BC, $p<0.001$ for NLLM and AutoToM). ROTE achieved human-level accuracy in predicting human behavior.
  • Figure 4: (a) Prediction accuracy in the large-scale, partially observable Partnr environment. ROTE demonstrated a superior ability to anticipate the behavior of goal-directed, LLM-based agents, with a two-sided t-test showing ROTE significantly outperformed all other models ($p<0.001$). (b) The pseudocode example illustrates how ROTE's inferred programs capture complex task logic using conditionals and state-tracking.
  • Figure 5: Per-task accuracy comparison between different methods predicting ground truth FSM gameplay.
  • ...and 8 more figures