ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao, Jianwen Lyu, Weiyu Liu, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Manling Li

TL;DR

ENACT presents a scalable benchmark to quantify embodied cognition in Vision-Language Models by reframing world modeling from egocentric interaction as two sequence-reordering VQA tasks within a POMDP-style framework. It leverages a robotics-simulation pipeline (BEHAVIOR) to automatically generate 8,972 QA pairs across long-horizon home activities, enabling comprehensive evaluation of forward and inverse action–state reasoning against human baselines. Across proprietary and open-weight models, results show a growing performance gap with horizon, a consistent inverse-versus-forward advantage, and robust biases toward right-handed actions and human-like viewpoints, with error analyses highlighting omissions and hallucinations as dominant failure modes. Real-world evaluations corroborate simulator trends with minimal sim-to-real gap, underscoring ENACT’s utility for probing embodied decision-making in scalable, controlled settings and guiding future data design and model development for truly embodied AI.

Abstract

Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition: affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input. At the same time, the tasks avoid low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.
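The two reordering tasks described in the abstract can be illustrated with a minimal Python sketch. It is not the ENACT implementation: the `ReorderingQuestion` class, the function names, and the use of plain strings for observations and actions are all hypothetical, and the sketch assumes every item on the shuffled side is permuted (the benchmark itself may, for example, keep the initial observation fixed).

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ReorderingQuestion:
    """Hypothetical container for one sequence-reordering question."""
    given: List[str]     # side shown in its true temporal order
    shuffled: List[str]  # side the model must put back in order
    answer: List[int]    # answer[i] = index in `shuffled` of the i-th true item


def _shuffle_with_answer(items: Sequence[str], rng: random.Random) -> Tuple[List[str], List[int]]:
    order = list(range(len(items)))
    rng.shuffle(order)  # order[k] = true index of the item shown at position k
    shuffled = [items[j] for j in order]
    answer = [order.index(i) for i in range(len(items))]
    return shuffled, answer


def forward_question(observations: Sequence[str], actions: Sequence[str],
                     rng: random.Random) -> ReorderingQuestion:
    """Forward world modeling: actions are given, observations are shuffled."""
    shuffled, answer = _shuffle_with_answer(observations, rng)
    return ReorderingQuestion(list(actions), shuffled, answer)


def inverse_question(observations: Sequence[str], actions: Sequence[str],
                     rng: random.Random) -> ReorderingQuestion:
    """Inverse world modeling: observations are given, actions are shuffled."""
    shuffled, answer = _shuffle_with_answer(actions, rng)
    return ReorderingQuestion(list(observations), shuffled, answer)


# Toy trajectory with made-up identifiers (assumption: image ids and action strings).
observations = ["frame_a", "frame_b", "frame_c"]
actions = ["open the fridge", "pick up the apple"]
q_forward = forward_question(observations, actions, random.Random(0))
q_inverse = inverse_question(observations, actions, random.Random(0))
```

Reading `shuffled` in the order given by `answer` recovers the true sequence, so a model's predicted index list can be compared against `answer` directly.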

Paper Structure

This paper contains 60 sections, 7 equations, 52 figures, 12 tables, 4 algorithms.

Figures (52)

  • Figure 1: ENACT casts embodied cognition evaluation as world modeling through egocentric interaction. Grounded in a POMDP framework, ENACT considers two tasks drawn from diverse activities and scenes: forward world modeling (ordering observations given actions) and inverse world modeling (ordering actions given observations). Evaluation shows that VLM performance drops as the interaction horizon lengthens, is better on the inverse task than on the forward task, and lags behind human performance.
  • Figure 2: Overview of the ENACT data curation pipeline. We first obtain aligned scene graphs (states) and RGB observations from a mobile manipulation dataset in a robotics simulation (BEHAVIOR). The trajectory is then segmented by identifying key-frames where an abstract state change occurs (i.e., the scene graph difference is non-empty). From this set of key-frames, we sample multiple key-frame trajectories, which are used to construct the forward and inverse world modeling VQA questions. Here $N$ refers to the total number of sampled trajectories across all step lengths.
  • Figure 3: Data sources and QA examples. ENACT is built from diverse, long-horizon activities performed by real robots (left). We provide examples for (middle) forward world modeling and (right) inverse world modeling. More QA examples and prompts are available in the Appendix.
  • Figure 4: Real-World Evaluations. Key frames from the three real-world scenes used in our evaluation: kitchen, dinner table, and workspace. Together, these scenes contain diverse rigid, deformable, and articulated objects in diverse environments with varying lighting conditions.
  • Figure 5: Evaluations on image realism and anthropocentric bias on human vision through ENACT. Heatmaps show two-tailed unpaired t-test results against the baseline, using Pairwise Accuracy. $p<0.05$ is considered significant. Darker red means more significant. $\Delta$ is the performance change from the baseline. If significant and $\Delta<0$, the setting is worse than the baseline. C.2 reports the robot's performance on the left- and right-hand predicates, where Mixing is the proportion of ground truth left or right cases that are predicted as the other hand (i.e., mixing one hand into the other). $\pm$ means standard error.
  • ...and 47 more figures
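
The key-frame segmentation step described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: a scene graph is modeled here as a frozen set of hypothetical `(subject, relation, object)` triples, the difference is taken against the immediately preceding frame, and frame 0 is always treated as a key-frame.

```python
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple

Triple = Tuple[str, str, str]    # (subject, relation, object), an assumed encoding
SceneGraph = FrozenSet[Triple]


def scene_graph_diff(prev: SceneGraph, curr: SceneGraph) -> Dict[str, Set[Triple]]:
    """Return the triples added and removed between two consecutive frames."""
    return {"added": set(curr - prev), "removed": set(prev - curr)}


def find_key_frames(graphs: Sequence[SceneGraph]) -> List[int]:
    """Indices of frames whose scene graph differs from the previous frame.

    Frame 0 is included as the starting state (an assumption of this sketch).
    """
    if not graphs:
        return []
    key_frames = [0]
    for t in range(1, len(graphs)):
        diff = scene_graph_diff(graphs[t - 1], graphs[t])
        if diff["added"] or diff["removed"]:  # non-empty difference
            key_frames.append(t)
    return key_frames


# Toy trajectory with made-up predicates.
g0 = frozenset({("apple", "inside", "fridge"), ("fridge", "is", "closed")})
g1 = g0                                              # nothing changes
g2 = frozenset({("apple", "inside", "fridge"), ("fridge", "is", "open")})
g3 = frozenset({("apple", "held_by", "right_hand"), ("fridge", "is", "open")})
key_frames = find_key_frames([g0, g1, g2, g3])
```

Each pair of consecutive key-frames then yields one abstract action (the scene graph change between them), which is the unit the forward and inverse questions are built from.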