ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao, Jianwen Lyu, Weiyu Liu, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Manling Li
TL;DR
ENACT presents a scalable benchmark to quantify embodied cognition in Vision-Language Models by reframing world modeling from egocentric interaction as two sequence-reordering VQA tasks within a POMDP-style framework. It leverages a robotics-simulation pipeline (BEHAVIOR) to automatically generate 8,972 QA pairs across long-horizon home activities, enabling comprehensive evaluation of forward and inverse action–state reasoning against human baselines. Across proprietary and open-weight models, results show a gap between models and humans that widens with interaction horizon, consistently better performance on the inverse task than on the forward task, and robust biases toward right-handed actions and human-like viewpoints, with error analyses highlighting omissions and hallucinations as dominant failure modes. Real-world evaluations corroborate simulator trends with minimal sim-to-real gap, underscoring ENACT’s utility for probing embodied decision-making in scalable, controlled settings and guiding future data design and model development for truly embodied AI.
Abstract
Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition: affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input, while avoiding low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.
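To make the sequence-reordering format concrete, the following is a minimal Python sketch of how one reordering question could be built and scored. It is an illustration under stated assumptions, not the benchmark's actual code: the names `ReorderItem`, `make_item`, `exact_match`, and `pairwise_accuracy` are hypothetical, the string labels stand in for egocentric images (forward task) or scene-graph-change actions (inverse task), and the two scoring functions are assumed for illustration rather than taken from the abstract.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class ReorderItem:
    """One hypothetical reordering question (names are illustrative only)."""
    shuffled: list[str]    # candidates in the order they are shown to the model
    gold_order: list[int]  # indices into `shuffled` that restore the true sequence


def make_item(sequence: list[str], rng: random.Random) -> ReorderItem:
    """Shuffle a ground-truth sequence and record the permutation that undoes it."""
    perm = list(range(len(sequence)))
    rng.shuffle(perm)
    shuffled = [sequence[i] for i in perm]
    # gold_order[k] is the position in `shuffled` that holds the k-th true element.
    gold_order = [perm.index(k) for k in range(len(sequence))]
    return ReorderItem(shuffled, gold_order)


def exact_match(pred: list[int], gold: list[int]) -> bool:
    """Strict score: the whole predicted ordering must equal the gold ordering."""
    return list(pred) == list(gold)


def pairwise_accuracy(pred: list[int], gold: list[int]) -> float:
    """Partial credit: fraction of adjacent gold transitions also adjacent in pred."""
    gold_pairs = set(zip(gold, gold[1:]))
    pred_pairs = set(zip(pred, pred[1:]))
    if not gold_pairs:
        return 1.0
    return len(gold_pairs & pred_pairs) / len(gold_pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder labels; in the forward task these would be observations,
    # in the inverse task they would be actions (scene graph changes).
    true_sequence = ["state_1", "state_2", "state_3", "state_4"]
    item = make_item(true_sequence, rng)
    print("shown to model:", item.shuffled)
    print("gold order:    ", item.gold_order)
    print("exact match on gold:", exact_match(item.gold_order, item.gold_order))
```

The same structure would serve both tasks: only what is shuffled (observations or actions) and what is given as context changes. A partial-credit score of this kind is one way to distinguish a nearly correct ordering from a random one as the interaction horizon grows.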
