Automatic Cognitive Task Generation for In-Situ Evaluation of Embodied Agents

Xinyi He; Ying Yang; Chuanjian Fu; Sihan Guo; Songchun Zhu; Lifeng Fan; Zhenliang Zhang; Yujia Peng

TL;DR

The paper tackles the problem of evaluating embodied agents in unseen 3D environments, where existing benchmarks suffer from data contamination and lack scene specificity. It introduces TEA, a two-stage interaction-evolution framework that represents tasks as structured graphs and generates them in situ within Unreal Engine through agent-in-loop interaction and graph-based recombination. Key contributions include formal graph-based task definitions, an agent–environment interaction mechanism, a structure-based task recombination and reuse strategy, and metrics such as MIR and spatial statistics for measuring task diversity and coverage. Empirical results show that TEA can autonomously generate tens of thousands of tasks across unseen scenes (87,876 tasks in two cycles over 10 scenes) and reveal that state-of-the-art models still struggle with basic perception and 3D-aware interaction, underscoring the need for in-situ evaluation before real-world deployment. The authors also release their generated data to accelerate research in embodied intelligence and in-situ task generation.

Abstract

As general intelligent agents are poised for widespread deployment in diverse households, evaluation tailored to each unique unseen 3D environment has become a critical prerequisite. However, existing benchmarks suffer from severe data contamination and a lack of scene specificity, making them inadequate for assessing agent capabilities in unseen settings. To address this, we propose a dynamic in-situ task generation method for unseen environments inspired by human cognition. We define tasks through a structured graph representation and construct a two-stage interaction-evolution task generation system for embodied agents (TEA). In the interaction stage, the agent actively interacts with the environment, creating a loop between task execution and generation that allows for continuous task generation. In the evolution stage, task graph modeling allows us to recombine and reuse existing tasks to generate new ones without external data. Experiments across 10 unseen scenes demonstrate that TEA automatically generated 87,876 tasks in two cycles, which human verification confirmed to be physically reasonable and to encompass essential daily cognitive capabilities. Benchmarking SOTA models against humans on our in-situ tasks reveals that models, despite excelling on public benchmarks, perform surprisingly poorly on basic perception tasks, severely lack 3D interaction awareness, and show high sensitivity to task types in reasoning. These sobering findings highlight the necessity of in-situ evaluation before deploying agents into real-world human environments.
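
As a rough illustration of the interaction stage, the loop between task execution and task generation might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `run_interaction_loop` and the callables `execute`, `propose`, `keep`, and `random_walk` are hypothetical names standing in for the simulator rollout, the step that proposes tasks from recorded data, the task filter, and the $\epsilon$-random walk mentioned in the Figure 1 caption. The default `epsilon` and `max_steps` values are arbitrary.

```python
import random
from collections import deque


def run_interaction_loop(seed_tasks, execute, propose, keep, random_walk,
                         epsilon=0.2, max_steps=50, rng=None):
    """Hypothetical agent-in-loop sketch: executing tasks yields new tasks.

    execute(task)   -> record of what the agent observed (assumed interface)
    propose(record) -> iterable of candidate tasks built from that record
    keep(task)      -> True if the task may be executed in a later iteration
    random_walk()   -> record gathered by exploring instead of executing
    """
    rng = rng or random.Random(0)
    queue = deque(seed_tasks)      # tasks still waiting to be executed
    generated = list(seed_tasks)   # every task produced so far, in order
    seen = set(seed_tasks)         # used to drop exact duplicates
    steps = 0
    while queue and steps < max_steps:
        task = queue.popleft()
        steps += 1
        # With probability epsilon, explore randomly instead of executing.
        if rng.random() < epsilon:
            record = random_walk()
        else:
            record = execute(task)
        for new_task in propose(record):
            if new_task in seen:
                continue
            seen.add(new_task)
            generated.append(new_task)
            # Only filtered tasks feed the next iteration of the loop.
            if keep(new_task):
                queue.append(new_task)
    return generated
```

In this sketch every novel proposal is recorded, but only tasks that pass `keep` are queued for execution, which is one plausible reading of "a task filter selects a subset of tasks for the next iteration".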

Paper Structure (10 sections, 8 equations, 4 figures, 4 tables)

This paper contains 10 sections, 8 equations, 4 figures, 4 tables.

Introduction
Methods
Task definition
A Two-Stage Task Generation Method
Agent-in-Loop Task Generation Method
Structure-based Task Recombination and Reuse
Experiments
Experiment 1: Generated Tasks
Experiment 2: TEA-Test
Discussion and Conclusion

Figures (4)

Figure 1: Overview of two-stage dynamic task generation system and task examples. (a) Samples of UE and real-world scenes, and data returned by the UE simulator. (b) Agent–Environment Interaction (left): The agent executes tasks, collects data, and generates new tasks based on recorded data. A task filter selects a subset of tasks for the next iteration, while an $\epsilon$-random walk ensures diversity. Task Evolution (right): The task recombination and reuse strategy leverages existing tasks to form new ones. (c) Example tasks such as navigation and object in-view check (e.g., the red table is not visible from the agent’s view).
Figure 2: Task evolution visualization. Solid arrows represent tasks generated through interaction during the execution of initial tasks, while dashed arrows represent tasks evolved from existing tasks. For task evolution, blue triangular structures depict the structure reused between tasks, while the red dashed triangle illustrates task recombination, which exchanges nodes with different attributes (underlined words mark the elements added in the final state compared with the initial state).
Figure 3: A Taxonomy of Tasks. Tasks are organized by dimension (2D image vs. 3D physical space) and cognitive load (from perception to reasoning and decision-making).
Figure 4: Comparison of MIR across four VLMs and 10 different scenes. Each point represents one scene and bars show the average MIR over the 10 scenes. Our methods consistently improve task diversity (t-tests: $^{**}p<0.01$, $^{***}p<0.001$).
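
The recombination step described in the Figure 2 caption, where tasks exchange nodes with different attributes, can be illustrated with a small graph sketch. Everything here is an assumption made for illustration: the `Node` and `TaskGraph` classes, the `kind` and `attrs` fields, the relation labels, and the rule that only same-kind nodes are exchanged are hypothetical and are not taken from the paper's formal task definition.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Node:
    kind: str     # hypothetical node type, e.g. "agent" or "object"
    attrs: tuple  # attribute pairs, e.g. (("name", "table"), ("color", "red"))


@dataclass(frozen=True)
class TaskGraph:
    nodes: tuple  # tuple of Node
    edges: tuple  # tuple of (src_index, relation, dst_index)


def recombine(a, b):
    """Build new task graphs by swapping one node of `a` for a node of `b`.

    A swap is allowed only when both nodes have the same kind but different
    attributes; the edge structure of `a` is reused unchanged.
    """
    results = []
    for (i, node_a), node_b in product(enumerate(a.nodes), b.nodes):
        if node_a.kind == node_b.kind and node_a.attrs != node_b.attrs:
            nodes = a.nodes[:i] + (node_b,) + a.nodes[i + 1:]
            candidate = TaskGraph(nodes, a.edges)
            if candidate not in results:  # skip duplicate graphs
                results.append(candidate)
    return results


# Usage: a "navigate to the red table" task borrows the sofa node from a
# second task, yielding a "navigate to the blue sofa" task.
agent = Node("agent", ())
table = Node("object", (("name", "table"), ("color", "red")))
sofa = Node("object", (("name", "sofa"), ("color", "blue")))
go_to_table = TaskGraph((agent, table), ((0, "navigate_to", 1),))
see_sofa = TaskGraph((agent, sofa), ((0, "in_view_of", 1),))
new_tasks = recombine(go_to_table, see_sofa)
```

Because the new graph keeps the edges of the first task, the sketch also shows structure reuse in a simple form: the relation pattern is preserved and only the node contents change.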
