Affordance-Graphed Task Worlds: Self-Evolving Task Generation for Scalable Embodied Learning
Xiang Liu, Sen Cui, Guocai Yao, Zhong Cao, Jingheng Ma, Min Zhang, Changshui Zhang
TL;DR
The paper tackles the challenge of scalable, long-horizon embodied learning by reconstructing realistic interactive scenes from real-world observations and organizing task planning as a structured affordance graph. It introduces Affordance-Graphed Task Worlds (AGT-World) and a Self-Evolution loop in which Vision-Language Models diagnose execution failures and refine atomic policies, enabling closed-loop proposal-execution-correction cycles. The approach formalizes task generation as a graph-based path planning problem with intra-task action flows and inter-task transfers, and demonstrates large-scale generation (102 scene-task pairs) with substantial success gains and robust generalization to complex tasks. Empirical results show that graph-based generation yields high feasibility for simple tasks, while self-evolution improves long-horizon task completion; ablations highlight the importance of multi-view perception and temporal context. Overall, AGT-World provides a scalable data-generation framework with practical implications for fast, safe, sim-to-real robotic learning and lays groundwork for future RL/hierarchical integration and broader scene diversity.
Abstract
Training robotic policies directly in the real world is expensive and unscalable. Although generative simulation enables large-scale data synthesis, current approaches often fail to generate logically coherent long-horizon tasks and struggle with dynamic physical uncertainties due to open-loop execution. To address these challenges, we propose Affordance-Graphed Task Worlds (AGT-World), a unified framework that autonomously constructs interactive simulated environments and corresponding robot task policies based on real-world observations. Unlike methods relying on random proposals or static replication, AGT-World formalizes the task space as a structured graph, enabling the precise, hierarchical decomposition of complex goals into theoretically grounded atomic primitives. Furthermore, we introduce a Self-Evolution mechanism with hybrid feedback that autonomously refines policies by combining Vision-Language Model reasoning with geometric verification. Extensive experiments demonstrate that our method significantly outperforms existing approaches in success rate and generalization, achieving a self-improving cycle of proposal, execution, and correction for scalable robot learning.
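To make the idea of treating the task space as a graph concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy affordance graph in which nodes are scene states and edges are atomic primitives; every state name, primitive name, and the `plan_task` helper are hypothetical. Breadth-first search stands in for whatever planner the paper actually uses.

```python
from collections import deque

# Hypothetical affordance graph: each state maps to (primitive, next_state) edges.
# All names here are illustrative assumptions, not taken from the paper.
AFFORDANCE_GRAPH = {
    "drawer_closed": [("open_drawer", "drawer_open")],
    "drawer_open": [("grasp_cup", "cup_in_hand"), ("close_drawer", "drawer_closed")],
    "cup_in_hand": [("place_cup_in_drawer", "cup_in_drawer")],
    "cup_in_drawer": [("close_drawer", "cup_stored")],
    "cup_stored": [],
}


def plan_task(graph, start, goal):
    """Return the shortest primitive sequence from start to goal, or None if unreachable."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        for primitive, next_state in graph.get(state, []):
            if next_state not in visited:
                visited.add(next_state)
                queue.append((next_state, actions + [primitive]))
    return None


# Decompose a long-horizon goal into an ordered list of atomic primitives.
plan = plan_task(AFFORDANCE_GRAPH, "drawer_closed", "cup_stored")
print(plan)
```

In this toy setting a long-horizon task is simply a path through the graph, so an infeasible goal shows up as a missing path rather than as a failed rollout.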
