Affordance-Graphed Task Worlds: Self-Evolving Task Generation for Scalable Embodied Learning

Xiang Liu, Sen Cui, Guocai Yao, Zhong Cao, Jingheng Ma, Min Zhang, Changshui Zhang

TL;DR

The paper tackles the challenge of scalable, long-horizon embodied learning by reconstructing realistic interactive scenes from real-world observations and organizing task planning as a structured affordance graph. It introduces Affordance-Graphed Task Worlds (AGT-World) and a Self-Evolution loop in which Vision-Language Models diagnose execution failures and refine atomic policies, enabling closed-loop proposal-execution-correction cycles. The approach formalizes task generation as a graph-based path planning problem with intra-task action flows and inter-task transfers, and demonstrates large-scale generation (102 scene-task pairs) with substantial success gains and robust generalization to complex tasks. Empirical results show that graph-based generation yields high feasibility for simple tasks, while self-evolution improves long-horizon task completion; ablations highlight the importance of multi-view perception and temporal context. Overall, AGT-World provides a scalable data-generation framework with practical implications for fast, safe, sim-to-real robotic learning and lays groundwork for future RL/hierarchical integration and broader scene diversity.

Abstract

Training robotic policies directly in the real world is expensive and unscalable. Although generative simulation enables large-scale data synthesis, current approaches often fail to generate logically coherent long-horizon tasks and struggle with dynamic physical uncertainties due to open-loop execution. To address these challenges, we propose Affordance-Graphed Task Worlds (AGT-World), a unified framework that autonomously constructs interactive simulated environments and corresponding robot task policies based on real-world observations. Unlike methods relying on random proposals or static replication, AGT-World formalizes the task space as a structured graph, enabling the precise, hierarchical decomposition of complex goals into theoretically grounded atomic primitives. Furthermore, we introduce a Self-Evolution mechanism with hybrid feedback to autonomously refine policies, combining Vision-Language Model reasoning and geometric verification. Extensive experiments demonstrate that our method significantly outperforms existing approaches in success rate and generalization, achieving a self-improving cycle of proposal, execution, and correction for scalable robot learning.

Paper Structure (55 sections, 1 theorem, 28 equations, 10 figures, 9 tables, 1 algorithm)

This paper contains 55 sections, 1 theorem, 28 equations, 10 figures, 9 tables, 1 algorithm.

Key Result

Proposition 4.1

Let $S_\tau\subset \mathcal{S}$ denote the global world state at time $\tau$. Assume the AGT-World graph $G_{S_0}\subset\mathcal{G}$ satisfies: […] Then for any complex long-horizon task $\mathcal{T}=\{T_k\}_{k=1}^{K}$ and global world states $S_0=S_\text{init}^{(1)},S_\tau=S_\text{goal}^{(K)}\subset \mathcal{S}$, there exists […] such that […] with $[0,\tau]=\bigcup_{k=1}^K[\tau_{k-1},\tau_{k}]$.
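The idea named in the proposition's title, global reachability via hierarchical composition, can be illustrated with a toy example. The following is a minimal Python sketch, not the paper's implementation: the symbolic state strings, the `GRAPH` dictionary, and the helpers `plan` and `compose` are all hypothetical, and plain breadth-first search stands in for whatever planner the authors use. It borrows the "Transport a glass into a refrigerator" task from Figure 4 and shows how per-subtask paths chain into one long-horizon plan when the goal state of subtask $k$ is the initial state of subtask $k+1$.

```python
from collections import deque

# Hypothetical toy graph: nodes are symbolic world states,
# edges are (atomic primitive, next state) pairs.
GRAPH = {
    "fridge_closed,glass_on_table": [("open_fridge", "fridge_open,glass_on_table")],
    "fridge_open,glass_on_table": [("pick_glass", "fridge_open,glass_in_hand")],
    "fridge_open,glass_in_hand": [("place_glass", "fridge_open,glass_in_fridge")],
    "fridge_open,glass_in_fridge": [("close_fridge", "fridge_closed,glass_in_fridge")],
}

# Boundary states of the four subtasks: S_init^(1), S_goal^(1) = S_init^(2), ..., S_goal^(4).
WAYPOINTS = [
    "fridge_closed,glass_on_table",
    "fridge_open,glass_on_table",
    "fridge_open,glass_in_hand",
    "fridge_open,glass_in_fridge",
    "fridge_closed,glass_in_fridge",
]


def plan(graph, start, goal):
    """Breadth-first search; returns the primitive names from start to goal, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        for action, nxt in graph.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [action]))
    return None


def compose(graph, waypoints):
    """Chain per-subtask plans between consecutive boundary states."""
    full_plan = []
    for s_init, s_goal in zip(waypoints, waypoints[1:]):
        sub_plan = plan(graph, s_init, s_goal)
        if sub_plan is None:  # one unreachable subtask breaks the whole task
            return None
        full_plan.extend(sub_plan)
    return full_plan


full_plan = compose(GRAPH, WAYPOINTS)
```

In this sketch the long-horizon plan exists exactly when every subtask's goal is reachable from its initial state, which mirrors the partition of $[0,\tau]$ into the sub-intervals $[\tau_{k-1},\tau_k]$.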

Figures (10)

  • Figure 1: An introduction to our method. A. Video generation models often produce physically implausible behaviors. We instead employ a physics simulation engine to reconstruct, from real-world images and at low cost, simulated scenes that preserve semantics and global state. B. Randomly generated scenes are far from realistic, and RL suffers from sparse rewards. We adopt affordance-graphed task generation, which enables planning and execution of action sequences in arbitrary scenes. C. In long-horizon tasks, policies typically lack execution feedback, leading to error accumulation. We introduce VLM-supervised self-evolution to iteratively correct task actions during execution.
  • Figure 2: Affordance-Graphed Task Worlds. Any complex long-horizon task is decomposed into multiple simple tasks, connected via inter-task edges that bridge different object slices or reset temporal states.
  • Figure 3: Graph-Based Task Generation with Self-Evolution. Left: Stage I receives a user keyword and scene information and produces the complete task name and an expanded task-scene description. Stage II uses the definitions of $o_t,o_{s_1},o_{s_2}$ and boundary cases for decomposition to provide subtask descriptions and a precise matching of relevant objects. Stage III, for each subtask, adjusts the target object size, generates a BDDL behavior1k file that defines $s_\text{init}, s_\text{goal}$, and matches each step with corresponding actions from the primitive action library. Right: the Self-Evolution Mechanism generates new action sequences and reasons until the maximum number of iterations per subtask is reached. For details, see the appendices on task generation and self-evolution.
  • Figure 4: Detailed Action Display of Self-Evolution. Here we take the task of "Transport a glass into a refrigerator" as an example. This task is divided into four subtasks: "open refrigerator", "pick up glass", "put glass into refrigerator" and "close refrigerator". Among them, subtask 3 was successfully completed after 3 iterations, while subtask 4 went through 2 iterations.
  • Figure 5: Comprehensive Evaluation and Ablation Studies. (a) Generalization performance across 34 diverse real-world scenes, with results split into two types of tasks. Color indicates success (green) or failure (red). (b) Task progression trajectory showing the relationship between cumulative iterations and sub-task completion. (c) Cost distribution per subtask, visualizing the density of required iterations. (d) Ablation study on camera views, comparing SR across different visual input configurations. (e) Ablation study on the context window ($p_1$), analyzing the impact of the history window on SR. All ablations are performed on task $\mathcal{T}_1$ using Qwen3-VL.
  • ...and 5 more figures
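
The self-evolution loop described in the captions of Figures 3 and 4 (execute a subtask's action sequence, check the outcome, and let a VLM propose a corrected sequence with a reason, up to a maximum number of iterations) can be sketched as below. This is a hedged illustration under stated assumptions, not the authors' code: `self_evolve`, its parameters, and the `toy_*` stand-ins for the simulator, the geometric check, and the VLM are all hypothetical, and `history_window` only loosely mirrors the context window $p_1$ from Figure 5(e).

```python
from typing import Callable, Dict, List, Tuple

Actions = List[str]
History = List[Tuple[Actions, str]]


def self_evolve(
    initial_actions: Actions,
    execute: Callable[[Actions], Dict[str, bool]],
    verify: Callable[[Dict[str, bool]], bool],
    diagnose: Callable[[Actions, Dict[str, bool], History], Tuple[Actions, str]],
    max_iters: int = 3,
    history_window: int = 2,
) -> Tuple[bool, Actions, int]:
    """Run one subtask in a closed loop; returns (success, final actions, iterations used)."""
    actions = list(initial_actions)
    history: History = []
    for iteration in range(1, max_iters + 1):
        observation = execute(actions)          # roll out in the simulator
        if verify(observation):                 # e.g. a geometric goal check
            return True, actions, iteration
        # Only the most recent attempts are passed back as context.
        context = history[-history_window:] if history_window > 0 else []
        new_actions, reason = diagnose(actions, observation, context)
        history.append((actions, reason))
        actions = new_actions
    return False, actions, max_iters


# Toy stand-ins (assumptions): the glass reaches the fridge only if grasped before release.
def toy_execute(actions: Actions) -> Dict[str, bool]:
    holding, in_fridge = False, False
    for action in actions:
        if action == "grasp":
            holding = True
        elif action == "release" and holding:
            holding, in_fridge = False, True
    return {"glass_in_fridge": in_fridge}


def toy_verify(observation: Dict[str, bool]) -> bool:
    return observation["glass_in_fridge"]


def toy_diagnose(actions: Actions, observation: Dict[str, bool], context: History):
    if "grasp" not in actions:
        return ["grasp"] + actions, "glass was never grasped"
    return actions, "no change proposed"


ok, final_actions, iters = self_evolve(
    ["move_to_fridge", "release"], toy_execute, toy_verify, toy_diagnose
)
```

Here the first attempt fails, the diagnosis step prepends the missing `grasp`, and the second attempt succeeds. In a real pipeline `diagnose` would wrap a VLM call over rendered camera views, which this sketch deliberately leaves out.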

Theorems & Definitions (4)

  • Definition 3.1: Simple Task
  • Definition 3.2: Complex Long-Horizon Task
  • Proposition 4.1: Global Reachability via Hierarchical Composition
  • proof