SPIRAL: A Closed-Loop Framework for Self-Improving Action World Models via Reflective Planning Agents

Yu Yang, Yue Liao, Jianbiao Mei, Baisen Wang, Xuemeng Yang, Licheng Wen, Jiangning Zhang, Xiangtai Li, Hanlin Chen, Botian Shi, Yong Liu, Shuicheng Yan, Gim Hee Lee

TL;DR

SPIRAL is a closed-loop framework for self-improving planning and iterative reflective action world modeling. It enables controllable long-horizon video generation conditioned on high-level semantic actions, and its closed-loop design naturally supports evolving RL optimization, improving semantic alignment and temporal consistency over extended horizons.

Abstract

We introduce SPIRAL, a closed-loop framework for self-improving planning and iterative reflective action world modeling that enables controllable long-horizon video generation conditioned on high-level semantic actions. Existing one-shot video generation models operate in an open-loop manner, which often results in incomplete action execution, weak semantic grounding, and temporal drift. SPIRAL formulates action world modeling (ActWM) as a closed-loop think-act-reflect process, in which generation proceeds step by step under explicit planning and feedback. A PlanAgent decomposes abstract actions into object-centric sub-actions, while a CriticAgent evaluates intermediate results and guides iterative refinement with long-horizon memory. This closed-loop design naturally supports evolving RL optimization, improving semantic alignment and temporal consistency over extended horizons. We further introduce the ActWM-Dataset and ActWM-Bench for training and evaluation. Experiments across multiple text-image-to-video (TI2V) backbones demonstrate consistent gains on ActWM-Bench and mainstream video generation benchmarks, validating SPIRAL's effectiveness.
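
The think-act-reflect process described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function `run_closed_loop`, the `StepResult` record, the `threshold` and `max_retries` parameters, and the toy `plan`/`generate`/`critique` callables are all hypothetical stand-ins for the PlanAgent, the video backbone, and the CriticAgent. The sketch models only a single per-step retry loop with a growing memory, not the dual-level feedback structure of the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    """One executed sub-action kept in long-horizon memory (hypothetical record)."""
    sub_action: str
    clip: str
    score: float
    attempts: int


def run_closed_loop(
    goal: str,
    plan: Callable[[str], List[str]],
    generate: Callable[[str, List[StepResult], str], str],
    critique: Callable[[str, str, List[StepResult]], Tuple[float, str]],
    threshold: float = 0.5,   # assumed acceptance score, not from the paper
    max_retries: int = 2,     # assumed retry budget, not from the paper
) -> List[StepResult]:
    memory: List[StepResult] = []
    # Think: decompose the abstract goal into sub-actions.
    for sub_action in plan(goal):
        best_clip, best_score = "", float("-inf")
        feedback = ""
        attempts = 0
        while attempts <= max_retries:
            attempts += 1
            # Act: generate a clip conditioned on the sub-action, memory, and feedback.
            clip = generate(sub_action, memory, feedback)
            # Reflect: score the clip and obtain feedback for a possible retry.
            score, feedback = critique(sub_action, clip, memory)
            if score > best_score:
                best_clip, best_score = clip, score
            if score >= threshold:
                break
        memory.append(StepResult(sub_action, best_clip, best_score, attempts))
    return memory


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_plan(goal: str) -> List[str]:
    return [s.strip() for s in goal.split(" then ")]


def toy_generate(sub_action: str, memory: List[StepResult], feedback: str) -> str:
    suffix = "+refined" if feedback else ""
    return f"clip[{sub_action}]{suffix}"


def toy_critique(sub_action: str, clip: str, memory: List[StepResult]) -> Tuple[float, str]:
    # Pretend that pouring fails on the first try and succeeds once refined.
    if "pour" in sub_action and "+refined" not in clip:
        return 0.2, "action incomplete"
    return 0.9, ""


if __name__ == "__main__":
    for step in run_closed_loop("pick up the cup then pour the water",
                                toy_plan, toy_generate, toy_critique):
        print(step)
```

Keeping the best-scoring clip when the retry budget runs out is one possible design choice; a real system might instead escalate to replanning.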

Paper Structure (42 sections, 10 equations, 11 figures, 6 tables, 1 algorithm)

Figures (11)

  • Figure 1: Action World Models (ActWM): Challenges and Solution. (a) General TI2V handles instructions in a one-shot, open-loop manner, leading to incomplete action execution, hallucinated motions, and temporal drift. (b) We introduce a closed-loop think-act-reflect formulation, where generation proceeds step by step under explicit planning and feedback, enabling actions to be executed persistently and corrected over time. (c) To support this, ActWM-Dataset and ActWM-Bench enable training and evaluation of ActWMs. (d) The closed-loop structure further supports RL optimization, allowing continuous refinement over time.
  • Figure 2: Framework Overview. (a) Closed-Loop Think-Act-Reflect: PlanAgent decomposes abstract goals into atomic plans for ActWM execution, while CriticAgent evaluates videos to trigger dual-level feedback (Inner/Outer Loops) for refinement; (b) Progressive-Evolution GRPO: WorldModel generates group rollouts guided by PlanAgent, leveraging CriticAgent rewards for policy optimization.
  • Figure 3: Overview and Statistics of ActWM-Dataset. (a) A structured data annotation example featuring Goal, CoT, and step-wise Video-Action-Critic tuples; (b-f) Distribution analysis across video duration, step length, scene types, perspectives, and action keywords.
  • Figure 4: PlanAgent Robustness to Task Length. Comparison of accuracy (%) across varying horizons; incorporating World Memory (PlanAgent + Mem.) maintains stable performance.
  • Figure 5: CriticAgent Discriminative Capability. Incorporating RM (SFT+RM) induces highly polarized scores, providing sharper signals to better penalize failed executions.
  • ...and 6 more figures
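
The Figure 2(b) caption mentions group rollouts scored by CriticAgent rewards. As a rough illustration, the sketch below shows the group-relative advantage normalization used in standard GRPO, where each rollout's reward is compared with the mean and standard deviation of its group. It is not the paper's Progressive-Evolution variant, whose details are not given here; the function name, the `eps` parameter, and the example rewards are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own rollout group (standard GRPO style).

    `rewards` holds one scalar per rollout generated for the same prompt;
    in this setting they would come from a critic. `eps` (assumed value)
    avoids division by zero when every rollout gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical critic scores for four rollouts of one sub-action.
    print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))
```

Rollouts that score above the group mean receive positive advantages and are reinforced, while those below the mean are penalized, so no separate value network is needed.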