SPIRAL: A Closed-Loop Framework for Self-Improving Action World Models via Reflective Planning Agents

Yu Yang; Yue Liao; Jianbiao Mei; Baisen Wang; Xuemeng Yang; Licheng Wen; Jiangning Zhang; Xiangtai Li; Hanlin Chen; Botian Shi; Yong Liu; Shuicheng Yan; Gim Hee Lee

TL;DR

SPIRAL is a closed-loop framework for action world modeling (ActWM) that combines self-improving planning with iterative reflection. It enables controllable long-horizon video generation conditioned on high-level semantic actions, and its closed-loop design naturally supports reinforcement learning (RL) optimization, improving semantic alignment and temporal consistency over extended horizons.

Abstract

We introduce SPIRAL, a closed-loop framework for action world modeling (ActWM) that combines self-improving planning with iterative reflection to enable controllable long-horizon video generation conditioned on high-level semantic actions. Existing one-shot video generation models operate in an open-loop manner, which often results in incomplete action execution, weak semantic grounding, and temporal drift. SPIRAL formulates ActWM as a closed-loop think-act-reflect process in which generation proceeds step by step under explicit planning and feedback. A PlanAgent decomposes abstract actions into object-centric sub-actions, while a CriticAgent evaluates intermediate results and guides iterative refinement with long-horizon memory. This closed-loop design naturally supports reinforcement learning (RL) optimization that refines the model over time, improving semantic alignment and temporal consistency over extended horizons. We further introduce the ActWM-Dataset and ActWM-Bench for training and evaluation. Experiments across multiple text-image-to-video (TI2V) backbones demonstrate consistent gains on ActWM-Bench and mainstream video generation benchmarks, validating SPIRAL's effectiveness.
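The think-act-reflect loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `closed_loop_generate`, the `Memory` class, the retry budget, the score threshold, and the toy stand-ins for PlanAgent, the world model, and CriticAgent are all hypothetical names and choices made for illustration, and only a single per-step retry loop is shown.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Memory:
    """Hypothetical long-horizon memory: accepted (sub_action, clip) pairs."""
    steps: List[Tuple[str, str]] = field(default_factory=list)


def closed_loop_generate(goal, plan, generate, critique,
                         max_retries=2, threshold=0.5):
    """Run a think-act-reflect loop over the sub-actions of `goal`.

    `plan`, `generate`, and `critique` are caller-supplied stand-ins for a
    planner, a video generator, and a critic; their signatures are assumptions.
    """
    memory = Memory()
    for sub_action in plan(goal):                      # think: decompose goal
        feedback = None
        for _ in range(max_retries + 1):
            clip = generate(sub_action, memory, feedback)               # act
            score, feedback = critique(sub_action, clip, memory)        # reflect
            if score >= threshold:
                break
        # If retries run out, the last attempt is kept (an assumption).
        memory.steps.append((sub_action, clip))
    return memory


# Toy stand-ins so the sketch runs end to end; clips are plain strings.
def toy_plan(goal):
    return [s.strip() for s in goal.split(",")]


def toy_generate(sub_action, memory, feedback):
    if feedback is None:
        return f"clip({sub_action})"
    return f"clip({sub_action}, refined)"


def toy_critique(sub_action, clip, memory):
    ok = "refined" in clip or "open" in sub_action
    return (1.0 if ok else 0.0), (None if ok else "action incomplete")


result = closed_loop_generate("open the drawer, pick up the cup",
                              toy_plan, toy_generate, toy_critique)
for sub_action, clip in result.steps:
    print(sub_action, "->", clip)
```

In this toy run the first sub-action is accepted on the first attempt, while the second is rejected once and regenerated with the critic's feedback, which is the behavior an open-loop, one-shot generator cannot provide.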

Paper Structure (42 sections, 10 equations, 11 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Method
Overview
PlanAgent: Structured Reasoning and Planning
Action-Conditioned World Model
CriticAgent: Reward and Closed-Loop Feedback
Progressive-Evolving: Closed-Loop GRPO Training
Dataset and Benchmark
ActWM-Dataset Construction
ActWM-Bench Construction
Experiments
Experimental Settings
Evaluations on PlanAgent
Evaluations on CriticAgent
...and 27 more sections

Figures (11)

Figure 1: Action World Models (ActWM): Challenges and Solution. (a) General TI2V handles instructions in a one-shot, open-loop manner, leading to incomplete action execution, hallucinated motions, and temporal drift. (b) We introduce a closed-loop think-act-reflect formulation, where generation proceeds step by step under explicit planning and feedback, enabling actions to be executed persistently and corrected over time. (c) To support this, ActWM-Dataset and ActWM-Bench enable training and evaluation of ActWMs. (d) The closed-loop structure further supports RL optimization, allowing continuous refinement over time.
Figure 2: Framework Overview. (a) Closed-Loop Think-Act-Reflect: PlanAgent decomposes abstract goals into atomic plans for ActWM execution, while CriticAgent evaluates videos to trigger dual-level feedback (Inner/Outer Loops) for refinement; (b) Progressive-Evolution GRPO: WorldModel generates group rollouts guided by PlanAgent, leveraging CriticAgent rewards for policy optimization.
Figure 3: Overview and Statistics of ActWM-Dataset. (a) A structured data annotation example featuring Goal, CoT, and step-wise Video-Action-Critic tuples; (b-f) Distribution analysis across video duration, step length, scene types, perspectives, and action keywords.
Figure 4: PlanAgent Robustness to Task Length. Comparison of accuracy (%) across varying horizons; incorporating World Memory (PlanAgent + Mem.) maintains stable performance.
Figure 5: CriticAgent Discriminative Capability. Incorporating RM (SFT+RM) induces highly polarized scores, providing sharper signals that better penalize failed executions.
...and 6 more figures
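
The caption of Figure 2(b) mentions group rollouts scored by CriticAgent rewards. In GRPO-style training, such rewards are commonly turned into group-relative advantages by normalizing each rollout's reward against the mean and standard deviation of its group. The sketch below shows only that normalization step; the function name, the use of the population standard deviation, and the `eps` term are assumptions, and the paper's exact reward and objective are not reproduced here.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize one group's rewards to zero mean and (roughly) unit scale.

    `rewards` holds one scalar per rollout in the group, e.g. critic scores.
    `eps` guards against division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical rollouts: two judged successful, two judged failed.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Rollouts scored above the group mean receive positive advantages and those below receive negative ones, so sharper, more polarized critic scores translate directly into a stronger learning signal.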
