SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation

Youqiang Gui, Yuxuan Zhou, Shen Cheng, Xinyang Yuan, Haoqiang Fan, Peng Cheng, Shuaicheng Liu

TL;DR

This paper proposes Self-Evolving Gated Attention (SEGA), a temporal module that maintains a time-evolving latent state via gated attention, enabling efficient recurrent updates that compress long-horizon observations into a fixed-size representation while filtering irrelevant temporal information.

Abstract

Imitation Learning (IL) enables robots to acquire manipulation skills from expert demonstrations. Diffusion Policy (DP) models multi-modal expert behaviors but suffers performance degradation as observation horizons increase, limiting long-horizon manipulation. We propose Self-Evolving Gated Attention (SEGA), a temporal module that maintains a time-evolving latent state via gated attention, enabling efficient recurrent updates that compress long-horizon observations into a fixed-size representation while filtering irrelevant temporal information. Integrating SEGA into DP yields Self-Evolving Diffusion Policy (SeedPolicy), which resolves the temporal modeling bottleneck and enables scalable horizon extension with moderate overhead. On the RoboTwin 2.0 benchmark with 50 manipulation tasks, SeedPolicy outperforms DP and other IL baselines. Averaged across both CNN and Transformer backbones, SeedPolicy achieves a 36.8% relative improvement over DP in clean settings and a 169% relative improvement in randomized challenging settings. Compared to vision-language-action models such as RDT with 1.2B parameters, SeedPolicy achieves competitive performance with one to two orders of magnitude fewer parameters, demonstrating strong efficiency and scalability. These results establish SeedPolicy as a state-of-the-art imitation learning method for long-horizon robotic manipulation. Code is available at: https://github.com/Youqiang-Gui/SeedPolicy.
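
For readers unfamiliar with the term, the relative improvements quoted in the abstract can be read under the usual definition below. This is an assumed reading, since the abstract does not spell out the formula; $s_{\text{DP}}$ and $s_{\text{Seed}}$ are hypothetical symbols for the average success rates of DP and SeedPolicy.

$$
\text{Relative improvement} = \frac{s_{\text{Seed}} - s_{\text{DP}}}{s_{\text{DP}}} \times 100\%
$$

As a purely illustrative example with made-up numbers (not results from the paper), a baseline at $s_{\text{DP}} = 10\%$ and a method at $s_{\text{Seed}} = 25\%$ would give $(25 - 10)/10 \times 100\% = 150\%$. Under this definition, relative improvements above 100% arise when the baseline success rate is low.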

Paper Structure (29 sections, 8 equations, 13 figures, 9 tables)

This paper contains 29 sections, 8 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Horizon scaling analysis. (a) DP's performance counter-intuitively degrades as the observation horizon grows, falling to 0% at large horizons (data omitted). (b) In contrast, our approach enables robust horizon scaling, utilizing long observation horizons to improve task success rates.
  • Figure 2: Overview of the SeedPolicy framework. The system takes current RGB images and joint poses as input, encoding them via a ResNet Encoder. The core Self-Evolving Gated Attention (SEGA) module (blue box) recursively updates a time-evolving latent state ($State_t$) to capture long-term spatiotemporal dependencies while generating enhanced observation features ($EObs_t$). These context-rich features are then fed into the Action Expert, a transformer-based diffusion model, to predict a sequence of future actions.
  • Figure 3: (a) SEGA employs a dual-stream design: the State Update stream (top) evolves the latent state ($State_{t-1}$) by integrating new observations, while the State Retrieval stream (bottom) utilizes historical context to generate enhanced observation features ($EObs_t$). (b) The Self-Evolving Gate (SEG) dynamically computes a gating signal directly from the cross-attention maps. It selectively fuses the intermediate evolved state ($\text{Inter.}\,S_{t}$) with the previous state, ensuring only semantically relevant information is preserved while filtering out noise.
  • Figure 4: Performance comparison across varying task lengths. A consistent trend emerges in both architectures: as the task length increases, the performance gap between SeedPolicy and the baseline progressively widens. This validates the architecture-agnostic effectiveness of our approach, demonstrating that the advantage of our explicit temporal modeling becomes increasingly significant in long-horizon scenarios compared to fixed-window baselines.
  • Figure 5: Qualitative visualization of failure cases in simulation. We compare the successful execution of SeedPolicy (top row) against representative failure modes of the DP across three tasks: (a) Put_Bottles_Dustbin ("clean" setting), (b) Handover_Mic ("hard" setting), and (c) Grab_Roller ("hard" setting). Red circles highlight critical errors, including execution stagnation (getting stuck) and spatial positioning failures (collisions or air grabs). Additional visualizations are provided in the Appendix.
  • ...and 8 more figures
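
The dual-stream design described in the Figure 2 and Figure 3 captions can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`cross_attention`, `sega_step`), the absence of learned projections, the residual connection in the retrieval stream, and especially the gate formula (a sigmoid of each state slot's largest attention logit) are all assumptions made for illustration. The captions only state that the gate is computed from the cross-attention maps and fuses the intermediate state with the previous one.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention without learned projections (assumption).

    Returns the attended outputs and the raw attention logits per query.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    outputs, all_logits = [], []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(logits)
        out = [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        outputs.append(out)
        all_logits.append(logits)
    return outputs, all_logits


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sega_step(prev_state, obs):
    """One recurrent step: returns (new_state, enhanced_obs).

    prev_state: K state slots, each a list of D floats (fixed size over time).
    obs:        N observation tokens for the current step, each D floats.
    """
    # State Retrieval stream: observations query the previous state, and the
    # retrieved context is added residually (the residual is an assumption).
    retrieved, _ = cross_attention(obs, prev_state)
    enhanced_obs = [[o + r for o, r in zip(ov, rv)] for ov, rv in zip(obs, retrieved)]

    # State Update stream: state slots query the new observations to form an
    # intermediate evolved state.
    inter_state, logits = cross_attention(prev_state, obs)

    # Hypothetical gate: one scalar in (0, 1) per slot, derived from the
    # attention logits. The paper's actual gate may differ.
    gates = [sigmoid(max(row)) for row in logits]

    # Gated fusion of the intermediate state with the previous state.
    new_state = [
        [g * i + (1.0 - g) * p for i, p in zip(iv, pv)]
        for g, iv, pv in zip(gates, inter_state, prev_state)
    ]
    return new_state, enhanced_obs
```

Calling `sega_step` once per timestep and carrying `new_state` forward keeps the memory at K slots of dimension D no matter how many observations have been processed, which is the sense in which a recurrent update compresses a long horizon into a fixed-size representation.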