Unpacking the Individual Components of Diffusion Policy
Xiu Yuan
TL;DR
Diffusion Policy generates robot actions through a conditional denoising diffusion process. This paper systematically ablates five core components across eight tasks in ManiSkill and Adroit: Observation Sequence Input ($T_o$), Action Sequence Execution ($T_a$ steps executed out of a predicted horizon $T_p$), Receding Horizon Control, Denoising Network Architecture (U-Net/Transformer vs. MLP), and FiLM Conditioning. Key findings: past observations are crucial for Absolute Control but matter less for Delta Control; executing action sequences improves performance in most tasks, while shorter executed sequences help tasks that need real-time control; receding horizon control helps long-horizon planning; U-Net/Transformer backbones perform better on hard tasks; and FiLM conditioning boosts hard-task performance. These results yield practical guidelines for selecting components based on task complexity and horizon, informing future research and industrial deployment of Diffusion Policy.
Abstract
Imitation Learning presents a promising approach for learning generalizable and complex robotic skills. The recently proposed Diffusion Policy generates robot action sequences through a conditional denoising diffusion process, achieving state-of-the-art performance compared to other imitation learning methods. This paper summarizes five key components of Diffusion Policy: 1) observation sequence input; 2) action sequence execution; 3) receding horizon; 4) U-Net or Transformer network architecture; and 5) FiLM conditioning. By conducting experiments across ManiSkill and Adroit benchmarks, this study aims to elucidate the contribution of each component to the success of Diffusion Policy in various scenarios. We hope our findings will provide valuable insights for the application of Diffusion Policy in future research and industry.
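To make the first three components concrete, the following is a minimal sketch of how observation sequence input, action sequence execution, and receding horizon fit together in one control loop. It is an illustration under stated assumptions, not the paper's implementation: the names `rollout`, `toy_policy`, and `toy_env_step` are hypothetical, the toy policy stands in for the denoising network, and the default values of `T_o`, `T_p`, and `T_a` are arbitrary.

```python
from collections import deque


def rollout(policy, env_step, first_obs, T_o=2, T_p=8, T_a=4, max_steps=12):
    """Receding-horizon rollout (hypothetical helper, not from the paper).

    Each cycle conditions on the last T_o observations, predicts T_p actions,
    executes only the first T_a of them, and then replans.
    """
    assert 1 <= T_a <= T_p
    # Observation sequence input: keep the most recent T_o observations,
    # padding with the first observation at the start of the episode.
    obs_history = deque([first_obs] * T_o, maxlen=T_o)
    executed = []
    while len(executed) < max_steps:
        plan = policy(list(obs_history), T_p)  # T_p predicted actions
        for action in plan[:T_a]:  # action sequence execution: first T_a only
            obs_history.append(env_step(action))
            executed.append(action)
            if len(executed) >= max_steps:
                break
    return executed


# Toy stand-ins (assumptions for illustration only).
calls = []


def toy_policy(obs_seq, horizon):
    # Records how many observations it was given, then "predicts" a ramp
    # that continues from the latest observation.
    calls.append(len(obs_seq))
    last = obs_seq[-1]
    return [last + k + 1 for k in range(horizon)]


def toy_env_step(action):
    # The toy environment's next observation is simply the action taken.
    return action


executed = rollout(toy_policy, toy_env_step, first_obs=0)
```

With these defaults the policy is queried once every `T_a` executed steps, so setting `T_a = 1` replans at every step (closest to real-time control), while setting `T_a = T_p` executes each predicted sequence in full before replanning.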
