DiSPo: Diffusion-SSM based Policy Learning for Coarse-to-Fine Action Discretization
Nayoung Oh, Jaehyeong Jang, Moonkyeong Jung, Daehyung Park
TL;DR
DiSPo introduces a diffusion-SSM policy that blends diffusion-based denoising with a Mamba state-space model to enable multi-granularity imitation learning from coarse demonstrations. It modulates action discretization online via step-scale factors and uses pseudo demonstrations to synthesize high-frequency actions, achieving robust coarse-to-fine reproduction in simulation and on UR5e hardware. Key contributions include the step-scaled SSM, the step-scale predictor, and a two-stage training regime (pretraining with frequency augmentation, then fine-tuning on pseudo demonstrations), resulting in improved success rates and inference efficiency. The approach is aimed at scalable, memory-efficient manipulation where fine-grained actions must be generated on demand from coarse demonstrations.
Abstract
We aim to solve the problem of generating coarse-to-fine skills in learning from demonstrations (LfD). To scale precision, traditional LfD approaches often rely on extensive fine-grained demonstrations with external interpolations or dynamics models with limited generalization capabilities. For memory-efficient learning and convenient granularity change, we propose a novel diffusion-SSM based policy (DiSPo) that learns from diverse coarse skills and produces actions at varying control scales by leveraging a state-space model, Mamba. Our evaluations show that adopting Mamba and the proposed step-scaling method enables DiSPo to outperform baselines on three coarse-to-fine benchmark tests, with up to 81% higher success rate. In addition, DiSPo improves inference efficiency by generating coarse motions in less critical regions. Finally, we demonstrate the scalability of actions in simulated and real-world manipulation tasks.
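To make the idea of step scaling concrete, the sketch below shows how scaling the discretization step of a state-space model changes its temporal resolution. It is not the DiSPo implementation: it assumes a scalar (single-channel, diagonal) continuous-time system, the standard zero-order-hold discretization rule, and hypothetical names and values (`zoh_discretize`, `rollout`, `base_delta`, `step_scale`). In DiSPo the step-scale factor is produced by a learned predictor, which is not modeled here.

```python
import math

def zoh_discretize(a, b, delta):
    """Zero-order-hold discretization of the scalar system x' = a*x + b*u."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b  # valid for a != 0
    return a_bar, b_bar

def rollout(a, b, base_delta, step_scale, inputs, x0=0.0):
    """Run x[k+1] = a_bar*x[k] + b_bar*u[k] with step size base_delta*step_scale."""
    a_bar, b_bar = zoh_discretize(a, b, base_delta * step_scale)
    x = x0
    states = []
    for u in inputs:
        x = a_bar * x + b_bar * u
        states.append(x)
    return states

# Hypothetical parameters, chosen only for illustration.
a, b, base_delta = -1.5, 2.0, 0.1

# Coarse rollout: two steps at the base step size.
coarse = rollout(a, b, base_delta, 1.0, [1.0, 0.5])

# Fine rollout: halve the step (step_scale = 0.5) and hold each input twice.
fine = rollout(a, b, base_delta, 0.5, [1.0, 1.0, 0.5, 0.5])
```

With a held input, every second state of the fine rollout coincides with the corresponding coarse state, so a smaller step scale yields intermediate states along the same trajectory rather than a different motion. This is the property that lets a single set of continuous-time parameters serve several action granularities in this toy setting.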
