DiSPo: Diffusion-SSM based Policy Learning for Coarse-to-Fine Action Discretization

Nayoung Oh, Jaehyeong Jang, Moonkyeong Jung, Daehyung Park

TL;DR

DiSPo introduces a diffusion-SSM policy that blends diffusion-based denoising with a Mamba state-space model to enable multi-granularity imitation learning from coarse demonstrations. It modulates action discretization online via step-scale factors and uses pseudo demonstrations to synthesize high-frequency actions, achieving robust coarse-to-fine reproduction in simulation and on UR5e hardware. Key contributions include the step-scaled SSM, the step-scale predictor, and a two-stage training regime (pretraining with frequency augmentation, then fine-tuning on pseudo demonstrations), resulting in improved success rates and inference efficiency. The approach is most relevant to memory-efficient manipulation where fine-grained actions must be generated on demand from coarse demonstrations.

Abstract

We aim to solve the problem of generating coarse-to-fine skills through learning from demonstrations (LfD). To scale precision, traditional LfD approaches often rely on extensive fine-grained demonstrations with external interpolations or dynamics models with limited generalization capabilities. For memory-efficient learning and convenient changes of granularity, we propose a novel diffusion-SSM based policy (DiSPo) that learns from diverse coarse skills and produces actions at varying control scales by leveraging a state-space model, Mamba. Our evaluations show that adopting Mamba and the proposed step-scaling method enables DiSPo to outperform baselines in three coarse-to-fine benchmark tests, with up to an 81% higher success rate. In addition, DiSPo improves inference efficiency by generating coarse motions in less critical regions. Finally, we demonstrate the scalability of actions in simulation and real-world manipulation tasks.
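
The link between a step-scale factor and control frequency can be illustrated with a toy state-space model. The sketch below is not the DiSPo implementation: it assumes that the step-scale factor `r` simply multiplies the step size of a scalar continuous-time SSM before zero-order-hold discretization, and the names `zoh_discretize` and `rollout` are hypothetical. With a held input, two steps at `r = 0.5` land on the same state as one step at `r = 1.0`, which is the property that lets a single set of continuous-time parameters serve several discretization granularities.

```python
import math


def zoh_discretize(a, b, delta, r=1.0):
    """Zero-order-hold discretization of x' = a*x + b*u with step r*delta.

    Assumption: the step-scale factor r rescales the step size delta.
    """
    step = r * delta
    a_bar = math.exp(step * a)
    b_bar = (a_bar - 1.0) / a * b  # valid for a != 0
    return a_bar, b_bar


def rollout(a, b, delta, r, x0, inputs):
    """Run the discretized recurrence x <- a_bar*x + b_bar*u over the inputs."""
    a_bar, b_bar = zoh_discretize(a, b, delta, r)
    x = x0
    states = []
    for u in inputs:
        x = a_bar * x + b_bar * u
        states.append(x)
    return states


# delta = 0.4 s corresponds to 2.5 Hz; r = 0.5 halves the step (5 Hz).
coarse = rollout(a=-1.0, b=1.0, delta=0.4, r=1.0, x0=0.0, inputs=[1.0, 1.0])
fine = rollout(a=-1.0, b=1.0, delta=0.4, r=0.5, x0=0.0, inputs=[1.0] * 4)

# Every second fine state coincides with a coarse state (up to rounding).
print(abs(fine[1] - coarse[0]) < 1e-12, abs(fine[3] - coarse[1]) < 1e-12)
```

In this toy setting the finer rollout adds intermediate states without changing where the coarse states fall. How DiSPo predicts and applies its step-scale factors inside the Mamba block is described in the paper itself and is not reproduced here.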

Paper Structure

This paper contains 10 sections, 7 equations, 9 figures, and 1 table.

Figures (9)

  • Figure 1: Overview of DiSPo: a diffusion-SSM based policy for coarse-to-fine imitation learning. Leveraging the representation power of diffusion policy and the flexible discretization capabilities of Mamba architecture, DiSPo learns from multi-granularity demonstrations (e.g., 2.5Hz and 5Hz) and generates actions at user-intended frequencies. DiSPo demonstrates improved accuracy and inference efficiency in fine-grained manipulation tasks compared to baseline methods.
  • Figure 2: Illustration of the DiSPo architecture. DiSPo takes diffusion step $k$, step-scale factors ${\bf r}_t$, encoded observations $\mathbf{o}_{t-T_o+1:t}$, and noisy actions $\mathbf{a}^{(k)}_{t-T_o+1:t+T_a}$. The model identifies the noise $\hat{\varepsilon}^{(k)}_{t-T_o+1:t+T_a}$ within the input noisy actions through stacked DiSPo blocks and utilizes the identified noise to generate the less noisy action $\mathbf{a}^{(k-1)}_{t-T_o+1:t+T_a}$ from the previous noisy action.
  • Figure 3: (a) A DiSPo block $\mathcal{M}_i$ refines noise-related features in the type-encoded sequence $\mathbf{u}_t^{(i)}$ using adaLN conditioned on the diffusion step embedding $\mathbf{k}$. (b) A step-scaled Mamba block takes ${\bf r}_t$ and $^{\dagger}\mathbf{u}_t^{(i)}$.
  • Figure 5: Generating a pseudo demonstration for fine-tuning. Starting from Gaussian noise $\varepsilon^{(K)}$ and a reference sequence $\tau_0$, the model iteratively denoises and replaces the $w_0$-frequency actions in the less noisy action sequence with noise-added reference actions $\mathbf{a}_{w_0}\in\tau_0$. We repeat this process until the model generates a noise-free action sequence at the target frequency, $\mathbf{a}_{w_{\text{target}}}^{(0)}$, which we refer to as a pseudo demonstration.
  • Figure 6: Illustrations of the three simulation benchmarks: clamp passing, passage passing, and button touch. Dots denote either demonstrations at 2.5Hz or predicted actions from DiSPo and baselines.
  • ...and 4 more figures
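
The replacement loop described in the Figure 5 caption can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' procedure: actions are scalars, `toy_denoise` is a hypothetical stand-in for the trained model, the noise schedule is assumed linear, and the target frequency is assumed to be an integer multiple (`upsample`) of the reference frequency. After each denoising step, the entries that coincide with reference actions are overwritten with the reference plus noise at the current level, so the final sequence passes exactly through the coarse reference.

```python
import random


def toy_denoise(seq):
    """Hypothetical stand-in for the trained denoiser: smooth interior points."""
    out = list(seq)
    for i in range(1, len(seq) - 1):
        out[i] = 0.5 * seq[i] + 0.25 * (seq[i - 1] + seq[i + 1])
    return out


def pseudo_demo(reference, upsample=2, num_steps=20, seed=0):
    """Denoise from Gaussian noise while re-imposing noised reference actions."""
    rng = random.Random(seed)
    length = (len(reference) - 1) * upsample + 1
    actions = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for k in range(num_steps, 0, -1):
        actions = toy_denoise(actions)
        sigma = (k - 1) / num_steps  # assumed linear schedule; zero at the last step
        for j, ref in enumerate(reference):
            # Overwrite the entries that align with the coarse reference.
            actions[j * upsample] = ref + sigma * rng.gauss(0.0, 1.0)
    return actions


reference = [0.0, 1.0, 2.0, 3.0]  # coarse reference actions
demo = pseudo_demo(reference)     # sequence at twice the reference frequency
print(len(demo), demo[::2] == reference)
```

Because the noise level reaches zero on the final step, the reference-aligned entries equal the reference exactly, while the entries in between are whatever the denoiser produced. In the paper those in-between actions come from the pretrained policy rather than a smoothing filter.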