Score and Distribution Matching Policy: Advanced Accelerated Visuomotor Policies via Matched Distillation

Bofang Jia, Pengxiang Ding, Can Cui, Mingyang Sun, Pengfang Qian, Siteng Huang, Zhaoxin Fan, Donglin Wang

TL;DR

This work tackles the slow inference of diffusion-based visuomotor policies by introducing the Score and Distribution Matching Policy (SDM Policy), which distills a diffusion teacher into a fast one-step generator. SDM Policy combines score matching to align the generated actions with the true action distribution and distribution matching (KL divergence) to enforce global consistency, guided by a dual-teacher framework consisting of a frozen stabilizer and an unfrozen adversarial guide. Across a 57-task simulated benchmark, SDM Policy achieves approximately a 6× speedup with state-of-the-art action quality, bringing diffusion-based control into practical high-frequency robotics. The approach enables reliable, efficient visuomotor policies and highlights a promising direction for fast, accurate imitation learning in dynamic robotic tasks.

Abstract

Visuomotor policy learning has advanced with architectures like diffusion-based policies, known for modeling complex robotic trajectories. However, their prolonged inference times hinder high-frequency control tasks requiring real-time feedback. While consistency distillation (CD) accelerates inference, it introduces errors that compromise action quality. To address these limitations, we propose the Score and Distribution Matching Policy (SDM Policy), which transforms diffusion-based policies into single-step generators through a two-stage optimization process: score matching ensures alignment with true action distributions, and distribution matching minimizes KL divergence for consistency. A dual-teacher mechanism integrates a frozen teacher for stability and an unfrozen teacher for adversarial training, enhancing robustness and alignment with target distributions. Evaluated on a 57-task simulation benchmark, SDM Policy achieves a 6× inference speedup while maintaining state-of-the-art action quality, providing an efficient and reliable framework for high-frequency robotic tasks.
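
As a rough illustration of how score matching and KL-based distribution matching fit together, the identity below is the standard form used in distribution-matching distillation. It is a sketch under assumed notation, not the paper's own equation. The symbols $G_\theta$ (one-step generator), $z$ (input noise), $p_{\text{gen}}$ and $p_{\text{data}}$ (generated and target action distributions), and $s_{\text{gen}}$ and $s_{\text{data}}$ (their scores, i.e. gradients of the log-densities) are introduced here for illustration only.

$$
\nabla_\theta \, D_{\mathrm{KL}}\!\left(p_{\text{gen}} \,\|\, p_{\text{data}}\right)
= \mathbb{E}_{z}\!\left[\left(s_{\text{gen}}(a) - s_{\text{data}}(a)\right)^{\top} \frac{\partial G_\theta(z)}{\partial \theta}\right],
\qquad a = G_\theta(z)
$$

Under this reading, the generator needs only the difference between two score estimates to descend the KL divergence. This suggests why two score-estimating teachers could be useful: one representing the target distribution and one tracking the generator's current output.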

Paper Structure

This paper contains 23 sections, 6 equations, 8 figures, 12 tables, 1 algorithm.

Figures (8)

  • Figure 1: SDM Policy is a visual imitation learning algorithm that trains a one-step generator by enforcing a matching loss between two distributions. This approach balances fast inference speed and action accuracy, achieving state-of-the-art performance. (a) illustrates the principle of our method, (b) compares SDM Policy with diffusion policy and the current SOTA method (ManiCM), and (c) demonstrates that our method surpasses the current SOTA in task success rate and inference speed, showing that the quality of our actions is closer to that of the teacher model, resulting in more accurate action learning.
  • Figure 2: Overview of SDM Policy. Our method distills diffusion policies, which require long inference times and high computational costs, into a fast and stable one-step generator. Our SDM Policy is represented by the one-step generator, which requires continual correction and optimization via the Corrector during training, but relies solely on the generator during evaluation. The corrector's optimization is based on two components: gradient optimization and diffusion optimization. The gradient optimization part primarily involves optimizing the entire distribution by minimizing the KL divergence between two distributions, $P_{\theta}$ and $D_{\theta}$, with distribution details represented through a score function that guides the gradient update direction, providing a clear signal. The diffusion optimization component enables $D_{\theta}$ to quickly track changes in the one-step generator’s output, maintaining consistency. Details on loading observational data for both evaluation and training processes are provided above the diagram. Our method applies to both 2D and 3D scenarios.
  • Figure 3: Performance of score estimation in low-density regions. The purple rectangle represents low-density regions, and the pink rectangle represents high-density regions. For the entire rectangle, darker colors indicate higher density. The left image shows the true data scores, while the right image shows the estimated scores. In the high-density pink rectangle, the difference between the estimated and true scores is minimal. However, in the low-density purple rectangle, the difference between the estimated and true scores is significantly larger, indicating poor score matching performance in low-density regions.
  • Figure 4: Comparison of SDM Policy and consistency distillation. We compare the training processes of consistency distillation, latent consistency distillation, and our SDM Policy. Consistency distillation suffers from significant deviations in one-step generation due to error accumulation, while latent consistency distillation quickly overlooks the need for global consistency. In contrast, our method aligns and learns at the distribution level, effectively addressing both issues.
  • Figure 5: Learning efficiency. We sampled 10 simulation tasks and presented the learning curves of our SDM Policy alongside DP3 and ManiCM. SDM Policy demonstrated a rapid convergence rate. In contrast, ManiCM showed slower learning progress, and DP3’s convergence speed was also slower than our method.
  • ...and 3 more figures
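
The KL-based gradient update described in the Figure 2 caption can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation. It assumes a 1D Gaussian target whose score is known in closed form, whereas SDM Policy estimates scores with learned diffusion networks. The function names `gaussian_score` and `distill_one_step` and all hyperparameters are hypothetical.

```python
import numpy as np


def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of N(mean, std^2) evaluated at x."""
    return -(x - mean) / std**2


def distill_one_step(target_mean=2.0, target_std=0.5,
                     steps=500, lr=0.05, batch=1024, seed=0):
    """Fit a one-step generator x = mu + sigma * z to a Gaussian target.

    The update uses only the difference between the generator's score and
    the target's score, which is the gradient of KL(generator || target).
    Assumption: both scores are analytic here; in practice they would be
    estimated by score networks.
    """
    rng = np.random.default_rng(seed)
    mu, log_sigma = 0.0, 0.0  # generator parameters (sigma kept positive via exp)
    for _ in range(steps):
        z = rng.standard_normal(batch)
        sigma = np.exp(log_sigma)
        x = mu + sigma * z                                   # one-step generation
        s_real = gaussian_score(x, target_mean, target_std)  # target score
        s_fake = gaussian_score(x, mu, sigma)                # generator score
        diff = s_fake - s_real
        # Chain rule: dx/dmu = 1 and dx/dlog_sigma = sigma * z.
        grad_mu = np.mean(diff)
        grad_log_sigma = np.mean(diff * sigma * z)
        mu -= lr * grad_mu
        log_sigma -= lr * grad_log_sigma
    return mu, np.exp(log_sigma)


if __name__ == "__main__":
    mu, sigma = distill_one_step()
    print(f"generator mean={mu:.3f}, std={sigma:.3f}")
```

In this toy setting the generator's mean and standard deviation move toward the target's values without the generator ever seeing a target sample. It is driven purely by the score difference, which is the role the caption assigns to the score function that guides the gradient update direction.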