ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation

Guanxing Lu, Zifeng Gao, Tianxing Chen, Wenxun Dai, Ziwei Wang, Wenbo Ding, Yansong Tang

TL;DR

The paper tackles the slow inference of diffusion-based policies for 3D robotic manipulation. It proposes ManiCM, a manipulation consistency model that enforces self-consistency to enable one-step action generation conditioned on 3D point clouds, and trains it by consistency distillation from a teacher diffusion model. The approach yields about a 10x speedup while maintaining a competitive success rate across 31 tasks in Adroit and Metaworld, with real-world validation on UR3e hardware. This brings diffusion-based policies closer to real-time deployment in complex 3D manipulation scenarios and lays groundwork for high-frequency robot control.

Abstract

Diffusion models have proven effective at generating complex distributions, from natural images to motion trajectories. Recent diffusion-based methods show impressive performance in 3D robotic manipulation tasks, but they suffer from severe runtime inefficiency due to multiple denoising steps, especially with high-dimensional observations. To this end, we propose a real-time robotic manipulation model named ManiCM that imposes the consistency constraint on the diffusion process, so that the model can generate robot actions in a single inference step. Specifically, we formulate a consistent diffusion process in the robot action space conditioned on the point cloud input, where the original action is required to be directly denoised from any point along the ODE trajectory. To model this process, we design a consistency distillation technique that predicts the action sample directly, rather than predicting the noise as is common in the vision community, for fast convergence on the low-dimensional action manifold. We evaluate ManiCM on 31 robotic manipulation tasks from Adroit and Metaworld, and the results demonstrate that our approach accelerates the state-of-the-art method by 10 times in average inference speed while maintaining a competitive average success rate.
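To make the runtime argument concrete, the following minimal sketch counts network forward passes for multi-step denoising versus one-step generation. It is an illustration under stated assumptions, not the authors' implementation: the cosine `schedule`, the value `T = 100`, the choice of 10 denoising steps, and `ToySamplePredictor` (a stand-in that simply returns the conditioning feature as the clean action) are all hypothetical.

```python
import math
import random

T = 100  # hypothetical number of diffusion timesteps


def schedule(t):
    """Toy cosine schedule (an assumption) with alpha^2 + sigma^2 = 1."""
    angle = 0.5 * math.pi * t / T
    return math.cos(angle), math.sin(angle)


class ToySamplePredictor:
    """Stand-in for a trained network that predicts the clean action."""

    def __init__(self):
        self.calls = 0  # number of forward passes so far

    def __call__(self, noisy_action, t, obs_feature):
        self.calls += 1
        # Toy rule, not a real policy: the clean action equals the conditioning feature.
        return list(obs_feature)


def multi_step_sample(model, obs_feature, num_steps, rng):
    """Deterministic DDIM-style sampling: one forward pass per denoising step."""
    timesteps = [round(T * (num_steps - i) / num_steps) for i in range(num_steps)]
    action = [rng.gauss(0.0, 1.0) for _ in obs_feature]  # start from pure noise
    a0_hat = action
    for i, t in enumerate(timesteps):
        a0_hat = model(action, t, obs_feature)  # predict the clean action
        if i + 1 < len(timesteps):
            # Recover the implied noise, then move to the next (less noisy) timestep.
            alpha_t, sigma_t = schedule(t)
            eps_hat = [(a - alpha_t * a0) / sigma_t for a, a0 in zip(action, a0_hat)]
            alpha_n, sigma_n = schedule(timesteps[i + 1])
            action = [alpha_n * a0 + sigma_n * e for a0, e in zip(a0_hat, eps_hat)]
    return a0_hat


def one_step_sample(model, obs_feature, rng):
    """Consistency-style generation: a single forward pass from pure noise."""
    action = [rng.gauss(0.0, 1.0) for _ in obs_feature]
    return model(action, T, obs_feature)


if __name__ == "__main__":
    rng = random.Random(0)
    obs = [0.1, -0.3, 0.7]  # hypothetical point-cloud feature
    slow, fast = ToySamplePredictor(), ToySamplePredictor()
    multi_step_sample(slow, obs, num_steps=10, rng=rng)
    one_step_sample(fast, obs, rng)
    print("forward passes:", slow.calls, "vs", fast.calls)
```

In this toy setting the multi-step sampler performs ten forward passes and the one-step sampler performs one. Since the forward pass typically dominates the cost of each decision, the number of passes is a rough proxy for the kind of speedup the abstract reports.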

Paper Structure (15 sections, 7 equations, 6 figures, 4 tables)

This paper contains 15 sections, 7 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Trade-off Between Efficiency and Effectiveness. We present ManiCM, a real-time 3D diffusion policy by imposing the consistency constraint on the diffusion process. BCRNN+3D [mandlekar2021matters] is a state-of-the-art perceptive model-based behavior cloning agent augmented with 3D point cloud input. DP [chi2023diffusionpolicy] and DP3 [Ze2024DP3] are the state-of-the-art diffusion-based manipulation agents. ManiCM achieves a decision-making runtime of 16 ms, which is 10$\times$ faster than previous mainstream methods.
  • Figure 2: Overall Pipeline. Given a raw action sequence $\boldsymbol{a}_{0}$, we first perform a forward diffusion to introduce noise over $n + k$ steps. The resulting noisy sequence $\boldsymbol{a}_{n+k}$ is then fed into both the online network and the teacher network to predict the clean action sequence. The target network uses the teacher network's $k$-step estimation results to predict the action sequence. To enforce self-consistency, a loss function is applied to ensure that the outputs of the online network and the target network are consistent.
  • Figure 3: Learning Curve. Learning curves of ManiCM with sample prediction vs. noise prediction. ManiCM converges markedly faster when it predicts the action sample directly than when it predicts the noise.
  • Figure 4: Qualitative Comparisons. We compare ManiCM with the state-of-the-art method DP3 [Ze2024DP3] in two typical manipulation tasks from Adroit and Metaworld, respectively. With only one-step inference, ManiCM achieves the fastest action generation while producing high-quality motions that successfully complete the tasks.
  • Figure 5: Multi-Task Performance Comparison. ManiCM achieves accuracy comparable to 3D Diffuser Actor [3d_diffuser_actor] while running inference 13.3$\times$ faster on RLBench.
  • ...and 1 more figures
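
The pipeline described in the Figure 2 caption can be sketched as a single loss computation. This is a minimal illustration under stated assumptions, not the authors' code: the cosine `schedule`, `T = 100`, the toy networks `perfect` and `biased`, and the choice to realise the teacher's $k$-step estimate as one deterministic DDIM-style jump from step $n + k$ to step $n$ are all hypothetical. All three networks predict the clean action sample, as the caption describes.

```python
import math
import random

T = 100  # hypothetical number of diffusion timesteps


def schedule(t):
    """Toy cosine schedule (an assumption) with alpha^2 + sigma^2 = 1."""
    angle = 0.5 * math.pi * t / T
    return math.cos(angle), math.sin(angle)


def add_noise(a0, eps, t):
    """Forward diffusion: noise the clean action a0 up to timestep t."""
    alpha, sigma = schedule(t)
    return [alpha * a + sigma * e for a, e in zip(a0, eps)]


def teacher_jump(a_t, a0_hat, t, t_next):
    """Deterministic DDIM-style move from t to t_next using a clean-action estimate."""
    alpha_t, sigma_t = schedule(t)
    eps_hat = [(a - alpha_t * a0) / sigma_t for a, a0 in zip(a_t, a0_hat)]
    alpha_n, sigma_n = schedule(t_next)
    return [alpha_n * a0 + sigma_n * e for a0, e in zip(a0_hat, eps_hat)]


def consistency_loss(online, target, teacher, a0, obs, n, k, rng):
    """Mean squared gap between online and target predictions of the clean action."""
    eps = [rng.gauss(0.0, 1.0) for _ in a0]
    a_nk = add_noise(a0, eps, n + k)                 # noisy sequence a_{n+k}
    pred_online = online(a_nk, n + k, obs)           # online network on a_{n+k}
    a_n = teacher_jump(a_nk, teacher(a_nk, n + k, obs), n + k, n)  # teacher estimate of a_n
    pred_target = target(a_n, n, obs)                # target network on a_n
    return sum((p - q) ** 2 for p, q in zip(pred_online, pred_target)) / len(a0)


def perfect(noisy_action, t, obs):
    """Toy network: pretends the clean action equals the observation feature."""
    return list(obs)


def biased(noisy_action, t, obs):
    """Toy network with a constant prediction error of 0.5 per dimension."""
    return [o + 0.5 for o in obs]


if __name__ == "__main__":
    rng = random.Random(0)
    a0 = obs = [0.2, -0.4]
    print(consistency_loss(perfect, perfect, perfect, a0, obs, n=40, k=20, rng=rng))
    print(consistency_loss(biased, perfect, perfect, a0, obs, n=40, k=20, rng=rng))
```

When the online and target networks agree, the loss is zero, and any disagreement between them is penalised. In standard consistency distillation the target network is usually a slowly updated copy of the online network that receives no gradient. That detail is an assumption here, since the caption does not state it.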