Hybrid Consistency Policy: Decoupling Multi-Modal Diversity and Real-Time Efficiency in Robotic Manipulation

Qianyou Zhao, Yuliang Shen, Xuanran Zhai, Ce Hao, Duidi Wu, Jin Qi, Jie Hu, Qiaojun Yu

TL;DR

This work targets fast, real-time, multi-modal control with diffusion-based robotic policies. It introduces the Hybrid Consistency Policy (HCP), which runs a short stochastic prefix up to an adaptive switching time $t_s^\ast$ and then performs a one-step consistency jump along the probability-flow ODE. The jump is trained by time-varying consistency distillation with two losses, $L_{CTM}$ and $L_{DSM}$. The stochastic prefix preserves diversity, the deterministic jump gives fast, coherent action generation, and a switching-time criterion that detects stable mode bifurcations decides where one hands over to the other. In both simulation and real-robot experiments, HCP narrows the performance gap to the DDPM teacher while substantially reducing action-generation latency, and it maintains multi-modal coverage across tasks, demonstrating a practical accuracy-efficiency trade-off for robotic policies.

Abstract

In visuomotor policy learning, diffusion-based imitation learning has become widely adopted for its ability to capture diverse behaviors. However, approaches built on ordinary (ODE) or stochastic (SDE) denoising processes struggle to jointly achieve fast sampling and strong multi-modality. To address these challenges, we propose the Hybrid Consistency Policy (HCP). HCP runs a short stochastic prefix up to an adaptive switch time, and then applies a one-step consistency jump to produce the final action. To align this one-jump generation, HCP performs time-varying consistency distillation that combines a trajectory-consistency objective, which keeps neighboring predictions coherent, with a denoising-matching objective, which improves local fidelity. In both simulation and on a real robot, HCP with 25 SDE steps plus one jump approaches the 80-step DDPM teacher in accuracy and mode coverage while significantly reducing latency. These results show that multi-modality does not require slow inference, and that a switch time decouples mode retention from speed. The result is a practical accuracy-efficiency trade-off for robot policies.
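The control flow described in the abstract (a short stochastic prefix followed by a single jump to the final action) can be sketched as below. This is a minimal illustration, not the authors' implementation. The names `hybrid_sample`, `toy_sde_step`, and `toy_consistency_jump` are hypothetical, and the two toy functions are stand-ins for the DDPM teacher's reverse step and the distilled student on a 1-D problem with modes at -1 and +1. The switch point is passed in as a fixed index, whereas the paper selects it adaptively; that criterion is not reproduced here.

```python
import numpy as np


def hybrid_sample(x_T, timesteps, switch_index, sde_step, consistency_jump, rng):
    """Run `switch_index` stochastic steps, then one deterministic jump to t = 0.

    `sde_step` and `consistency_jump` are caller-supplied callables
    (hypothetical interfaces, assumed for this sketch).
    """
    x = x_T
    for i in range(switch_index):
        # Stochastic prefix: noise injected here lets samples commit to different modes.
        x = sde_step(x, timesteps[i], timesteps[i + 1], rng)
    t_s = timesteps[switch_index]
    # One-step jump from the switch time straight to the final action.
    return consistency_jump(x, t_s), t_s


def toy_sde_step(x, t, t_next, rng):
    """Toy stand-in for a reverse SDE step: drift toward the nearest mode plus noise."""
    dt = t - t_next
    nearest_mode = np.where(x >= 0, 1.0, -1.0)
    drift = 2.0 * (nearest_mode - x) * dt
    noise = t * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x + drift + noise


def toy_consistency_jump(x, t_s):
    """Toy stand-in for the distilled student: map each sample to its nearest mode."""
    return np.where(x >= 0, 1.0, -1.0)


rng = np.random.default_rng(0)
timesteps = np.linspace(1.0, 0.0, 81)   # 80 intervals, mirroring the 80-step teacher
x_T = rng.standard_normal(256)          # batch of 256 one-dimensional noise samples
actions, t_s = hybrid_sample(
    x_T, timesteps, switch_index=25,    # 25 stochastic steps, then one jump
    sde_step=toy_sde_step, consistency_jump=toy_consistency_jump, rng=rng,
)
```

With these settings the sampler makes 26 calls (25 prefix steps plus one jump) where a full 80-step chain makes 80, which is where the latency saving comes from. In the toy problem both modes still appear in `actions` because the samples split during the stochastic prefix, before the deterministic jump is applied.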

Paper Structure

This paper contains 15 sections, 17 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Hybrid Consistency Policy. SDE models capture multi-modal behaviors but sample slowly, while ODE models are fast yet prone to mode collapse. HCP runs a short stochastic SDE prefix in the high-noise region to form branches, then at an adaptive switch time $t_s^\ast$ performs a one-step consistency jump along the probability-flow ODE to $x_0$. This yields fast sampling and reliable multi-distribution execution, illustrated by distinct successful robot trajectories.
  • Figure 2: Overview of the HCP architecture. (a) Policy pipeline: robot state and multi-view images are encoded by MLP and ResNet. Inference executes action steps at equal time intervals through an action chunk. (b) Hybrid sampling: a DDPM teacher supplies stochastic trajectories, while a student is trained via consistency distillation to satisfy a one-step ODE mapping in the contract region.
  • Figure 3: Real-world setup and sensors. A 7-DoF collaborative arm operates in a fixed workspace with calibrated TCP pose. A wrist camera (RealSense D415) and a third-person camera (RealSense D435) provide multi-view observations.
  • Figure 4: Real-world tasks. (a), (b) and (c) are multi-modal tasks; (d), (e) and (f) are single-modal tasks. All demonstrations are collected via VR teleoperation with 60 successful demos per task.
  • Figure 5: Performance of HCP in real-world multi-modal tasks. HCP successfully achieves multiple modes for each task, demonstrating performance close to the Teacher in terms of accuracy and multi-modal capabilities.
  • ...and 3 more figures