Hybrid Consistency Policy: Decoupling Multi-Modal Diversity and Real-Time Efficiency in Robotic Manipulation
Qianyou Zhao, Yuliang Shen, Xuanran Zhai, Ce Hao, Duidi Wu, Jin Qi, Jie Hu, Qiaojun Yu
TL;DR
This work tackles the challenge of achieving real-time, multi-modal control with diffusion-based robotic policies. It introduces the Hybrid Consistency Policy (HCP), which runs a short stochastic prefix up to an adaptive switching time $t_s^*$ and then performs a one-step consistency jump along the probability-flow ODE. The jump is guided by time-varying consistency distillation with losses $L_{CTM}$ and $L_{DSM}$. The stochastic prefix preserves diversity, the deterministic jump provides fast, coherent action generation, and a switching-time criterion that detects stable mode bifurcations decides when to hand over from one to the other. In both simulation and real-robot experiments, HCP narrows the performance gap to the DDPM teacher while substantially reducing action-generation latency and maintaining multi-modal coverage across tasks, demonstrating a practical accuracy-efficiency trade-off for robotic policies.
Abstract
In visuomotor policy learning, diffusion-based imitation learning has become widely adopted for its ability to capture diverse behaviors. However, approaches built on ordinary and stochastic denoising processes struggle to jointly achieve fast sampling and strong multi-modality. To address these challenges, we propose the Hybrid Consistency Policy (HCP). HCP runs a short stochastic prefix up to an adaptive switch time and then applies a one-step consistency jump to produce the final action. To align this one-jump generation, HCP performs time-varying consistency distillation that combines a trajectory-consistency objective, which keeps neighboring predictions coherent, with a denoising-matching objective, which improves local fidelity. Both in simulation and on a real robot, HCP with 25 SDE steps plus one jump approaches the 80-step DDPM teacher in accuracy and mode coverage while significantly reducing latency. These results show that multi-modality does not require slow inference and that a switch time decouples mode retention from speed. HCP thus yields a practical accuracy-efficiency trade-off for robot policies.
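The two-phase sampling described in the abstract (a stochastic prefix followed by a single jump) can be illustrated with a minimal toy sketch. This is not the authors' implementation: it assumes a 1D action whose target distribution has two equally likely modes at +1 and -1, a variance-exploding noise schedule, a closed-form denoiser in place of a learned network, and a fixed switch index rather than the adaptive switch time. All function and variable names (`posterior_mean`, `sde_step`, `consistency_jump`, `hybrid_sample`) are hypothetical.

```python
import numpy as np

def posterior_mean(x, t):
    # Exact E[x0 | x_t] when x0 is +1 or -1 with equal probability and
    # x_t = x0 + t * noise. Toy stand-in for a learned denoiser.
    return np.tanh(x / t**2)

def sde_step(x, t_cur, t_next, rng):
    # One stochastic (ancestral-style) step from noise level t_cur to t_next,
    # sampling the Gaussian posterior of x_{t_next} given x_t and x0_hat.
    x0_hat = posterior_mean(x, t_cur)
    ratio = (t_next / t_cur) ** 2
    mean = x0_hat + ratio * (x - x0_hat)
    std = t_next * np.sqrt(1.0 - ratio)
    return mean + std * rng.standard_normal(x.shape)

def consistency_jump(x, t_s):
    # Toy stand-in for a distilled one-step map from time t_s to t = 0.
    # For this symmetric two-mode target it snaps to the nearer mode.
    return np.where(x >= 0.0, 1.0, -1.0)

def hybrid_sample(x_init, timesteps, switch_index, rng):
    # Stochastic prefix: steps 0 .. switch_index - 1 inject fresh noise,
    # so samples can still commit to either mode.
    x = x_init
    for i in range(switch_index):
        x = sde_step(x, timesteps[i], timesteps[i + 1], rng)
    # Deterministic suffix: a single jump from the switch time to the action.
    return consistency_jump(x, timesteps[switch_index])

rng = np.random.default_rng(0)
timesteps = np.linspace(5.0, 0.5, 26)              # 25 prefix steps (assumed schedule)
x_init = timesteps[0] * rng.standard_normal(1000)  # approximate draw at the largest noise level
actions = hybrid_sample(x_init, timesteps, switch_index=25, rng=rng)
```

In this sketch, lowering `switch_index` shortens the stochastic prefix and hands more of the trajectory to the single jump, which is the speed versus mode-retention knob the abstract refers to. In HCP the switch time is chosen adaptively; here it is a fixed integer purely for illustration.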
