BiKC: Keypose-Conditioned Consistency Policy for Bimanual Robotic Manipulation

Dongjie Yu, Hang Xu, Yizhou Chen, Yi Ren, Jia Pan

TL;DR

BiKC targets reliable and fast multi-stage bimanual manipulation with a hierarchical policy that combines a high-level keypose predictor with a low-level consistency model (CM) trajectory generator trained from scratch. The keyposes serve as subgoals and stage boundaries, guiding short-horizon action sequences that are generated with one-step inference to reduce latency. Across simulated MuJoCo tasks and real-world ALOHA experiments, BiKC improves overall task success rate and operational efficiency over the Action Chunking with Transformers (ACT), Diffusion Policy (DP), and Consistency Policy (CP) baselines, while the CM still models multi-modal demonstrations. The work highlights the value of combining keypose-driven planning with fast consistency-based visuomotor policies for practical bimanual manipulation, and discusses limitations and future directions involving force sensing and scene-centric representations.
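To make "one-step inference" concrete, the following is a minimal sketch of single-step sampling from a consistency model, using the skip/output parameterization that is standard in the consistency-model literature. Whether BiKC uses exactly this parameterization is an assumption, as are the constants `SIGMA_DATA`, `EPS`, and `T_MAX` (common defaults, not values taken from this page). The function `toy_network` is a hypothetical stand-in for the learned network, and the action sequence is flattened to a plain list of floats.

```python
import math
import random

# Assumed constants: common defaults in the consistency-model literature,
# not values reported on this page.
SIGMA_DATA = 0.5   # assumed scale of the data
EPS = 0.002        # smallest noise level
T_MAX = 80.0       # largest noise level


def c_skip(t):
    """Weight of the noisy input; equals 1 at t = EPS."""
    return SIGMA_DATA ** 2 / ((t - EPS) ** 2 + SIGMA_DATA ** 2)


def c_out(t):
    """Weight of the network output; equals 0 at t = EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)


def toy_network(x, t, cond):
    """Hypothetical stand-in for the learned network.

    It ignores x and t and returns the conditioning vector rescaled by
    1 / SIGMA_DATA, so the toy sample lands near `cond`.
    """
    return [c / SIGMA_DATA for c in cond]


def consistency_fn(x, t, cond):
    """f(x, t) = c_skip(t) * x + c_out(t) * F(x, t), so f(x, EPS) = x."""
    raw = toy_network(x, t, cond)
    return [c_skip(t) * xi + c_out(t) * ri for xi, ri in zip(x, raw)]


def sample_one_step(cond, rng):
    """Draw pure noise at level T_MAX and map it to a sample in one call."""
    x_T = [rng.gauss(0.0, T_MAX) for _ in cond]
    return consistency_fn(x_T, T_MAX, cond)


if __name__ == "__main__":
    print(sample_one_step([0.3, -0.1, 0.2], random.Random(0)))
```

The point of the sketch is the cost model: a diffusion policy would call its network many times per action sequence, whereas here `sample_one_step` calls it once.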

Abstract

Bimanual manipulation tasks typically involve multiple stages that require efficient interaction between two arms, posing step-wise and stage-wise challenges for imitation learning systems. Specifically, the failure or delay of one step propagates through time, hindering the success and efficiency of each sub-stage task and thereby degrading overall task performance. Although recent works have made strides in addressing some of these challenges, few approaches explicitly consider the multi-stage nature of bimanual tasks while also emphasizing inference speed. In this paper, we introduce a novel keypose-conditioned consistency policy tailored for bimanual manipulation. It is a hierarchical imitation learning framework that consists of a high-level keypose predictor and a low-level trajectory generator. The predicted keyposes provide guidance for trajectory generation and also mark the completion of each sub-stage task. The trajectory generator is a consistency model trained from scratch without distillation; it generates action sequences conditioned on current observations and predicted keyposes, with fast inference. Simulated and real-world experimental results demonstrate that the proposed approach surpasses baseline methods in success rate and operational efficiency. Code is available at https://github.com/ManUtdMoon/BiKC.
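The two-level structure described in the abstract can be sketched as a simple control loop. This is a toy illustration under stated assumptions, not the authors' implementation: `TOY_KEYPOSES`, `predict_keypose`, `generate_actions`, and `reached` are hypothetical names, the learned keypose predictor is replaced by a fixed list, the consistency-model generator is replaced by a clipped step toward the keypose, and detecting sub-stage completion with a distance threshold is an assumption.

```python
from typing import List, Tuple

Pose = List[float]

# Hypothetical sub-stage targets. In the paper these come from a learned
# high-level predictor; a fixed list stands in for it here.
TOY_KEYPOSES: List[Pose] = [[0.5, 0.0], [0.5, 0.5], [0.0, 0.5]]


def predict_keypose(stage: int) -> Pose:
    """Stand-in for the high-level keypose predictor."""
    return TOY_KEYPOSES[stage]


def generate_actions(joints: Pose, keypose: Pose,
                     horizon: int = 4, step: float = 0.1) -> List[Pose]:
    """Stand-in for the low-level generator.

    Returns a short chunk of joint targets, each moving at most `step`
    per joint toward the keypose.
    """
    actions = []
    current = list(joints)
    for _ in range(horizon):
        current = [c + max(-step, min(step, k - c))
                   for c, k in zip(current, keypose)]
        actions.append(current)
    return actions


def reached(joints: Pose, keypose: Pose, tol: float = 1e-3) -> bool:
    """Assumed completion test: every joint is within `tol` of the keypose."""
    return max(abs(j - k) for j, k in zip(joints, keypose)) < tol


def run_episode(start: Pose, max_chunks: int = 100) -> Tuple[Pose, List[int]]:
    joints = list(start)
    stage = 0
    completed: List[int] = []
    for _ in range(max_chunks):
        if stage == len(TOY_KEYPOSES):
            break
        keypose = predict_keypose(stage)            # high level: pick the subgoal
        for action in generate_actions(joints, keypose):
            joints = action                         # toy robot tracks commands exactly
        if reached(joints, keypose):                # keypose marks sub-stage completion
            completed.append(stage)
            stage += 1
    return joints, completed


if __name__ == "__main__":
    print(run_episode([0.0, 0.0]))
```

In this sketch the keypose plays both roles named in the abstract: it conditions every action chunk, and reaching it is what advances the policy to the next sub-stage.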

Paper Structure (15 sections, 10 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Illustration of the keypose-conditioned workflow.
  • Figure 2: Extracting bimanual keyposes by merging those of each arm.
  • Figure 3: Extracting keypose and local trajectory samples from demonstrations.
  • Figure 4: (a) Snapshots of sub-stages and predicted keyposes (green for the left arm, orange for the right arm). (b) Rendered RGB observation.
  • Figure 5: Real-world ALOHA platform.
  • ...and 2 more figures