COMBO-Grasp: Learning Constraint-Based Manipulation for Bimanual Occluded Grasping

Jun Yamada, Alexander L. Mitchell, Jack Collins, Ingmar Posner

TL;DR

COMBO-Grasp tackles occluded grasping by decoupling stabilisation and manipulation into two coordinated policies: a self-supervised constraint policy that stabilises the object with one arm, and a reinforcement learning grasping policy that reorients and grasps with the other arm. A key innovation is value-function-guided policy coordination, which refines the constraint output via gradients from a jointly trained value function to improve bimanual coordination and sample efficiency. The approach is complemented by teacher-student distillation, enabling reliable sim-to-real transfer using vision-based student policies that operate on point clouds. Empirical results show superior task success and generalisation to unseen objects in both simulation and real-world settings, outperforming competitive baselines. The work offers a practical, data-efficient pathway for robust bimanual occluded grasping in cluttered or constrained environments.

Abstract

This paper addresses the challenge of occluded robot grasping, i.e. grasping in situations where the desired grasp poses are kinematically infeasible due to environmental constraints such as surface collisions. Traditional robot manipulation approaches struggle with the complexity of non-prehensile or bimanual strategies commonly used by humans in these circumstances. State-of-the-art reinforcement learning (RL) methods are unsuitable due to the inherent complexity of the task. In contrast, learning from demonstration requires collecting a significant number of expert demonstrations, which is often infeasible. Instead, inspired by human bimanual manipulation strategies, where two hands coordinate to stabilise and reorient objects, we focus on a bimanual robotic setup to tackle this challenge. In particular, we introduce Constraint-based Manipulation for Bimanual Occluded Grasping (COMBO-Grasp), a learning-based approach which leverages two coordinated policies: a constraint policy trained using self-supervised datasets to generate stabilising poses and a grasping policy trained using RL that reorients and grasps the target object. A key contribution lies in value function-guided policy coordination. Specifically, during RL training for the grasping policy, the constraint policy's output is refined through gradients from a jointly trained value function, improving bimanual coordination and task performance. Lastly, COMBO-Grasp employs teacher-student policy distillation to effectively deploy point cloud-based policies in real-world environments. Empirical evaluations demonstrate that COMBO-Grasp significantly improves task success rates compared to competitive baseline approaches, with successful generalisation to unseen objects in both simulated and real-world environments.
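The value function-guided refinement described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_value`, `refine_constraint_pose`, the 3-D pose, the step size, and the number of steps are all assumptions made for illustration, and a finite-difference gradient stands in for the backpropagated gradient of a learned value network. The sketch only shows the core idea of nudging the constraint policy's output in the direction that increases the predicted value.

```python
from typing import Callable, List

Pose = List[float]


def numerical_gradient(value_fn: Callable[[Pose], float], pose: Pose, eps: float = 1e-5) -> Pose:
    """Central-difference gradient of value_fn with respect to the pose."""
    grad = []
    for i in range(len(pose)):
        up = list(pose)
        down = list(pose)
        up[i] += eps
        down[i] -= eps
        grad.append((value_fn(up) - value_fn(down)) / (2 * eps))
    return grad


def refine_constraint_pose(
    value_fn: Callable[[Pose], float],
    pose: Pose,
    step_size: float = 0.1,
    num_steps: int = 50,
) -> Pose:
    """Gradient ascent on the value estimate, starting from the constraint policy's output."""
    refined = list(pose)
    for _ in range(num_steps):
        grad = numerical_gradient(value_fn, refined)
        refined = [p + step_size * g for p, g in zip(refined, grad)]
    return refined


# Hypothetical stand-in for a learned value function: it peaks at an
# (assumed) ideal support position for the stabilising arm.
TARGET = [0.4, -0.1, 0.2]


def toy_value(pose: Pose) -> float:
    return -sum((p - t) ** 2 for p, t in zip(pose, TARGET))


initial_pose = [0.0, 0.0, 0.0]  # pretend output of the frozen constraint policy
refined_pose = refine_constraint_pose(toy_value, initial_pose)
print("value before:", toy_value(initial_pose))
print("value after: ", toy_value(refined_pose))
```

In a real training loop the gradient would more plausibly come from automatic differentiation through the value network, and the refined pose would typically be kept close to the constraint policy's original prediction.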

Paper Structure

This paper contains 43 sections, 5 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: We introduce COMBO-Grasp, a bimanual robotic system that uses two coordinated policies to address the challenges of grasping objects when the grasp pose is occluded. The system leverages a constraint policy that predicts the pose for the right arm to support the left arm during manipulation. Task execution unfolds in the following sequence: (1) the right arm moves to the predicted support pose using motion planning, (2) the left arm uses the constraint to grasp the target object, (3) the right arm returns to its home position, and (4) the left arm lifts the target object to complete the task.
  • Figure 2: Real-world system setup. The system comprises two Kinova Gen3 robotic arms mounted perpendicularly to the main body. Each arm is equipped with a Robotiq 2F-85 gripper. To enhance grasping performance, the grippers are fitted with soft fingertips (Chi et al., 2024) instead of the standard ones. Visual observations are captured using a third-person RealSense L515 camera positioned in front of the robot.
  • Figure 3: Method Overview. (1) COMBO-Grasp first collects a synthetic dataset in a self-supervised manner in simulation to train the state-based teacher constraint policy. The teacher constraint policy outputs an end-effector pose for the right arm, given the privileged information available in the simulation. (2) The weights of the trained teacher constraint policy are frozen, and a teacher grasping policy, $\pi_{teacher}$, is trained using RL from privileged information in simulation. To maximise performance, we propose value function-guided policy coordination: the output of the constraint policy is refined to maximise the predicted value, using gradients propagated from a value function that is trained jointly with the grasping policy. (3) The teacher grasping policy and the teacher constraint policy are distilled into vision-based student policies that leverage point cloud observations, robot proprioceptive states, and, optionally, a desired grasp pose to address bimanual occluded grasping tasks in real-world environments.
  • Figure 4: Student policy architecture. We use DP3 (Ze et al., 2024) as the backbone for the grasping policy. The DP3 encoder processes the scene point cloud, and its output is concatenated with a state feature vector obtained by a multi-layer perceptron (MLP). The resulting concatenated vector serves as the conditioning input for the diffusion-based policy. Similarly, the constraint student policy employs the DP3 encoder and an MLP, but it takes a desired grasp pose as input. Unlike the grasping policy, the constraint student policy employs a Gaussian Mixture Model (GMM)-based policy.
  • Figure 5: Teacher policy training. We run $3$ seeds for each method, and the shaded region represents the standard deviation. COMBO-Grasp significantly outperforms competitive baselines in both performance and sample efficiency.
  • ...and 6 more figures
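
The four-stage execution sequence from the Figure 1 caption can be sketched as a small orchestration routine. This is an illustrative outline only: `run_occluded_grasp`, `dummy_constraint_policy`, and the example pose are hypothetical names and values, and the motion planner, grasping policy, and robot interfaces are replaced by log messages.

```python
from typing import Callable, List, Sequence


def run_occluded_grasp(
    predict_support_pose: Callable[[], Sequence[float]],
    log: List[str],
) -> List[str]:
    """Record the four stages of one episode, in the order given in Figure 1."""
    # (1) The constraint policy predicts a support pose; the right arm
    #     would reach it via motion planning.
    support_pose = list(predict_support_pose())
    log.append(f"right arm: plan and move to support pose {support_pose}")
    # (2) The left arm uses the constraint to grasp the target object.
    log.append("left arm: grasp the object against the constraint")
    # (3) The right arm retreats once the grasp is secured.
    log.append("right arm: return to home position")
    # (4) The left arm lifts the object to complete the task.
    log.append("left arm: lift the target object")
    return log


def dummy_constraint_policy() -> Sequence[float]:
    # Hypothetical stand-in for the learned constraint policy output (x, y, z).
    return [0.45, -0.20, 0.05]


steps = run_occluded_grasp(dummy_constraint_policy, [])
for index, step in enumerate(steps, start=1):
    print(f"({index}) {step}")
```

Keeping the constraint prediction behind a callable mirrors the separation between the constraint policy and the grasping policy, so either could be swapped for a learned model without changing the sequence.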