Bi-Adapt: Few-shot Bimanual Adaptation for Novel Categories of 3D Objects via Semantic Correspondence

Jinxian Zhou, Ruihai Wu, Yiwei Liu, Yiwen Hou, Xunzhe Zhou, Checheng Yu, Licheng Zhong, Lin Shao

TL;DR

Bi-Adapt addresses the challenge of generalizing bimanual manipulation to unseen object categories with limited data. It combines a foundation-model-guided affordance transfer via semantic correspondence with a few-shot adaptation stage to refine action directions. The approach builds a supporting set of articulated objects, transfers contact-point affordances across categories using diffusion-feature-based semantic matching, and then fine-tunes the policy with a small number of demonstrations. Experiments in simulation and real-world settings show high success rates across multiple tasks and novel categories with restricted data, indicating practical efficiency and generalization potential.
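The affordance-transfer step can be illustrated with a minimal sketch. It assumes each image already has a per-pixel feature map (in the paper these come from a diffusion-based foundation model; here they are tiny hand-written lists) and that a contact point is transferred by picking the target pixel whose feature is most similar to the source contact point's feature. The cosine nearest-neighbor rule, the function names `cosine` and `transfer_contact_point`, and the toy data are assumptions for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def transfer_contact_point(src_feats, tgt_feats, src_point):
    """Map a source contact point (row, col) onto the target image.

    src_feats / tgt_feats are H x W x C nested lists of per-pixel features
    (hypothetical stand-ins for foundation-model features).
    Returns the best-matching target pixel and its similarity score.
    """
    row, col = src_point
    query = src_feats[row][col]
    best_score, best_point = -2.0, None  # cosine similarity is always >= -1
    for i, feat_row in enumerate(tgt_feats):
        for j, feat in enumerate(feat_row):
            score = cosine(query, feat)
            if score > best_score:
                best_score, best_point = score, (i, j)
    return best_point, best_score


# Toy 2 x 2 feature maps with 2-dimensional features (made-up values).
src = [[[1.0, 0.0], [0.0, 1.0]],
       [[1.0, 1.0], [-1.0, 0.0]]]
tgt = [[[0.0, 2.0], [-3.0, 0.1]],
       [[0.9, 0.1], [2.0, 2.0]]]

# Transfer one contact point per gripper from the source to the target object.
first_point, first_score = transfer_contact_point(src, tgt, (0, 1))
second_point, second_score = transfer_contact_point(src, tgt, (1, 1))
print(first_point, second_point)
```

In this sketch the two grippers' contact points are matched independently. The few-shot adaptation stage described above would then refine the action directions proposed at the transferred points.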

Abstract

Bimanual manipulation is imperative yet challenging for robots to execute complex tasks, requiring coordinated collaboration between two arms. However, existing methods for bimanual manipulation often rely on costly data collection and training, struggling to generalize to unseen objects in novel categories efficiently. In this paper, we present Bi-Adapt, a novel framework designed for efficient generalization for bimanual manipulation via semantic correspondence. Bi-Adapt achieves cross-category affordance mapping by leveraging the strong capability of vision foundation models. Fine-tuning with restricted data on novel categories, Bi-Adapt exhibits notable generalization to out-of-category objects in a zero-shot manner. Extensive experiments conducted in both simulation and real-world environments validate the effectiveness of our approach and demonstrate its high efficiency, achieving a high success rate on different benchmark tasks across novel categories with limited data. Project website: https://biadapt-project.github.io/

Paper Structure

This paper contains 24 sections, 3 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: (a) Simulation environment with gripper frames (red, green, blue axes for forward, leftward, upward); (b) Two $SE(3)$-parametrized action primitives, visualized by two key frames with red contact points and time steps from transparent to solid grippers; (c) Action definitions of different tasks.
  • Figure 2: Pipeline. (Left) We first train the Action Learning Network on the supporting set; the resulting affordance and action distributions serve as prior knowledge. (Middle) We then map contact points from training categories to novel categories using the foundation model. (Lower Right) Next, the pre-trained network proposes actions for the mapped contact-point pairs on novel categories and is fine-tuned on the interaction results. (Upper Right) Finally, the fine-tuned networks manipulate unseen instances from novel categories with better performance.
  • Figure 3: (Left) Manipulation results across tasks. (Right) Cross-category affordance generalization. Each group shows a source image with contact points (★: first gripper, ★: second gripper) and a target image with inferred points. Highlight intensity indicates correspondence.
  • Figure 4: Visualization of affordance comparison.
  • Figure 5: Visualization of the adaptation results.
  • ...and 3 more figures