Generalizable Coarse-to-Fine Robot Manipulation via Language-Aligned 3D Keypoints

Jianshu Hu, Lidi Wang, Shujia Li, Yunpeng Jiang, Xiao Li, Paul Weng, Yutong Ban

TL;DR

This work proposes the Coarse-to-fine Language-Aligned manipulation Policy (CLAP), a framework that generalizes to novel instructions and environments by integrating three key components: 1) task decomposition, 2) VLM fine-tuning for 3D keypoint prediction, and 3) a 3D-aware representation.

Abstract

Hierarchical coarse-to-fine policies, in which a coarse branch predicts a region of interest to guide a fine-grained action predictor, have demonstrated significant potential in robotic 3D manipulation, notably by improving sample efficiency and enabling more precise manipulation. However, even when augmented with pre-trained models, these hierarchical policies still suffer from limited generalization. To enhance generalization to novel instructions and environment variations, we propose the Coarse-to-fine Language-Aligned manipulation Policy (CLAP), a framework that integrates three key components: 1) task decomposition, 2) VLM fine-tuning for 3D keypoint prediction, and 3) a 3D-aware representation. Through comprehensive experiments in simulation and on a real robot, we demonstrate its superior generalization capability. Specifically, on GemBench, a benchmark designed for evaluating generalization, our approach achieves a 12% higher average success rate than the SOTA method while using only 1/5 of the training trajectories. In real-world experiments, our policy, trained on only 10 demonstrations, successfully generalizes to novel instructions and environments.

Paper Structure

This paper contains 30 sections, 4 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Intuition of CLAP. Our method achieves strong generalization ability by decomposing tasks into step-wise language instructions, each aligned with a 3D keypoint.
  • Figure 2: Overview of CLAP. We propose a novel coarse-to-fine 3D manipulation policy comprising a coarse task planner and a fine-grained action predictor. The coarse task planner reasons about the task plan and the positions of task-related objects to generate language-aligned 3D keypoints. The fine-grained action predictor fuses the corresponding step instruction with a 3D-aware visual representation from refined observations to predict the final action.
  • Figure 3: Overview of the tasks in real-world experiments. There are four training tasks: put shape in shape sorter, put block in cup, open drawer, and put block in drawer. We evaluate the training tasks under different visual perturbations, as well as novel tasks derived from the training tasks.
  • Figure 4: Overview of the evaluation tasks in real-world experiments. We evaluate all eight tasks across different variations and record the success rate.
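
The coarse-to-fine data flow described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `Step`, `crop_around`, `run_coarse_to_fine`, `toy_planner`, and `toy_fine_predictor` are hypothetical, and so is the `radius` parameter. The sketch also assumes that the "refined observation" is a crop of the point cloud around each keypoint, which the summary above does not confirm. The planner and predictor are toy stand-ins for the fine-tuned VLM and the learned action predictor.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class Step:
    """One step of a decomposed task: a language instruction and its 3D keypoint."""
    instruction: str
    keypoint: Point


def crop_around(points: Sequence[Point], center: Point, radius: float) -> List[Point]:
    """Keep the points inside an axis-aligned cube of half-width `radius` around `center`.

    Assumption: this stands in for the 'refined observation' given to the fine branch.
    """
    return [p for p in points if all(abs(p[i] - center[i]) <= radius for i in range(3))]


def run_coarse_to_fine(
    task: str,
    points: Sequence[Point],
    planner: Callable[[str, Sequence[Point]], List[Step]],
    fine_predictor: Callable[[str, List[Point]], Point],
    radius: float = 0.1,
) -> List[Point]:
    """Coarse stage: plan steps with keypoints. Fine stage: one action per step."""
    actions = []
    for step in planner(task, points):
        region = crop_around(points, step.keypoint, radius)
        actions.append(fine_predictor(step.instruction, region))
    return actions


def toy_planner(task: str, points: Sequence[Point]) -> List[Step]:
    """Hypothetical stand-in for the coarse task planner (hard-coded plan)."""
    return [
        Step("reach the block", (0.0, 0.0, 0.0)),
        Step("move above the cup", (1.0, 1.0, 0.0)),
    ]


def toy_fine_predictor(instruction: str, region: List[Point]) -> Point:
    """Hypothetical stand-in for the fine predictor: returns the region centroid."""
    if not region:
        raise ValueError(f"no points near the keypoint for step: {instruction!r}")
    n = len(region)
    return (
        sum(p[0] for p in region) / n,
        sum(p[1] for p in region) / n,
        sum(p[2] for p in region) / n,
    )


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (1.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
    for action in run_coarse_to_fine("put block in cup", cloud, toy_planner, toy_fine_predictor):
        print(action)
```

The point of the structure is that the fine stage only sees the step instruction and the observation near the keypoint, so either stand-in can be swapped out without changing the loop.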