Keypoint Abstraction using Large Models for Object-Relative Imitation Learning

Xiaolin Fang, Bo-Ruei Huang, Jiayuan Mao, Jasmine Shone, Joshua B. Tenenbaum, Tomás Lozano-Pérez, Leslie Pack Kaelbling

TL;DR

KALM distills robust and consistent keypoints across views and objects by generating proposals using LMs and verifying them against a small set of robot demonstration data, enabling robots to generalize effectively across varying object poses, camera views, and object instances with similar functional shapes.

Abstract

Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have proven effective as a succinct way to capture essential object features and to establish a reference frame for action prediction, enabling data-efficient learning of robot skills. However, their reliance on manual design and additional human labels limits their scalability. In this paper, we propose KALM, a framework that leverages large pre-trained vision-language models (LMs) to automatically generate task-relevant and cross-instance-consistent keypoints. KALM distills robust and consistent keypoints across views and objects by generating proposals using LMs and verifying them against a small set of robot demonstration data. Based on the generated keypoints, we can train keypoint-conditioned policy models that predict actions in keypoint-centric frames, enabling robots to generalize effectively across varying object poses, camera views, and object instances with similar functional shapes. Our method demonstrates strong performance in the real world, adapting to different tasks and environments from only a handful of demonstrations while requiring no additional labels. Website: https://kalm-il.github.io/
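To illustrate why predicting actions in a keypoint-centric frame helps with varying object poses, here is a minimal sketch. It assumes a translation-only frame whose origin is the centroid of the detected 3D keypoints; this choice of frame, and the helper names `to_keypoint_frame` and `to_world_frame`, are hypothetical and are not taken from the paper.

```python
import numpy as np

def to_keypoint_frame(waypoints_world, keypoints_world):
    """Express (T, 3) waypoints relative to the centroid of (K, 3) keypoints.

    Assumption: a translation-only frame; the paper's frame may differ.
    """
    origin = np.asarray(keypoints_world, dtype=float).mean(axis=0)
    return np.asarray(waypoints_world, dtype=float) - origin

def to_world_frame(waypoints_rel, keypoints_world):
    """Map keypoint-relative waypoints back into the world frame."""
    origin = np.asarray(keypoints_world, dtype=float).mean(axis=0)
    return np.asarray(waypoints_rel, dtype=float) + origin

# Two keypoints on an object and a two-step end-effector trajectory (made-up values).
keypoints = np.array([[0.5, 0.0, 0.1], [0.7, 0.0, 0.1]])
waypoints = np.array([[0.6, 0.0, 0.3], [0.6, 0.0, 0.15]])
relative = to_keypoint_frame(waypoints, keypoints)

# Move the object (and the matching demonstration) by the same offset:
# the keypoint-relative trajectory is unchanged.
offset = np.array([0.2, -0.1, 0.0])
relative_moved = to_keypoint_frame(waypoints + offset, keypoints + offset)
```

Because `relative` and `relative_moved` are identical, a policy trained on the relative trajectory sees the same target regardless of where the object sits; `to_world_frame` then recovers an executable trajectory from the keypoints detected in a new scene.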

Paper Structure

This paper contains 23 sections, 3 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Keypoint Abstraction using Large Models for Object-Relative Imitation Learning (KALM). KALM is a framework that distills keypoint abstraction by prompting and verifying keypoint proposals from large pre-trained models using a small amount of robot demonstration data, which is used to train a keypoint-conditioned policy model. Our method demonstrates strong generalization on multiple real-world manipulation tasks with only 10 demonstrations and no additional labeling effort.
  • Figure 2: KALM overview. (a) Keypoint distillation. Given a demonstration video and a task description, we prompt a VLM to generate a coarse-grained region proposal, which is refined into a fine-grained point set via image segmentation models and VLMs. We use a keypoint detection function $\phi$ to identify keypoint correspondences across a handful of demonstration trajectories. The final keypoint set is selected based on correspondence consistency verification. These keypoints are used for training a keypoint-conditioned action model. (b) Inference time. Given a new scene, the keypoint detection function $\phi$ localizes the distilled keypoints. The learned keypoint-conditioned action prediction model generates an object-relative end-effector trajectory based on the keypoint positions and features.
  • Figure 3: Testing tasks in the Meta-World simulator [yu2019meta]. We evaluate on 5 tasks in Meta-World with randomized camera and object poses, necessitating the generalization of policies across observational changes. Keypoints are marked in pink for visualization.
  • Figure 4: Data efficiency. We measure the average success rate across all 5 tasks, with the number of demonstrations increasing from 10 to 500. Our method, KALM, demonstrates superior data efficiency compared to all baselines.
  • Figure 5: Testing tasks in the real world. We evaluate different methods on three tasks in the real world with different objects at different poses, and with different camera angles. The testing assets are illustrated in the figure.
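
The correspondence consistency verification mentioned in the Figure 2 caption can be sketched roughly as follows. This is only an illustration under stated assumptions: `detect` is a hypothetical stand-in for the detection function $\phi$ that matches a candidate's feature vector against a dense per-pixel feature map by cosine similarity, and `select_consistent` keeps a candidate only if its best match in every demonstration exceeds a threshold. The feature source, the scoring rule, and the threshold value are assumptions, not details given in the source.

```python
import numpy as np

def detect(candidate_feature, feature_map):
    """Hypothetical stand-in for phi: best-matching pixel and its cosine similarity.

    feature_map has shape (H, W, D); all features are assumed to be non-zero.
    """
    h, w, d = feature_map.shape
    flat = feature_map.reshape(-1, d)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    query = candidate_feature / np.linalg.norm(candidate_feature)
    sims = flat @ query
    best = int(np.argmax(sims))
    return divmod(best, w), float(sims[best])

def select_consistent(candidates, demo_feature_maps, threshold=0.9):
    """Keep indices of candidates that are matched well in every demonstration."""
    kept = []
    for idx, feat in enumerate(candidates):
        scores = [detect(feat, fmap)[1] for fmap in demo_feature_maps]
        if min(scores) >= threshold:  # assumed rule: worst-case match must be good
            kept.append(idx)
    return kept

# Toy data: one-hot features so the outcome is easy to follow.
basis = np.eye(4)
good, bad = basis[0], basis[3]

def make_map(planted_at):
    fmap = np.tile(basis[1], (2, 2, 1))  # background feature at every pixel
    fmap[planted_at] = good              # the "good" keypoint appears once
    return fmap

demos = [make_map((0, 1)), make_map((1, 0)), make_map((1, 1))]
kept = select_consistent([good, bad], demos)
```

In this toy setup the `good` candidate is found in all three demonstration maps and is kept, while `bad` never appears and is discarded, which mirrors the idea of filtering proposals by how reliably they can be re-localized across demonstrations.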