KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data

Grace Tang, Swetha Rajkumar, Yifei Zhou, Homer Rich Walke, Sergey Levine, Kuan Fang

TL;DR

This work proposes Keypoint Affordance Learning from Imagined Environments (KALIE), which adapts pre-trained Vision Language Models (VLMs) for robotic control in a scalable manner and can learn to robustly solve new manipulation tasks with unseen objects given only 50 example data points.

Abstract

Building generalist robotic systems involves effectively endowing robots with the capabilities to handle novel objects in an open-world setting. Inspired by the advances of large pre-trained models, we propose Keypoint Affordance Learning from Imagined Environments (KALIE), which adapts pre-trained Vision Language Models (VLMs) for robotic control in a scalable manner. Instead of directly producing motor commands, KALIE controls the robot by predicting point-based affordance representations based on natural language instructions and visual observations of the scene. The VLM is trained on 2D images with affordances labeled by humans, bypassing the need for training data collected on robotic systems. Through an affordance-aware data synthesis pipeline, KALIE automatically creates large amounts of high-quality training data from a limited set of examples manually collected by humans. We demonstrate that KALIE can learn to robustly solve new manipulation tasks with unseen objects given only 50 example data points. Compared to baselines using pre-trained VLMs, our approach consistently achieves superior performance.

Paper Structure

This paper contains 17 sections, 2 equations, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: Overview of KALIE. By fine-tuning a pre-trained VLM, KALIE predicts the point-based affordance representation given the input task instruction and visual observation. Based on limited example data collected by humans, KALIE generates synthetic data with high diversity while preserving the task semantics and the keypoint annotations. The fine-tuned VLM can robustly generate motions for tasks with unseen objects and arrangements.
  • Figure 2: Affordance-aware data synthesis. KALIE employs the inpainting capability of a diffusion model to generate synthetic data. To diversify the scenes while staying faithful to the task semantics and the keypoint annotations, KALIE computes and transforms the context, such as soft edges, to guide the generation process.
  • Figure 3: Synthetic data examples. In each column, we show example synthetic images generated based on an example image for each task. The original and transformed point-based affordances are plotted on top of the images.
  • Figure 4: Comparisons with alternative synthesis algorithms. KALIE generates much more robust samples than generation that is not conditioned on the original images or the context.
  • Figure 5: Comparisons with vanilla data augmentation. The mean squared error (MSE) of each keypoint affordance on a test set of novel objects is reported for the table sweeping task.
  • ...and 2 more figures
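
The captions for Figures 2, 3, and 5 mention two operations that a small example may make more concrete: transforming keypoint annotations alongside the scene context, and scoring predictions with a per-keypoint MSE. The following is a minimal sketch under stated assumptions, not the paper's implementation. The function names, the keypoint names ("grasp", "function"), the use of a single rotate/scale/shift transform, and the choice to define the error as the squared pixel distance averaged over samples are all hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Keypoints = Dict[str, Point]  # hypothetical format: keypoint name -> (x, y) in pixels


def transform_keypoints(
    keypoints: Keypoints,
    angle_deg: float,
    scale: float,
    shift: Point,
    center: Point,
) -> Keypoints:
    """Rotate and scale each keypoint about `center`, then translate by `shift`.

    Assumption: the same transform is applied to the image context, so the
    labels stay aligned with the synthesized scene.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    dx, dy = shift
    out: Keypoints = {}
    for name, (x, y) in keypoints.items():
        rx, ry = x - cx, y - cy  # coordinates relative to the rotation center
        out[name] = (
            cx + scale * (cos_t * rx - sin_t * ry) + dx,
            cy + scale * (sin_t * rx + cos_t * ry) + dy,
        )
    return out


def per_keypoint_mse(
    predictions: List[Keypoints], targets: List[Keypoints]
) -> Dict[str, float]:
    """Average squared pixel distance for each keypoint name over all samples."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal in length")
    errors: Dict[str, float] = {}
    for name in targets[0]:
        total = 0.0
        for pred, tgt in zip(predictions, targets):
            px, py = pred[name]
            tx, ty = tgt[name]
            total += (px - tx) ** 2 + (py - ty) ** 2
        errors[name] = total / len(targets)
    return errors


if __name__ == "__main__":
    labels = {"grasp": (120.0, 80.0), "function": (200.0, 95.0)}
    moved = transform_keypoints(
        labels, angle_deg=15.0, scale=1.1, shift=(10.0, -5.0), center=(160.0, 90.0)
    )
    print(moved)
    print(per_keypoint_mse([moved], [labels]))
```

Reporting the error separately for each keypoint, as in the Figure 5 caption, shows which affordance point a model localizes poorly, which a single aggregate number would hide.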