Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation

William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, Phillip Isola

TL;DR

This work introduces Distilled Feature Fields (DFF) for robotic manipulation, which combine accurate 3D geometry with rich 2D semantic priors by distilling features from vision and vision-language models into a NeRF-based 3D field. The proposed framework, F3RM, enables few-shot grasping and open-text language-guided manipulation using dense patch-level CLIP and DINO features; hierarchical hash grids accelerate the field, and MaskCLIP supplies the dense features. It generalizes to unseen object categories and cluttered scenes, and objects can be designated for manipulation through free-text queries. By pairing semantic priors with accurate 3D geometry, the approach achieves zero-shot or few-shot generalization at reasonable inference speed, although grasp precision and data-collection efficiency remain limitations.

Abstract

Self-supervised and language-supervised image models contain rich knowledge of the world that is important for generalization. Many robotic tasks, however, require a detailed understanding of 3D geometry, which is often lacking in 2D image features. This work bridges this 2D-to-3D gap for robotic manipulation by leveraging distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models. We present a few-shot learning method for 6-DOF grasping and placing that harnesses these strong spatial and semantic priors to achieve in-the-wild generalization to unseen objects. Using features distilled from a vision-language model, CLIP, we present a way to designate novel objects for manipulation via free-text natural language, and demonstrate its ability to generalize to unseen expressions and novel categories of objects.

Paper Structure

This paper contains 53 sections, 5 equations, 15 figures, 3 tables, 2 algorithms.

Figures (15)

  • Figure 1: Distilled Feature Fields Enable Open-Ended Manipulation. (1) Robot uses a selfie stick to scan RGB images of the scene (camera frustums shown). (2) Extract patch-level dense features for the images from a 2D foundation model, and distill them into a feature field (PCA shown) along with modeling a NeRF. (3) We can query CLIP feature fields with language to generate heatmaps and infer 6-DOF grasps on novel objects given only ten demonstrations.
  • Figure 1: Success rates on grasping and placing tasks. We compare the success rates over ten evaluation scenes given two demonstrations for each task. We consider a run successful if the robot grasps or places the correct corresponding object part for the task.
  • Figure 2: Representing 6-DOF Poses. (a) Recording the gripper pose $\mathbf{T}^*$ in virtual reality (VR) on an example mug. (b) We approximate the continuous local field via a fixed set of query points in the gripper's canonical frame. (c) We concatenate feature vectors at these query points, then average over $n$ (we use $n = 2$) demonstrations. This gives a task embedding $\mathbf{Z}_M$ for the task $M$.
  • Figure 3: Pipeline for Language-Guided Manipulation. (a) Encode the language query with CLIP, and compare its similarity to the average query point features over a set of demos. The mug lip demos have the highest similarity to "Pick up the Bowl". (b) Generate and optimize grasp proposals using the CLIP feature field by minimizing $\mathcal{J}_\text{lang}$. We use the selected demo from (a) in $\mathcal{J}_\text{pose}$, and compute the language-guidance weight with the text features and average query point features.
  • Figure 3: Feature Map Resolutions. Resolutions of the features output by the vision models given a $1280 \times 720$ RGB image.
  • ...and 10 more figures
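
The captions for Figure 2 (Representing 6-DOF Poses) and Figure 3 (Pipeline for Language-Guided Manipulation) describe two steps that can be sketched in a few lines: building a task embedding by concatenating features at fixed query points and averaging over demonstrations, then picking the demonstration set whose average query point feature is most similar to the text feature. The following is a minimal sketch under stated assumptions, not the authors' implementation. Every name here (`task_embedding`, `mean_query_feature`, `select_task`) is hypothetical, each demonstration is assumed to be a callable that returns the feature vector at a query point already expressed in the gripper's canonical frame, and cosine similarity is assumed as the similarity measure. Toy lists stand in for real CLIP features.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
Feature = List[float]
# Hypothetical stand-in for a feature field sampled around one demo pose.
DemoField = Callable[[Point], Feature]


def task_embedding(demos: Sequence[DemoField],
                   query_points: Sequence[Point]) -> Feature:
    """Concatenate features at the query points, then average over demos."""
    per_demo = []
    for field in demos:
        flat: Feature = []
        for p in query_points:
            flat.extend(field(p))  # concatenation over query points
        per_demo.append(flat)
    n = len(per_demo)
    return [sum(column) / n for column in zip(*per_demo)]


def mean_query_feature(z: Feature, feature_dim: int) -> Feature:
    """Average a concatenated embedding over its query points."""
    chunks = [z[i:i + feature_dim] for i in range(0, len(z), feature_dim)]
    return [sum(column) / len(chunks) for column in zip(*chunks)]


def cosine(a: Feature, b: Feature) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def select_task(text_feature: Feature,
                embeddings: Dict[str, Feature],
                feature_dim: int) -> str:
    """Return the task whose averaged demo feature best matches the text."""
    return max(
        embeddings,
        key=lambda name: cosine(
            text_feature, mean_query_feature(embeddings[name], feature_dim)
        ),
    )


# Toy usage: two query points, 2-D features, two demos for one task.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
demos = [lambda p: [p[0], 1.0], lambda p: [p[0] + 2.0, 3.0]]
z_task = task_embedding(demos, points)
```

In the paper's pipeline the selected demonstrations then feed a pose optimization over grasp proposals; that step is omitted here because the captions do not give enough detail to reconstruct it.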