Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation

William Shen; Ge Yang; Alan Yu; Jansen Wong; Leslie Pack Kaelbling; Phillip Isola

TL;DR

This work applies Distilled Feature Fields (DFFs) to robotic manipulation, combining accurate 3D geometry with rich 2D semantic priors by distilling features from vision and vision-language models into a NeRF-based 3D field. The proposed framework, F3RM, enables few-shot grasping and open-text language-guided manipulation using dense patch-level CLIP and DINO features; MaskCLIP is used to obtain the dense CLIP features, and hierarchical hash grids accelerate the field. The method generalizes to unseen object categories and cluttered scenes, and objects can be designated for manipulation via free-text queries. By integrating semantic priors with accurate 3D geometry, the approach moves toward practical open-world robotics, achieving zero-shot or few-shot generalization at reasonable inference speed, with remaining limitations in grasp precision and data-collection efficiency.

Abstract

Self-supervised and language-supervised image models contain rich knowledge of the world that is important for generalization. Many robotic tasks, however, require a detailed understanding of 3D geometry, which is often lacking in 2D image features. This work bridges this 2D-to-3D gap for robotic manipulation by leveraging distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models. We present a few-shot learning method for 6-DOF grasping and placing that harnesses these strong spatial and semantic priors to achieve in-the-wild generalization to unseen objects. Using features distilled from a vision-language model, CLIP, we present a way to designate novel objects for manipulation via free-text natural language, and demonstrate its ability to generalize to unseen expressions and novel categories of objects.
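As a rough illustration of what distilling 2D features into a NeRF-based field can look like, the sketch below alpha-composites per-sample feature vectors along one camera ray using standard NeRF-style quadrature, then compares the rendered feature to a 2D teacher feature with a squared-error loss. The function names, the array shapes, and the choice of squared error are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def render_feature(densities, deltas, features):
    """Alpha-composite per-sample features along a single ray.

    densities: (S,) non-negative volume densities at S samples (assumed inputs)
    deltas:    (S,) distances between consecutive samples
    features:  (S, D) feature vectors predicted by the field at each sample
    """
    alpha = 1.0 - np.exp(-densities * deltas)  # opacity of each sample
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return weights @ features, weights

def distillation_loss(rendered, target):
    """Squared error between a rendered feature and the 2D teacher feature."""
    return float(np.sum((rendered - target) ** 2))

# Toy ray: the first sample is effectively opaque, so it dominates the render.
densities = np.array([1e9, 1.0, 1.0])
deltas = np.ones(3)
features = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
rendered, weights = render_feature(densities, deltas, features)
loss = distillation_loss(rendered, np.array([1.0, 0.0]))
```

In a real pipeline the loss would be minimized over many rays with an optimizer, jointly with the usual color reconstruction term; the toy ray here only shows how geometry (density) decides which 3D samples receive a given 2D feature.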

Paper Structure (53 sections, 5 equations, 15 figures, 3 tables, 2 algorithms)

Introduction
Problem Formulation
Few-Shot Manipulation.
Open-Text Language-Guided Manipulation.
Feature Fields for Robotic Manipulation (F3RM)
Feature Field Distillation
Feature Distillation.
Extracting Dense Visual Features from CLIP.
Representing 6-DOF Poses with Feature Fields
Inferring 6-DOF Poses.
Pose Optimization.
Open-Text Language-Guided Manipulation
Retrieving Relevant Demonstrations.
Initializing Grasp Proposals.
Language-Guided Grasp Pose Optimization.
...and 38 more sections

Figures (15)

Figure 1: Distilled Feature Fields Enable Open-Ended Manipulation. (1) Robot uses a selfie stick to scan RGB images of the scene (camera frustums shown). (2) Extract patch-level dense features for the images from a 2D foundation model, and distill them into a feature field (PCA shown) along with modeling a NeRF. (3) We can query CLIP feature fields with language to generate heatmaps and infer 6-DOF grasps on novel objects given only ten demonstrations.
Table 1: Success rates on grasping and placing tasks. We compare the success rates over ten evaluation scenes given two demonstrations for each task. We consider a run successful if the robot grasps or places the correct corresponding object part for the task.
Figure 2: Representing 6-DOF Poses. (a) Recording the gripper pose $\mathbf{T}^*$ in virtual reality (VR) on an example mug. (b) We approximate the continuous local field via a fixed set of query points in the gripper's canonical frame. (c) We concatenate feature vectors at these query points, then average over $n$ (we use $n = 2$) demonstrations. This gives a task embedding $\mathbf{Z}_M$ for the task $M$.
Figure 3: Pipeline for Language-Guided Manipulation. (a) Encode the language query with CLIP, and compare its similarity to the average query point features over a set of demos. The mug lip demos have the highest similarity to "Pick up the Bowl". (b) Generate and optimize grasp proposals using the CLIP feature field by minimizing $\mathcal{J}_\text{lang}$. We use the selected demo from (a) in $\mathcal{J}_\text{pose}$, and compute the language-guidance weight with the text features and average query point features.
Table 3: Feature Map Resolutions. Resolutions of the features output by the vision models given a $1280 \times 720$ RGB image.
...and 10 more figures
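The captions for Figure 2 and Figure 3(a) describe two small computations: building a task embedding by concatenating features at fixed gripper-frame query points and averaging over demonstrations, and retrieving demonstrations by comparing a text feature with the average query-point features. The sketch below is a minimal version of both, assuming a hypothetical `field` callable that maps world-frame points of shape (N, 3) to features of shape (N, D), 4x4 homogeneous gripper poses, and cosine similarity as the comparison. None of these names or interfaces come from the paper's code.

```python
import numpy as np

def pose_features(field, pose, query_points):
    """Features at gripper-frame query points after moving them by a 4x4 pose."""
    world = query_points @ pose[:3, :3].T + pose[:3, 3]
    return field(world)  # (N, D)

def task_embedding(field, demo_poses, query_points):
    """Concatenate query-point features per demo, then average over demos."""
    per_demo = [pose_features(field, T, query_points).reshape(-1)
                for T in demo_poses]
    return np.mean(per_demo, axis=0)  # (N * D,)

def retrieve_task(text_feature, task_query_features):
    """Return the task whose mean query-point feature best matches the text."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    scores = {name: cosine(text_feature, feats.mean(axis=0))
              for name, feats in task_query_features.items()}
    return max(scores, key=scores.get), scores

# Toy field that returns the point coordinates themselves as "features".
toy_field = lambda pts: pts
query_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
shifted = np.eye(4)
shifted[:3, 3] = [0.0, 0.0, 2.0]
Z = task_embedding(toy_field, [np.eye(4), shifted], query_points)

# Hypothetical task names and 2-D features, for illustration only.
tasks = {
    "mug_lip": np.array([[1.0, 0.0], [0.9, 0.1]]),
    "mug_handle": np.array([[0.0, 1.0], [0.1, 0.9]]),
}
best, scores = retrieve_task(np.array([1.0, 0.0]), tasks)
```

With two demonstrations (the caption's $n = 2$), `Z` is the mean of two flattened feature stacks, and `best` is the task whose demos lie closest to the text feature in the shared embedding space.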
