Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation
William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, Phillip Isola
TL;DR
This work introduces Distilled Feature Fields (DFF) for robotic manipulation, which combine accurate 3D geometry with rich 2D semantic priors by distilling features from vision and vision-language models into a NeRF-based 3D field. The proposed framework, F3RM, enables few-shot grasping and open-text language-guided manipulation using dense patch-level CLIP and DINO features: MaskCLIP extracts the dense CLIP features, and hierarchical hash grids accelerate the feature field. It generalizes to unseen object categories and cluttered scenes, and free-text queries can designate which object to manipulate. The approach advances practical open-world robotics by pairing semantic priors with accurate 3D geometry, achieving zero-shot or few-shot generalization at reasonable inference speed, although grasp precision and data-collection efficiency remain limitations.
Abstract
Self-supervised and language-supervised image models contain rich knowledge of the world that is important for generalization. Many robotic tasks, however, require a detailed understanding of 3D geometry, which is often lacking in 2D image features. This work bridges this 2D-to-3D gap for robotic manipulation by leveraging distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models. We present a few-shot learning method for 6-DOF grasping and placing that harnesses these strong spatial and semantic priors to achieve in-the-wild generalization to unseen objects. Using features distilled from a vision-language model, CLIP, we present a way to designate novel objects for manipulation via free-text natural language, and demonstrate its ability to generalize to unseen expressions and novel categories of objects.
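To make the idea of designating objects by free text more concrete, the following is a minimal sketch of how a language query could be scored against features stored at 3D points. It is an illustration under stated assumptions, not the paper's implementation: the function names (`cosine_similarity`, `rank_points_by_text`), the use of negative text embeddings with a softmax, the temperature value, and the 3-dimensional toy vectors are all hypothetical. In a real system the point features would come from the distilled field and the text embeddings from CLIP's text encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_points_by_text(point_features, query_embedding,
                        negative_embeddings, temperature=0.1):
    """Score each 3D point by how strongly it matches the query text.

    Assumption: each point's score is the softmax probability of the query
    embedding against a set of negative (distractor) embeddings.
    """
    scores = []
    for feature in point_features:
        sims = [cosine_similarity(feature, query_embedding)]
        sims += [cosine_similarity(feature, neg) for neg in negative_embeddings]
        scores.append(softmax(sims, temperature)[0])
    return scores


# Toy stand-ins for distilled features at three 3D points (hypothetical values).
point_features = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.2, 0.9]]
query = [1.0, 0.0, 0.0]                          # stand-in for a text embedding
negatives = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # stand-ins for distractor texts

scores = rank_points_by_text(point_features, query, negatives)
best_index = max(range(len(scores)), key=scores.__getitem__)
print(best_index, [round(s, 3) for s in scores])
```

The highest-scoring points would then serve as candidate regions for a downstream grasp or place pose search; that step is omitted here.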
