Learning Generalizable Feature Fields for Mobile Manipulation

Ri-Zhao Qiu, Yafei Hu, Yuchen Song, Ge Yang, Yang Fu, Jianglong Ye, Jiteng Mu, Ruihan Yang, Nikolay Atanasov, Sebastian Scherer, Xiaolong Wang

TL;DR

GeFF introduces a scene-level generalizable feature field that unifies navigation and manipulation for mobile robots. It pre-trains a neural scene encoder via novel-view synthesis and aligns its latent geometry and semantics with language through CLIP-based distillation, resulting in real-time open-world perception. The approach supports both object- and part-level manipulation and semantics-aware navigation, outperforming point-based baselines in runtime and storage-accuracy trade-offs. Demonstrated on a quadruped robot with a manipulator, GeFF offers zero-shot capabilities and practical impact for open-world robotic perception and manipulation.

Abstract

An open problem in mobile manipulation is how to represent objects and scenes in a unified manner so that robots can use both for navigation and manipulation. The latter requires capturing intricate geometry while understanding fine-grained semantics, whereas the former involves capturing the complexity inherent at an expansive physical scale. In this work, we present GeFF (Generalizable Feature Fields), a scene-level generalizable neural feature field that serves as a unified representation for both navigation and manipulation and runs in real time. To do so, we treat generative novel view synthesis as a pre-training task, and then align the resulting rich scene priors with natural language via CLIP feature distillation. We demonstrate the effectiveness of this approach by deploying GeFF on a quadrupedal robot equipped with a manipulator. We quantitatively evaluate GeFF's ability to perform open-vocabulary object- and part-level manipulation and show that GeFF outperforms point-based baselines in runtime and storage-accuracy trade-offs, with qualitative examples of semantics-aware navigation and articulated object manipulation.
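To make the idea of language-aligned features concrete, the following is a minimal sketch of an open-vocabulary query, assuming each 3D point carries a feature vector that lives in the same space as a text embedding. It is not the authors' implementation: the function name `cosine_similarity_scores` is hypothetical, and the random arrays stand in for features that would, in practice, come from the distilled feature field and a CLIP text encoder.

```python
import numpy as np

def cosine_similarity_scores(point_features: np.ndarray,
                             text_embedding: np.ndarray) -> np.ndarray:
    """Cosine similarity between each (N, D) point feature and a (D,) text embedding."""
    # Normalize both sides so the dot product equals the cosine of the angle.
    f_norm = np.linalg.norm(point_features, axis=1, keepdims=True)
    f = point_features / np.maximum(f_norm, 1e-8)
    t = text_embedding / max(float(np.linalg.norm(text_embedding)), 1e-8)
    return f @ t  # shape (N,), values in [-1, 1]

# Stand-in data (assumption): 1000 points with 512-dimensional features.
rng = np.random.default_rng(0)
point_features = rng.normal(size=(1000, 512))
text_embedding = rng.normal(size=512)

# Plant one point whose feature matches the query direction exactly.
point_features[7] = 3.0 * text_embedding

scores = cosine_similarity_scores(point_features, text_embedding)
best_point = int(np.argmax(scores))  # index of the point most similar to the query
```

The per-point scores can be rendered as a heatmap or thresholded to select the points that best match a text query.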

Paper Structure (10 sections, 6 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Pre-trained as a generalizable NeRF encoder, GeFF provides a unified scene representation to support robot tasks from an onboard RGB-D stream, offering both real-time geometric information for planning and language-grounded semantic query capability. Compared to LERF [Kerr2023-LERF], GeFF runs in real time without costly per-scene optimization, which enables many potential robotics applications. We demonstrate the efficacy of GeFF in open-world language-conditioned mobile manipulation. Feature visualizations are produced by running PCA on the high-dimensional feature vectors and normalizing the three principal components as RGB.
  • Figure 2: Generalizable NeRFs acquire geometric and semantic priors: RGB images are input views from ScanNet [dai2017-scannet]; color images are PCA visualizations of the feature volume, encoded by an RGB-D Gen-NeRF [yangfu2023-sceneprior] encoder and projected to the input camera view. Note how semantically similar structures acquire similar features.
  • Figure 3: GeFF compresses and refines multi-view observations: (a) single RGB view; (b) coarse 2D CLIP heatmap with query 'toy duck'; (c) 3D heatmap from GeFF with clean boundary reconstructed from compressed latent representation.
  • Figure 4: Qualitative results of GeFF for diverse tasks: (a) real-time update for dynamic person detection; (b) GeFF enables manipulation by parts; (c) entering a narrow doorway; (d) semantics-aware planning with affordance of 'lawns'. The results are animated in the supplementary video. Images in the second row are PCA visualizations of first-person GeFF features.
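
The PCA-based feature visualization mentioned in the captions above can be sketched as follows. This is a minimal illustration with NumPy under stated assumptions, not the authors' code: the function name `pca_to_rgb` is hypothetical, per-channel min-max scaling is one possible reading of "normalizing", and random features stand in for real feature-field outputs.

```python
import numpy as np

def pca_to_rgb(features: np.ndarray) -> np.ndarray:
    """Map (N, D) features to (N, 3) colors in [0, 1] via the top 3 principal components."""
    # Center the data so the SVD yields principal directions.
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T  # (N, 3)
    # Min-max normalize each component independently (an assumed normalization).
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    return (projected - lo) / np.maximum(hi - lo, 1e-8)

# Stand-in data (assumption): a 48x64 feature image with 128-dimensional features.
rng = np.random.default_rng(0)
height, width, dim = 48, 64, 128
feature_image = rng.normal(size=(height, width, dim))

rgb = pca_to_rgb(feature_image.reshape(-1, dim)).reshape(height, width, 3)
```

The resulting `rgb` array can be displayed with any image viewer; points with similar features receive similar colors, which is what makes such visualizations useful for inspecting learned semantics.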