Learning Generalizable Feature Fields for Mobile Manipulation
Ri-Zhao Qiu, Yafei Hu, Yuchen Song, Ge Yang, Yang Fu, Jianglong Ye, Jiteng Mu, Ruihan Yang, Nikolay Atanasov, Sebastian Scherer, Xiaolong Wang
TL;DR
GeFF introduces a scene-level generalizable feature field that serves as a unified representation for navigation and manipulation on mobile robots. It pre-trains a neural scene encoder through novel view synthesis and then aligns the learned scene priors with natural language through CLIP feature distillation, enabling real-time open-world perception. The approach supports object- and part-level manipulation as well as semantics-aware navigation, and it outperforms point-based baselines in runtime and in the storage-accuracy trade-off. Demonstrated on a quadrupedal robot equipped with a manipulator, GeFF provides zero-shot open-world perception for mobile manipulation.
Abstract
An open problem in mobile manipulation is how to represent objects and scenes in a unified manner so that robots can use the same representation for both navigation and manipulation. The latter requires capturing intricate geometry while understanding fine-grained semantics, whereas the former involves capturing the complexity inherent in an expansive physical scale. In this work, we present GeFF (Generalizable Feature Fields), a scene-level generalizable neural feature field that acts as a unified representation for both navigation and manipulation and runs in real time. To do so, we treat generative novel view synthesis as a pre-training task, and then align the resulting rich scene priors with natural language via CLIP feature distillation. We demonstrate the effectiveness of this approach by deploying GeFF on a quadrupedal robot equipped with a manipulator. We quantitatively evaluate GeFF's ability to perform open-vocabulary object-/part-level manipulation and show that GeFF outperforms point-based baselines in runtime and in storage-accuracy trade-offs. We also provide qualitative examples of semantics-aware navigation and articulated object manipulation.
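To make the idea of aligning a feature field with language more concrete, the following is a minimal sketch, not the GeFF implementation. It assumes that each 3D point carries a feature vector living in the same space as a text embedding, that distillation can be expressed as a cosine-distance loss, and that an open-vocabulary query reduces to a nearest-neighbor lookup by cosine similarity. The abstract does not specify the loss or the query procedure, so the function names (`cosine_similarity`, `distillation_loss`, `query_points`) and the toy three-dimensional vectors are purely illustrative stand-ins for real CLIP features.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distillation_loss(field_features, teacher_features):
    """Mean cosine distance between field features and teacher features.

    Assumption: the teacher features would come from a CLIP image encoder;
    here they are plain lists of floats.
    """
    distances = [
        1.0 - cosine_similarity(f, t)
        for f, t in zip(field_features, teacher_features)
    ]
    return sum(distances) / len(distances)


def query_points(point_features, text_embedding):
    """Score every point against a text embedding and return the best index.

    Assumption: the text embedding would come from a CLIP text encoder.
    """
    scores = [cosine_similarity(f, text_embedding) for f in point_features]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Toy stand-ins for per-point features and a language query embedding.
point_features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
text_embedding = [0.0, 2.0, 0.0]

best_index, scores = query_points(point_features, text_embedding)
print("best matching point:", best_index)
print("per-point scores:", scores)
print("loss against itself:", distillation_loss(point_features, point_features))
```

In this toy setup the second point is selected because its feature direction matches the query exactly. In a real system the selected points would then be passed to a downstream grasp or navigation planner, which is outside the scope of this sketch.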
