GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields

Yanjie Ze, Ge Yan, Yueh-Hua Wu, Annabella Macaluso, Yuying Ge, Jianglong Ye, Nicklas Hansen, Li Erran Li, Xiaolong Wang

TL;DR

GNFactor tackles multi-task robotic manipulation from visual observations in unstructured environments by learning a generalizable 3D semantic representation. It combines Generalizable Neural Feature Fields to reconstruct a 3D voxel scene with diffusion-based vision-language embeddings and a Perceiver Transformer to condition decisions on language instructions. The approach demonstrates strong generalization to unseen tasks and scenes with limited demonstrations, outperforming state-of-the-art baselines on RLBench and real-robot experiments. This work highlights the value of integrating 3D semantic representations with language-conditioned policies to enable robust, scalable real-world manipulation.

Abstract

It is a long-standing problem in robotics to develop agents capable of executing diverse manipulation tasks from visual observations in unstructured real-world environments. To achieve this goal, the robot needs to have a comprehensive understanding of the 3D structure and semantics of the scene. In this work, we present $\textbf{GNFactor}$, a visual behavior cloning agent for multi-task robotic manipulation with $\textbf{G}$eneralizable $\textbf{N}$eural feature $\textbf{F}$ields. GNFactor jointly optimizes a generalizable neural field (GNF) as a reconstruction module and a Perceiver Transformer as a decision-making module, leveraging a shared deep 3D voxel representation. To incorporate semantics in 3D, the reconstruction module utilizes a vision-language foundation model ($\textit{e.g.}$, Stable Diffusion) to distill rich semantic information into the deep 3D voxel. We evaluate GNFactor on 3 real robot tasks and perform detailed ablations on 10 RLBench tasks with a limited number of demonstrations. We observe a substantial improvement of GNFactor over current state-of-the-art methods in seen and unseen tasks, demonstrating the strong generalization ability of GNFactor. Our project website is https://yanjieze.com/GNFactor/ .
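
As an illustration only, the joint optimization described in the abstract could be written as a weighted sum of two losses computed from the same voxel features. The symbols below ($\theta$ for the shared voxel encoder parameters, $\mathcal{L}_{\text{action}}$ for the behavior cloning loss, $\mathcal{L}_{\text{recon}}$ for the neural field reconstruction loss, and the weight $\lambda$) are assumptions introduced for this sketch and are not taken from the text above.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{action}}(\theta) \;+\; \lambda \, \mathcal{L}_{\text{recon}}(\theta), \qquad \lambda \ge 0
```

Under this reading, gradients from both terms flow into the shared 3D voxel representation, so the reconstruction module shapes the features that the decision-making module consumes.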

Paper Structure

This paper contains 18 sections, 5 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Left: Three camera views used in the real robot setup to reconstruct the feature field generated by Stable Diffusion (Rombach et al., 2022). We segment the foreground feature for better illustration. Right: Three language-conditioned real robot tasks across two different kitchens.
  • Figure 2: Simulation environments and the real robot setup. We show the RGB observations for our 10 RLBench tasks in Figure (a), the sampled views for GNF in Figure (b), and the real robot setup in Figure (c).
  • Figure 3: Overview of GNFactor. GNFactor takes an RGB-D image as input and uses a voxel encoder to transform it into a deep 3D feature volume. This volume is then shared by two modules: volumetric rendering (Renderer) and robot action prediction (Perceiver). The two modules are trained jointly, which optimizes the shared features not only to reconstruct vision-language embeddings (Diffusion Feature) and other views (RGB), but also to estimate accurate Q-values ($Q_\text{trans}$, $Q_\text{rot}$, $Q_\text{collide}$, $Q_\text{open}$).
  • Figure 4: Main experiment results. We present the average success rates in both the multi-task and generalization settings across RLBench tasks and real robot tasks. The error bar represents one standard deviation. The number in the bracket denotes the number of tasks.
  • Figure 5: View synthesis of GNFactor in the real world. PSNR is computed for quantitative evaluation. The visualization with the action loss is relatively blurred compared to that without the action loss. The rendering is noisy mainly because, at inference, we do not optimize per step for rendering but perform only a single feedforward pass to obtain the feature.
  • ...and 6 more figures
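
The Figure 5 caption reports PSNR as its quantitative measure of view synthesis quality. The following is a minimal sketch of the standard PSNR definition, not the authors' evaluation code. The function name `psnr`, the flat-list image format, and the default `max_value=1.0` (pixel intensities scaled to [0, 1]) are assumptions made for illustration.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images.

    Both images are given as flat sequences of pixel intensities
    (an assumed format for this sketch); max_value is the largest
    possible intensity, e.g. 1.0 for floats or 255 for 8-bit images.
    """
    if len(rendered) != len(reference) or len(rendered) == 0:
        raise ValueError("images must be non-empty and of equal size")

    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)

    # Identical images have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Toy example: every pixel is off by 0.1 on a [0, 1] intensity scale.
    print(psnr([0.0, 0.0, 0.0], [0.1, 0.1, 0.1]))
```

Higher values indicate a rendering closer to the reference view, so a blurrier rendering like the one described in the caption would be expected to score lower than a sharper one of the same scene.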