D$^3$Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Rearrangement

Yixuan Wang, Mingtong Zhang, Zhuoran Li, Tarik Kelestemur, Katherine Driggs-Campbell, Jiajun Wu, Li Fei-Fei, Yunzhu Li

TL;DR

D$^3$Fields introduces a 3D, dynamic, semantic implicit descriptor that, through multi-view fusion of visual foundation models, maps arbitrary 3D coordinates to distance, semantic features, and instance probabilities without per-scene training. It enables zero-shot rearrangement by aligning current workspace descriptors with 2D goal images via differentiable fusion and a learned dynamics model used in MPC planning. The approach demonstrates strong generalization across objects, styles, and domains, and outperforms state-of-the-art implicit 3D representations in both efficiency and effectiveness. This work advances robotic manipulation by providing a flexible, goal-image-driven interface for zero-shot manipulation in real-world and simulated settings.

Abstract

Scene representation is a crucial design choice in robotic manipulation systems. An ideal representation is expected to be 3D, dynamic, and semantic to meet the demands of diverse manipulation tasks. However, previous works rarely offer all three properties at once. In this work, we introduce D$^3$Fields, dynamic 3D descriptor fields. These fields are implicit 3D representations that take in 3D points and output semantic features and instance masks. They can also capture the dynamics of the underlying 3D environments. Specifically, we project arbitrary 3D points in the workspace onto multi-view 2D visual observations and interpolate features derived from visual foundation models. The resulting fused descriptor fields allow for flexible goal specifications using 2D images with varied contexts, styles, and instances. To evaluate the effectiveness of these descriptor fields, we apply our representation to rearrangement tasks in a zero-shot manner. Through extensive evaluation in the real world and in simulation, we demonstrate that D$^3$Fields are effective for zero-shot generalizable rearrangement tasks. We also compare D$^3$Fields with state-of-the-art implicit 3D representations and show significant improvements in effectiveness and efficiency.
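The projection-and-interpolation step described in the abstract can be illustrated with a minimal sketch. It assumes a pinhole camera with world-to-camera extrinsics `R`, `t` and intrinsics `fx`, `fy`, `cx`, `cy`, and a per-pixel feature map stored as nested lists; the function names `world_to_pixel` and `bilinear` are hypothetical and are not taken from the paper or its code release. It is written in dependency-free Python for clarity rather than speed.

```python
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def world_to_pixel(x: Vec, R: Sequence[Vec], t: Vec,
                   fx: float, fy: float, cx: float, cy: float
                   ) -> Tuple[float, float, float]:
    """Project a world point into a pinhole camera; returns (u, v, depth)."""
    # Camera-frame coordinates: x_cam = R @ x + t (assumed world-to-camera convention).
    xc = [sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division followed by the intrinsics.
    return fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy, xc[2]


def bilinear(feat: List[List[Vec]], u: float, v: float) -> List[float]:
    """Bilinearly interpolate an H x W x C feature map (H, W >= 2) at pixel (u, v)."""
    h, w = len(feat), len(feat[0])
    if not (0 <= u <= w - 1 and 0 <= v <= h - 1):
        raise ValueError("projection falls outside the image")
    # Top-left corner of the enclosing pixel cell, clamped so the cell stays in bounds.
    u0, v0 = min(int(u), w - 2), min(int(v), h - 2)
    a, b = u - u0, v - v0
    corners = [
        ((1 - a) * (1 - b), feat[v0][u0]),
        (a * (1 - b), feat[v0][u0 + 1]),
        ((1 - a) * b, feat[v0 + 1][u0]),
        (a * b, feat[v0 + 1][u0 + 1]),
    ]
    channels = len(feat[0][0])
    return [sum(wgt * f[c] for wgt, f in corners) for c in range(channels)]
```

In use, one would call `world_to_pixel` once per camera and then `bilinear` on that camera's feature map (for example, one produced by a visual foundation model) to get a per-view feature for the query point; the same interpolation applies to a depth image treated as a one-channel map.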

Paper Structure (14 sections, 9 equations, 7 figures)

This paper contains 14 sections, 9 equations, 7 figures.

Figures (7)

  • Figure 1: D$^3$Fields Representation and Application to Zero-Shot Rearrangement Tasks. D$^3$Fields take in multi-view RGBD images and encode semantic features and instance masks using foundation models. The descriptor fields visualized in the bottom left using Principal Component Analysis (PCA) demonstrate consistent features across instances. We use our representation for rearrangement tasks given 2D goal images with diverse instances and styles in a zero-shot manner. We address pick-and-place tasks such as shoe organization and tasks requiring dynamic modeling like collecting debris. We also show that our framework can accomplish 3D manipulation and compositional task specification in the table organization task.
  • Figure 2: Overview of the Proposed Framework. (a) Multi-view RGBD observations are first processed by foundation models to obtain the feature volume $\mathcal{W}$. The implicit function $\mathcal{F}$ takes in arbitrary 3D points and outputs the corresponding distance $d$, semantic features $\mathbf{f}$, and instance probability $\mathbf{p}$. (b) Through marching cubes, we can reconstruct the mesh from the implicit signed distance function. Since our representation also encodes instance masks and semantic features for evaluated 3D points, we can construct meshes for the mask field and descriptor field as well. (c) Given a 2D goal image, we use foundation models to extract the descriptor map. We then match 3D features to 2D features and define the planning cost based on the correspondence.
  • Figure 3: Notation Illustration. $r_i$ is the distance between a 3D point $\mathbf{x}$ and camera $i$, and $r'_i$ is the interpolated depth from the depth image.
  • Figure 4: Object Set in Our Experiments. This figure shows the diverse objects used in our experiments, spanning over 10 object types.
  • Figure 5: Correspondence Qualitative Comparison. We select a pixel from the source image, obtain the associated DINOv2 feature, and visualize the correspondence heatmap on the reconstructed mesh. (a) Our representation reconstructs a clear mesh and maps the source pixel to semantically similar 3D areas. (b) F3RM (Shen et al., 2023) can construct a reasonable mesh in the shoe scene and establish rough correspondences, but fails in other scenes. (c) Trained only on a small dataset, FeatureNeRF (Ye et al., 2023) fails to generalize to novel scenes. The reconstructed meshes are out of camera view, and the correspondence quality is poor.
  • ...and 2 more figures
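
The notation in Figures 2 and 3 suggests how a per-point distance and feature could be fused across views. The sketch below is one plausible reading, not the paper's formulation: the sign convention ($r'_i - r_i$, positive in front of the observed surface), the truncation threshold `mu`, the hard visibility test, and the plain averaging are all assumptions made for illustration, and the names `truncated_distance` and `fuse_views` are hypothetical.

```python
from typing import List, Optional, Tuple


def truncated_distance(r: float, r_obs: float, mu: float) -> float:
    """Signed distance along the camera ray, clipped to [-mu, mu].

    r     : distance from the camera to the query point (r_i in Figure 3)
    r_obs : depth interpolated from the depth image (r'_i in Figure 3)
    Positive values mean the point lies in front of the observed surface
    (an assumed sign convention).
    """
    return max(-mu, min(mu, r_obs - r))


def fuse_views(views: List[Tuple[float, float, List[float]]],
               mu: float = 0.02) -> Optional[Tuple[float, List[float]]]:
    """Fuse per-view (r_i, r'_i, feature_i) tuples for a single 3D point.

    Returns (distance, feature), or None if every view is occluded.
    """
    # Keep only views where the point is not far behind the observed surface.
    visible = [(truncated_distance(r, r_obs, mu), f)
               for r, r_obs, f in views
               if r_obs - r > -mu]
    if not visible:
        return None
    n = len(visible)
    # Unweighted mean over visible views (an assumption; other weightings are possible).
    d = sum(di for di, _ in visible) / n
    dim = len(visible[0][1])
    f = [sum(fi[k] for _, fi in visible) / n for k in range(dim)]
    return d, f
```

Evaluating such a function on a regular grid of query points would give the kind of signed distance volume that marching cubes, as mentioned in the Figure 2 caption, turns into a mesh.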