
DNAct: Diffusion Guided Multi-Task 3D Policy Learning

Ge Yan, Yueh-Hua Wu, Xiaolong Wang

TL;DR

DNAct is a unified framework with two components. The first is NeRF-based neural rendering pre-training, which distills 2D foundation-model semantics into a 3D scene representation and produces a 3D encoder that is frozen afterwards. The second is diffusion-guided feature learning, which captures the multi-modality of action sequences across tasks. By framing robotic manipulation as keyframe prediction and fusing the 3D semantic features with diffusion-conditioned representations, DNAct generalizes from limited demonstrations. The method delivers substantial improvements over state-of-the-art NeRF-based baselines in both simulated RLBench tasks and real-robot experiments, while keeping the model compact and inference fast. Integrating 3D semantic priors with diffusion-based multi-modal learning also improves robustness and generalization to unseen objects and arrangements.

Abstract

This paper presents DNAct, a language-conditioned multi-task policy framework that integrates neural rendering pre-training and diffusion training to enforce multi-modality learning in action sequence spaces. To learn a generalizable multi-task policy from few demonstrations, the pre-training phase of DNAct uses neural rendering to distill 2D semantic features from foundation models such as Stable Diffusion into 3D space, which provides a comprehensive semantic understanding of the scene. Consequently, it can be applied to challenging robotic tasks that require rich 3D semantics and accurate geometry. Furthermore, we introduce a novel approach that uses diffusion training to learn a vision and language feature encapsulating the inherent multi-modality of the multi-task demonstrations. By reconstructing the action sequences from different tasks via the diffusion process, the model learns to distinguish different modalities, which improves the robustness and generalizability of the learned representation. DNAct significantly surpasses state-of-the-art NeRF-based multi-task manipulation approaches, with over 30% improvement in success rate. Project website: dnact.github.io.
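
The abstract describes reconstructing action sequences through a diffusion process conditioned on a learned vision and language feature. The sketch below illustrates what one such training step could look like under the standard DDPM noise-prediction formulation, which is an assumption here since this page does not give the paper's exact objective. The linear beta schedule, the dimensions (`T`, `action_dim`, `feat_dim`), and the one-layer `toy_denoiser` are all hypothetical stand-ins, not the authors' architecture.

```python
import numpy as np

def make_alpha_bar(num_steps=100, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_k) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(a0, k, alpha_bar, rng):
    """Forward process: sample a_k ~ q(a_k | a_0) and return the noise that was used."""
    eps = rng.standard_normal(a0.shape)
    a_k = np.sqrt(alpha_bar[k]) * a0 + np.sqrt(1.0 - alpha_bar[k]) * eps
    return a_k, eps

def toy_denoiser(a_k, k, feature, weights):
    """Hypothetical stand-in for the noise-prediction network.

    Concatenates the flattened noisy actions, the normalized step index, and
    the conditioning feature, then applies a single linear layer.
    """
    x = np.concatenate([a_k.ravel(), [k / weights["num_steps"]], feature])
    return (weights["W"] @ x).reshape(a_k.shape)

def denoising_loss(pred_eps, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((pred_eps - eps) ** 2))

rng = np.random.default_rng(0)
T, action_dim, feat_dim, num_steps = 4, 7, 16, 100  # illustrative sizes only
alpha_bar = make_alpha_bar(num_steps)

a0 = rng.uniform(-1.0, 1.0, size=(T, action_dim))   # clean action sequence
feature = rng.standard_normal(feat_dim)             # vision and language feature

in_dim = T * action_dim + 1 + feat_dim
weights = {
    "W": 0.01 * rng.standard_normal((T * action_dim, in_dim)),
    "num_steps": num_steps,
}

k = int(rng.integers(0, num_steps))                 # random diffusion step
a_k, eps = add_noise(a0, k, alpha_bar, rng)
loss = denoising_loss(toy_denoiser(a_k, k, feature, weights), eps)
print(f"diffusion step {k}: noise-prediction loss = {loss:.3f}")
```

In a real training loop the gradient of this loss would flow into both the denoiser and the feature extractor, which is how the conditioning feature would come to encode the multi-modal structure of the demonstrations.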

Paper Structure (16 sections, 7 equations, 9 figures, 7 tables)


Figures (9)

  • Figure 1: We propose DNAct, a novel multi-task object manipulation approach that utilizes knowledge distillation and diffusion training to obtain semantic-aware and multi-modal representations. We visualize our pre-trained semantic-aware representations, demonstrating that they accurately capture the semantics in both simulated and real-world tasks by leveraging neural rendering for pre-training.
  • Figure 2: Similarity in multi-task demonstrations. Although trajectories in multi-task datasets originate from different tasks, they exhibit similarities, because similar operations and sub-trajectories are often employed across tasks.
  • Figure 3: Overview of the proposed DNAct. The upper section of the figure shows the pre-training component, which is frozen during the subsequent training phase, as indicated by the snowflake icon. The area shaded in gray does not participate in that training phase; only the 3D encoder is used, to provide generic semantic features. The lower section of the figure shows the training phase, where the diffusion training and the policy MLP are jointly optimized. $a_0^T$ denotes an action sequence of length $T$, and $\mathcal{N}$ is the normal distribution.
  • Figure 4: The ten RLBench and five real robot tasks in our experiments.
  • Figure 5: Average success rates across 3 seeds on RLBench. The error bar shows one standard deviation.
  • ...and 4 more figures