DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation

Jingyi Tian, Le Wang, Sanping Zhou, Sen Wang, Jiayi Li, Gang Hua

TL;DR

DynaRend learns a 3D-aware, dynamics-informed representation from multi-view RGB-D video by combining masked reconstruction with future prediction, both supervised through differentiable volumetric rendering. Scene geometry is projected into triplane features, and the reconstruction and future-prediction objectives are trained jointly so that a single 3D representation captures geometry, dynamics, and semantics. The pretrained triplane features are then fine-tuned with an action decoder that predicts action value maps, enabling language-conditioned manipulation across diverse tasks and perturbations. Experiments on RLBench and Colosseum and on a real robot show substantial improvements in policy success rate, generalization to environmental changes, and practical applicability, which points to rendering-based future prediction as a promising route to scalable robot learning.

Abstract

Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
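The abstract relies on differentiable volumetric rendering to turn a 3D representation into 2D supervision targets. As a rough illustration only, the sketch below alpha-composites samples along a single ray using the standard volume-rendering quadrature. The function name `composite_ray` and its inputs are hypothetical and are not taken from the paper; how DynaRend samples rays or decodes densities from triplane features is not described here and is not assumed.

```python
import math


def composite_ray(sigmas, values, depths):
    """Alpha-composite per-sample values along one ray.

    sigmas: non-negative densities, one per sample
    values: per-sample vectors (e.g. RGB as a list of 3 floats)
    depths: increasing sample distances along the ray
    Returns (rendered_value, expected_depth, weights).
    """
    n = len(sigmas)
    # Spacing between consecutive samples; the last sample reuses the
    # previous spacing (an illustrative choice, not the paper's).
    deltas = [depths[i + 1] - depths[i] for i in range(n - 1)]
    deltas.append(deltas[-1] if deltas else 1.0)

    transmittance = 1.0  # fraction of light that has survived so far
    weights = []
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha

    channels = len(values[0])
    rendered = [
        sum(w * v[c] for w, v in zip(weights, values)) for c in range(channels)
    ]
    expected_depth = sum(w * d for w, d in zip(weights, depths))
    return rendered, expected_depth, weights
```

The same weights composite any per-sample quantity, which is why a single pass can yield color, depth, and a semantic feature map; every step is differentiable with respect to the densities and values, so a loss on the rendered outputs can propagate back into the 3D representation.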

Paper Structure

This paper contains 49 sections, 7 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Comparison of representation learning paradigms for robot learning. (a) Learning predictive 2D representations [hu2024video] by forecasting future frames from the current observation to capture future dynamics. (b) Learning semantic or geometric features through reconstruction of static scenes using MAE [qian20243d] or 3D reconstruction [ze2023gnfactor]. (c) Our approach leverages differentiable volumetric rendering to jointly learn semantics, geometry, and dynamics in a unified 3D representation.
  • Figure 2: DynaRend framework overview. (a) We reconstruct the point cloud from multi-view RGB-D inputs, encode it with an MLP, and project it onto three orthogonal planes to produce triplane features. (b) We mask a subset of the triplane features and pass them sequentially through a reconstructive network and a predictive network to obtain current and future scene representations. For pretraining, both triplane volumes are rendered into RGB, depth, and semantic maps via volumetric rendering and supervised by the corresponding current and future target views. (c) For finetuning, the two networks together serve as the triplane encoder and are trained with an action decoder on demonstration data.
  • Figure 3: Ablation on mask ratio.
  • Figure 4: Results on Colosseum.
  • Figure 5: Real-world setup and task examples. We evaluate on five manipulation tasks: $\mathtt{Put\ Item\ in\ Drawer}$, $\mathtt{Close\ Pot}$, $\mathtt{Stack\ Blocks}$, $\mathtt{Sort\ Shape}$, $\mathtt{Stack\ Cups}$.
  • ...and 5 more figures
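
The Figure 2 caption describes projecting encoded point features onto three orthogonal planes and then masking a subset of the resulting triplane features. The sketch below is one minimal way to picture those two steps, under stated assumptions: per-point features are taken as given (the paper encodes them with an MLP), cells are filled by mean pooling, and masked cells are overwritten with a constant placeholder. The pooling rule, the placeholder, the scene bounds, and the names `project_to_triplane` and `mask_plane` are all hypothetical and are not confirmed by the source.

```python
import random


def project_to_triplane(points, features, resolution, bounds=(-1.0, 1.0)):
    """Mean-pool per-point features into three axis-aligned planes (XY, XZ, YZ)."""
    lo, hi = bounds
    dim = len(features[0])
    axes = {"xy": (0, 1), "xz": (0, 2), "yz": (1, 2)}
    sums = {name: {} for name in axes}
    counts = {name: {} for name in axes}

    def to_cell(coord):
        # Map a coordinate in [lo, hi] to a cell index in [0, resolution - 1].
        idx = int((coord - lo) / (hi - lo) * resolution)
        return min(max(idx, 0), resolution - 1)

    for point, feat in zip(points, features):
        for name, (a, b) in axes.items():
            cell = (to_cell(point[a]), to_cell(point[b]))
            acc = sums[name].setdefault(cell, [0.0] * dim)
            for k in range(dim):
                acc[k] += feat[k]
            counts[name][cell] = counts[name].get(cell, 0) + 1

    planes = {}
    for name in axes:
        # Cells that receive no points keep an all-zero feature vector.
        plane = [[[0.0] * dim for _ in range(resolution)] for _ in range(resolution)]
        for (i, j), acc in sums[name].items():
            n = counts[name][(i, j)]
            plane[i][j] = [v / n for v in acc]
        planes[name] = plane
    return planes


def mask_plane(plane, mask_ratio, mask_value=0.0, seed=0):
    """Overwrite a random fraction of cells; return (masked_plane, masked_cells)."""
    resolution = len(plane)
    dim = len(plane[0][0])
    cells = [(i, j) for i in range(resolution) for j in range(resolution)]
    rng = random.Random(seed)
    masked_cells = set(rng.sample(cells, int(round(mask_ratio * len(cells)))))
    masked = [
        [
            [mask_value] * dim if (i, j) in masked_cells else list(plane[i][j])
            for j in range(resolution)
        ]
        for i in range(resolution)
    ]
    return masked, masked_cells
```

In a setup like the one the caption describes, the masked planes would be the input to the reconstructive network, and the unmasked scene (rendered into target views) would provide the supervision; the mask ratio is the quantity varied in the Figure 3 ablation.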