Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning

Qiwei Liang, Boyang Cai, Minghao Lai, Sitong Zhuang, Tao Lin, Yan Qin, Yixuan Ye, Jiaming Liang, Renjing Xu

TL;DR

Robotic manipulation benefits from 3D representations that capture temporal dynamics, but existing 3D pretraining often lacks motion modeling and relies on reconstruction. AFRO introduces dynamics-aware, action-free 3D pretraining by embedding latent actions and diffusion-based forward dynamics in latent space, coupled with feature differencing and inverse-consistency to prevent shortcut learning. The approach, trained without action labels or scene reconstruction, yields state-of-the-art performance on both simulated and real-world manipulation tasks and scales effectively with data and task diversity, including large-scale out-of-domain pretraining. These results highlight the practical potential of dynamics-grounded 3D representations for generalizable and robust robotic manipulation.

Abstract

Despite strong results on recognition and segmentation, current 3D visual pre-training methods often underperform on robotic manipulation. We attribute this gap to two factors: the lack of state-action-state dynamics modeling and the unnecessary redundancy of explicit geometric reconstruction. We introduce AFRO, a self-supervised framework that learns dynamics-aware 3D representations without action or reconstruction supervision. AFRO casts state prediction as a generative diffusion process and jointly models forward and inverse dynamics in a shared latent space to capture causal transition structure. To prevent feature leakage in action learning, we employ feature differencing and inverse-consistency supervision, improving the quality and stability of visual features. When combined with Diffusion Policy, AFRO substantially increases manipulation success rates across 16 simulated and 4 real-world tasks, outperforming existing pre-training approaches. The framework also scales favorably with data volume and task complexity. Qualitative visualizations indicate that AFRO learns semantically rich, discriminative features, offering an effective pre-training solution for 3D representation learning in robotics. Project page: https://kolakivy.github.io/AFRO/
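
As a rough illustration of the feature differencing described above, the following NumPy sketch feeds only the difference of two encoded states to an inverse dynamics model, so the inferred latent action reflects the change between states rather than the content of either one. The layer layout (two Linear-GELU layers followed by a Linear projection) follows the Figure 3 caption. The feature, hidden, and action sizes, the random initialization, and the names `InverseDynamicsMLP` and `latent_actions` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class InverseDynamicsMLP:
    """Two Linear-GELU layers followed by a final Linear projection.

    All dimensions and the initialization scheme are illustrative assumptions.
    """

    def __init__(self, feat_dim, hidden_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)

        def init(n_in, n_out):
            weight = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
            return weight, np.zeros(n_out)

        self.w1, self.b1 = init(feat_dim, hidden_dim)
        self.w2, self.b2 = init(hidden_dim, hidden_dim)
        self.w3, self.b3 = init(hidden_dim, action_dim)

    def __call__(self, delta_z):
        h = gelu(delta_z @ self.w1 + self.b1)
        h = gelu(h @ self.w2 + self.b2)
        return h @ self.w3 + self.b3


def latent_actions(g, z_t, z_tk):
    """Infer forward and backward latent actions from feature differences only."""
    forward = g(z_tk - z_t)   # alpha_{t -> t+k}
    backward = g(z_t - z_tk)  # alpha_{t+k -> t}
    return forward, backward


# Random stand-ins for encoder outputs (batch of 4, 32-dim features).
rng = np.random.default_rng(1)
z_t = rng.normal(size=(4, 32))
z_tk = rng.normal(size=(4, 32))

g = InverseDynamicsMLP(feat_dim=32, hidden_dim=64, action_dim=16)
a_fwd, a_bwd = latent_actions(g, z_t, z_tk)

# Adding the same offset to both features leaves the difference, and hence
# the inferred action, unchanged up to floating-point rounding.
offset = rng.normal(size=(1, 32))
a_fwd_shifted, _ = latent_actions(g, z_t + offset, z_tk + offset)
```

Because the model never receives $\mathbf{z}_{t+k}$ directly, it cannot simply copy the target feature into the latent action, which is one way to read the "feature leakage" concern raised in the abstract.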

Paper Structure

This paper contains 31 sections, 8 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: (a) The relationship between robot manipulation in real space and its abstraction in latent space. (b) Our framework learns dynamics-aware 3D visual features in latent space, replacing static representations without relying on explicit action labels or reconstruction. (c) AFRO achieves higher success rates and stronger generalization than baseline methods in both simulation and real-world tasks.
  • Figure 2: Overall framework of our method. (a) Predict Future: Given point clouds $\mathcal{P}_{t}$ and $\mathcal{P}_{t+k}$, the online encoder $f_{\phi}$ encodes both to obtain $\mathbf{z}_{t}$ and $\mathbf{z}_{t+k}$. The inverse dynamics model $g_{\psi}$ takes the difference $(\mathbf{z}_{t+k}-\mathbf{z}_{t})$ to infer the forward latent action $\boldsymbol{\alpha}_{t\rightarrow t+k}$. The target encoder $f_{\xi}$ (EMA-updated) encodes only $\mathcal{P}_{t+k}$ to yield the teacher target $\tilde{\mathbf{z}}_{t+k}$. The forward dynamics model $h_{\theta}$ predicts the future feature by mapping $(\mathbf{z}_{t}, \boldsymbol{\alpha}_{t\rightarrow t+k}) \mapsto \hat{\mathbf{z}}_{t+k}$ and aligning it to $\tilde{\mathbf{z}}_{t+k}$, which explicitly drives $f_{\phi}$ to learn dynamics-aware representations. (b) Predict History: Different from (a), using the difference $(\mathbf{z}_{t}-\mathbf{z}_{t+k})$, $g_{\psi}$ infers the backward latent action $\boldsymbol{\alpha}_{t+k\rightarrow t}$; $f_{\xi}$ encodes only $\mathcal{P}_{t}$ to obtain $\tilde{\mathbf{z}}_{t}$; then $h_{\theta}$ maps $(\mathbf{z}_{t+k}, \boldsymbol{\alpha}_{t+k\rightarrow t}) \mapsto \hat{\mathbf{z}}_{t}$ and aligns it with $\tilde{\mathbf{z}}_{t}$. Other steps are symmetric to (a) and omitted for brevity. (c) Notation Summary: $\mathcal{P}_t$ point cloud; $f_{\phi}$ online encoder; $f_{\xi}$ target encoder; $\mathbf{z}$ student feature; $\tilde{\mathbf{z}}$ teacher feature; $g_{\psi}$ inverse dynamics model; $\boldsymbol{\alpha}$ latent action; $h_{\theta}$ forward dynamics model.
  • Figure 3: Given the differential feature $(\mathbf{z}_{t+k} - \mathbf{z}_{t})$ or $(\mathbf{z}_{t} - \mathbf{z}_{t+k})$, the inverse dynamics model $g_{\psi}$ passes it through two stacked Linear–GELU layers followed by a final Linear projection to produce the latent action $\boldsymbol{\alpha}_{t\rightarrow t+k}$ or $\boldsymbol{\alpha}_{t+k\rightarrow t}$, which encodes the implicit motion between consecutive states.
  • Figure 4: Architecture of the forward dynamics model. An AdaLN-Zero diffusion transformer $h_{\theta}$ is conditioned on $\mathbf{z}_{t}$, the latent action $\boldsymbol{\alpha}_{t\rightarrow t+k}$, and the timestep $\tau$. These inputs are encoded by MLPs and concatenated, and the result modulates LayerNorm through an adaptive scale and shift. Attention and projector modules then denoise the latent to reconstruct the future representation $\hat{\mathbf{z}}_{t+k}$.
  • Figure 5: Scaling across task domains. Comparison between in-domain (solid) and multi-domain (hatched) visual pretraining on four tasks. AFRO consistently benefits from multi-domain pretraining and even reaches $100\%$ success on Peg Unplug Side, whereas static and dynamic baselines show smaller or inconsistent gains.
  • ...and 5 more figures
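
The adaptive scale and shift conditioning mentioned in the Figure 4 caption can be sketched in a few lines of NumPy. This is a minimal sketch of a generic AdaLN-Zero block, not the authors' implementation: the attention sublayer is replaced by a single linear map, the MLP encoders for the conditioning inputs are omitted, and all sizes and names (`AdaLNZeroBlock`, `w_mod`, `w_sub`) are assumptions. It shows the defining property of the zero initialization: the block starts out as the identity and only departs from it once the modulation weights become non-zero.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # LayerNorm without learned affine parameters; AdaLN supplies scale/shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def silu(x):
    return x / (1.0 + np.exp(-x))


class AdaLNZeroBlock:
    """Residual block whose LayerNorm is modulated by a conditioning vector."""

    def __init__(self, dim, cond_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Zero-initialized modulation: shift, scale, and gate all start at 0.
        self.w_mod = np.zeros((cond_dim, 3 * dim))
        self.b_mod = np.zeros(3 * dim)
        # Stand-in sublayer (assumption): one linear map instead of attention.
        self.w_sub = rng.normal(0.0, 1.0 / np.sqrt(dim), size=(dim, dim))

    def __call__(self, x, cond):
        mod = silu(cond) @ self.w_mod + self.b_mod
        shift, scale, gate = np.split(mod, 3, axis=-1)
        h = layer_norm(x) * (1.0 + scale) + shift
        return x + gate * (h @ self.w_sub)


rng = np.random.default_rng(0)
batch, dim, act_dim = 2, 8, 4

z_t = rng.normal(size=(batch, dim))        # current-state feature
alpha = rng.normal(size=(batch, act_dim))  # latent action
tau = rng.uniform(size=(batch, 1))         # diffusion timestep in [0, 1)

# Conditioning by concatenation; the per-input MLP encoders are omitted here.
cond = np.concatenate([z_t, alpha, tau], axis=-1)
noisy = rng.normal(size=(batch, dim))      # noisy latent to be denoised

block = AdaLNZeroBlock(dim, cond.shape[-1])
out_init = block(noisy, cond)              # identity at initialization

# Simulate trained modulation weights with small random values.
block.w_mod = rng.normal(0.0, 0.1, size=block.w_mod.shape)
out_modulated = block(noisy, cond)
```

With zero-initialized modulation the gate is exactly zero, so `out_init` equals the input; this is the usual motivation for AdaLN-Zero, since each block behaves as an identity mapping at the start of training.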