R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation

Yuhao Zhang; Wanxi Dong; Yue Shi; Yi Liang; Jingnan Gao; Qiaochu Yang; Yaxing Lyu; Zhixuan Liang; Yibin Liu; Congsheng Xu; Xianda Guo; Wei Sui; Yaohui Jin; Xiaokang Yang; Yanyan Xu; Yao Mu

Abstract

Embodied manipulation requires accurate 3D understanding of objects and their spatial relations to plan and execute contact-rich actions. While large-scale 3D vision models provide strong priors, their computational cost incurs prohibitive latency for real-time control. We propose Real-time 3D-aware Policy (R3DP), which integrates powerful 3D priors into manipulation policies without sacrificing real-time performance. The core innovation of R3DP is an asynchronous fast-slow collaboration module: the pre-trained slow system (VGGT) is queried only on sparse key frames, while a lightweight Temporal Feature Prediction Network (TFPNet) predicts features for all intermediate frames. By exploiting temporal correlations in historical data, TFPNet produces consistent feature estimates, which in turn improves task success rates. Additionally, to enable more effective multi-view fusion, we introduce a Multi-View Feature Fuser (MVFF) that aggregates features across views by explicitly incorporating camera intrinsics and extrinsics. R3DP offers a plug-and-play solution for integrating large models into real-time inference systems. We evaluate R3DP against multiple baselines across different visual configurations. R3DP outperforms single-view and multi-view DP by 32.9% and 51.4% in average success rate, respectively. Furthermore, by decoupling heavy 3D reasoning from policy execution, R3DP achieves a 44.8% reduction in inference time compared to a naive DP+VGGT integration.
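The key-frame scheduling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`run_fast_slow`, `toy_slow_model`, `toy_fast_predictor`), the key-frame interval of 4, and the toy arithmetic are all assumptions made for illustration, and the loop is synchronous for simplicity, whereas the paper describes an asynchronous module. The sketch only shows the routing: the expensive model runs on sparse key frames, and a cheap predictor that sees past features fills in the frames between them.

```python
from typing import Callable, List, Sequence, Tuple

Feature = List[float]


def run_fast_slow(
    frames: Sequence[Feature],
    slow_model: Callable[[Feature], Feature],
    fast_predictor: Callable[[Feature, List[Feature]], Feature],
    keyframe_interval: int = 4,  # assumed value, not taken from the paper
) -> Tuple[List[Feature], List[str]]:
    """Return one feature per frame and the path ("slow"/"fast") that produced it."""
    features: List[Feature] = []
    sources: List[str] = []
    for t, frame in enumerate(frames):
        if t % keyframe_interval == 0:
            # Sparse key frame: query the expensive model (stand-in for VGGT).
            feat = slow_model(frame)
            sources.append("slow")
        else:
            # Intermediate frame: cheap prediction from the current frame and
            # previously computed features (stand-in for TFPNet).
            feat = fast_predictor(frame, features)
            sources.append("fast")
        features.append(feat)
    return features, sources


def toy_slow_model(frame: Feature) -> Feature:
    # Hypothetical placeholder for a large 3D foundation model.
    return [2.0 * x for x in frame]


def toy_fast_predictor(frame: Feature, history: List[Feature]) -> Feature:
    # Hypothetical placeholder: blend the most recent feature with the frame.
    last = history[-1]
    return [0.5 * h + 0.5 * x for h, x in zip(last, frame)]


if __name__ == "__main__":
    frames = [[1.0], [2.0], [3.0], [4.0], [5.0]]
    feats, srcs = run_fast_slow(frames, toy_slow_model, toy_fast_predictor)
    print(srcs)
    print(feats)
```

In this toy setting the slow path is called on two of the five frames. In a real system the saving would depend on the relative cost of the two models and on how the slow model's results are delivered to the control loop.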

Paper Structure (25 sections, 9 equations, 8 figures, 11 tables)

Introduction
Related Work
Preliminary
Methodology
Asynchronous Fast-Slow Collaboration
Temporal Feature Prediction Network
Multi-View Feature Fuser
Experiments
Experimental Setup on Simulation Benchmark
Results on Simulation Benchmark
Ablation Study
Real-world experiments
Conclusion
Simulation Task Description
Tube Insert Task Details
...and 10 more sections

Figures (8)

Figure 1: Key modules and performance of R3DP. Our framework explicitly integrates 3D priors from large-scale foundation models (e.g., VGGT) via an asynchronous fast-slow collaboration mechanism. Overall, R3DP achieves real-time, 3D-aware inference, significantly improving both manipulation success rates and processing frequency.
Figure 2: Overview of the R3DP architecture. R3DP serves as a 3D-aware perception module that seamlessly replaces visual encoders in existing imitation learning frameworks. Within the AFSC module (section "Asynchronous Fast-Slow Collaboration"), sparse keyframes are processed by a 3D foundation model (VGGT), while intermediate frames are handled by our TFPNet (section "Temporal Feature Prediction Network") for real-time temporal reasoning. The MVFF module (section "Multi-View Feature Fuser") leverages cross-attention with PRoPE to fuse 2D-3D features into consistent multi-view representations for control.
Figure 3: Architecture and training objective of TFPNet. For clarity, we show the unrolled structure for the first two timesteps; in practice, the network is trained over a sequence of four timesteps. TFPNet leverages historical information to augment current observations, enabling 3D-aware control with real-time inference efficiency.
Figure 4: Visualization of depth maps decoded from VGGT features and from our TFPNet-predicted features passed through VGGT's depth decoder. The close visual agreement indicates that our lightweight TFPNet effectively captures the information generated by the 3D foundation model in both simulation and real-world experiments.
Figure 5: Real-world experimental platforms and point cloud inputs. We evaluate our method on the ArmBot-Y1 bimanual robot for tasks including Place Shoe and Place Glass Cup, and a single-arm robot for Pick Peach and Stack Bowls. Both platforms are equipped with dual RealSense D435 cameras to generate the ground-truth point clouds used by the DP3 baseline.
...and 3 more figures
