Spatially Visual Perception for End-to-End Robotic Learning

Travis Davies, Jiahuan Yan, Xiang Chen, Yu Tian, Yueting Zhuang, Yiqi Huang, Luhui Hu

TL;DR

This paper introduces a video-based spatial perception framework that uses 3D spatial representations to handle environmental variability, particularly lighting changes. It combines a novel image augmentation technique, AugBlender, with a state-of-the-art monocular depth estimation model trained on internet-scale data.

Abstract

Recent advances in imitation learning have shown significant promise for robotic control and embodied intelligence. However, achieving robust generalization across diverse mounted camera observations remains a critical challenge. In this paper, we introduce a video-based spatial perception framework that leverages 3D spatial representations to address environmental variability, with a focus on handling lighting changes. Our approach integrates a novel image augmentation technique, AugBlender, with a state-of-the-art monocular depth estimation model trained on internet-scale data. Together, these components form a cohesive system designed to enhance robustness and adaptability in dynamic scenarios. Our results demonstrate that our approach significantly boosts the success rate across diverse camera exposures, where previous models experience performance collapse. Our findings highlight the potential of video-based spatial perception models in advancing robustness for end-to-end robotic learning, paving the way for scalable, low-cost solutions in embodied intelligence.

Paper Structure

This paper contains 23 sections, 5 figures, 1 table, and 1 algorithm.

Figures (5)

  • Figure 1: An overview of our proposed perception model for robot policy learning, demonstrating the fusion of RGB and depth information to enhance perception robustness. The proposed model effectively handles natural corruptions, such as lighting changes, using multimodal inputs aligned with depth maps.
  • Figure 2: Comparison of RGB frames and depth maps from the wrist camera shows consistent and robust depth estimation despite exposure variations, highlighting its reliability under different lighting conditions.
  • Figure 3: Task Demonstrations: Experimental setup showcasing five key frames to illustrate the progression of each task.
  • Figure 4: Equipment used in this study: (a) the robot arm; (b) blocks for the PickSmall and PickBig tasks; (c) cups for the CupStack task; and (d) one of the two cameras used for capturing RGB data.
  • Figure 5: Success rates across exposure levels for each task: (a) CupStack, (b) PickBig, and (c) PickSmall. The proposed model demonstrates stable performance under significant variations in camera exposure.