Depth Helps: Improving Pre-trained RGB-based Policy with Depth Information Injection

Xincheng Pang, Wenke Xia, Zhigang Wang, Bin Zhao, Di Hu, Dong Wang, Xuelong Li

TL;DR

DI$^2$ tackles the RGB-only limitation in robotic manipulation by fine-tuning with RGB-D trajectories and deploying with RGB inputs only. It introduces a Depth Completion Module (DCM) that predicts depth features $\hat{f}^{depth}_t = \mathrm{DCM}(f^{rgb}_t, P)$ using a Perceiver Resampler with $k$ learnable tokens $P$, and a Depth-Aware Codebook (DAC) that discretizes depth features via a codebook $Z \in \mathbb{R}^{N \times d}$ to yield $\tilde{f}^{depth}_t = \mathbf{q}(f^{depth}_t \mid Z)$; this reduces noise and the accumulation of errors over time. Training proceeds in three stages (warm-up, alignment with $L_{dcm}$, and codebook training with CVQ-VAE-inspired updates) to integrate depth priors while keeping RGB-based deployment lightweight. Experiments on the LIBERO benchmark and real-world robot tasks show that DI$^2$ improves depth-informed decision making and maintains strong RGB-only performance, enabling robust fine-grained manipulation without depth sensors during deployment.
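The quantization step $\mathbf{q}(\cdot \mid Z)$ can be illustrated with a minimal sketch. It assumes a plain nearest-neighbor lookup in Euclidean distance, as in standard vector quantization; the summary above does not spell out the exact rule the paper uses, and the function name `quantize` and the toy codebook values are hypothetical.

```python
import math

def quantize(feature, codebook):
    """Return (index, entry) of the codebook row nearest to `feature`.

    Assumption: nearest neighbor under Euclidean distance, as in standard
    vector quantization. The paper's exact rule may differ.
    """
    best_index = min(
        range(len(codebook)),
        key=lambda i: math.dist(feature, codebook[i]),
    )
    return best_index, codebook[best_index]

# Toy codebook Z with N = 3 entries of dimension d = 2 (made-up values).
Z = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]

# A noisy depth feature close to the second entry snaps onto that entry,
# which is how discretization can suppress small prediction noise.
index, f_tilde = quantize([0.9, 1.2], Z)
print(index, f_tilde)  # prints: 1 [1.0, 1.0]
```

Because every predicted feature is replaced by one of $N$ fixed entries, small deviations in the prediction do not propagate as long as the nearest entry stays the same.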

Abstract

3D perception ability is crucial for generalizable robotic manipulation. While recent foundation models have made significant strides in perception and decision-making with RGB-based input, their lack of 3D perception limits their effectiveness in fine-grained robotic manipulation tasks. To address these limitations, we propose a Depth Information Injection ($\mathbf{DI}^{\mathbf{2}}$) framework that leverages the RGB-Depth modality for policy fine-tuning, while relying solely on RGB images for robust and efficient deployment. Concretely, we introduce the Depth Completion Module (DCM) to extract the spatial prior knowledge related to depth information and generate virtual depth information from RGB inputs to aid policy deployment. Further, we propose the Depth-Aware Codebook (DAC) to eliminate noise and reduce the cumulative error from the depth prediction. In the inference phase, this framework employs RGB inputs and accurately predicted depth data to generate the manipulation action. We conduct experiments in simulated LIBERO environments and real-world scenarios, and the experimental results show that our method effectively enhances the pre-trained RGB-based policy with 3D perception ability for robotic manipulation. The website is released at https://gewu-lab.github.io/DepthHelps-IROS2024.

Paper Structure (17 sections, 8 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: We propose the Depth Information Injection framework to inject the spatial prior knowledge from the depth information into the RGB-based policy.
  • Figure 2: Overview of our framework. Icons in the figure mark which model parameters are updated and which are frozen. During training, we use the collected depth image to train the Depth Completion Module and Depth-Aware Codebook. During inference, we use the Depth Completion Module together with the RGB token to predict the depth token.
  • Figure 3: Qualitative results. This figure illustrates the spatial position perception capability enabled by depth features. The top row shows the effects of the model without depth features. The bottom row shows the results of our method, which can complete the depth features using the depth completion module.
  • Figure 4: Action Prediction Error. We evaluate the action sequences generated from predicted depth data against those derived from actual depth images and calculate the Euclidean distance between the actions at corresponding time steps.
  • Figure 5: Illustration of the real-world scene and the four tasks evaluated in our experiments.
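
The error metric described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, assuming each action is a fixed-length vector of floats; the function name `per_step_action_error` and the sample action values are hypothetical and not taken from the paper.

```python
import math

def per_step_action_error(predicted_actions, reference_actions):
    """Euclidean distance between two action sequences at each time step.

    Assumption: both sequences have the same length and each action is a
    vector of the same dimension.
    """
    if len(predicted_actions) != len(reference_actions):
        raise ValueError("action sequences must have the same length")
    return [
        math.dist(predicted, reference)
        for predicted, reference in zip(predicted_actions, reference_actions)
    ]

# Made-up 3-D actions: one sequence from real depth, one from predicted depth.
reference = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
predicted = [[0.0, 0.0, 0.0], [0.1, 0.3, 0.4]]

errors = per_step_action_error(predicted, reference)
print(errors)  # first step matches exactly; second step differs by about 0.5
```

Plotting these values against the time step shows whether the error grows as the episode progresses.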