RGBManip: Monocular Image-based Robotic Manipulation through Active Object Pose Estimation

Boshi An, Yiran Geng, Kai Chen, Xiaoqi Li, Qi Dou, Hao Dong

TL;DR

The paper tackles robust robotic manipulation from RGB inputs alone by introducing an active perception framework built around an eye-on-hand camera. It decouples manipulation into three coordinated modules: Global Scheduling (RL-based view planning), Active Perception (kinematics-guided multi-view pose estimation), and Impedance-based Manipulation (closed-loop control), which together yield accurate 6D pose estimates from monocular images. Through domain randomization and kinematics-guided fusion, the approach demonstrates state-of-the-art performance in simulation and succeeds in real-world experiments without relying on point-cloud data. This work advances practical RGB-only perception for manipulation and lays groundwork for broader RGB-centric perception research in robotics.

Abstract

Robotic manipulation requires accurate perception of the environment, which poses a significant challenge due to its inherent complexity and constantly changing nature. In this context, RGB images and point-cloud observations are two commonly used modalities in vision-based robotic manipulation, but each of these modalities has its own limitations. Commercial point-cloud observations often suffer from issues like sparse sampling and noisy output due to the limits of the emission-reception imaging principle. On the other hand, RGB images, while rich in texture information, lack the depth and 3D information crucial for robotic manipulation. To mitigate these challenges, we propose an image-only robotic manipulation framework that leverages an eye-on-hand monocular camera installed on the robot's parallel gripper. By moving with the robot gripper, this camera gains the ability to actively perceive the object from multiple perspectives during the manipulation process. This enables the estimation of 6D object poses, which can be utilized for manipulation. While obtaining images from more diverse viewpoints typically improves pose estimation, it also increases the manipulation time. To address this trade-off, we employ a reinforcement learning policy to synchronize the manipulation strategy with active perception, achieving a balance between 6D pose accuracy and manipulation efficiency. Our experimental results in both simulated and real-world environments showcase the state-of-the-art effectiveness of our approach. We believe that our method will inspire further research on real-world-oriented robotic manipulation.
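
The multi-view idea in the abstract can be illustrated with a minimal sketch. Because the camera rides on the gripper, its pose in the robot base frame is available from the arm's forward kinematics, so each per-view estimate can be moved into one shared frame before being combined. The code below is an illustration under stated assumptions, not the paper's fusion method: it handles only the position part of the pose, it combines views with a plain average, and the names `transform_point` and `fuse_views` are hypothetical.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat4 = Sequence[Sequence[float]]  # 4x4 homogeneous transform, row-major


def transform_point(T_base_cam: Mat4, p_cam: Vec3) -> List[float]:
    """Map a point from the camera frame into the robot base frame."""
    return [
        sum(T_base_cam[r][c] * p_cam[c] for c in range(3)) + T_base_cam[r][3]
        for r in range(3)
    ]


def fuse_views(camera_poses: Sequence[Mat4], estimates_cam: Sequence[Vec3]) -> List[float]:
    """Average per-view position estimates after expressing them in the base frame.

    camera_poses[i] is the camera-to-base transform for view i (assumed to
    come from forward kinematics); estimates_cam[i] is the object position
    predicted in that camera's frame. Plain averaging is an assumption made
    for illustration only.
    """
    if not camera_poses or len(camera_poses) != len(estimates_cam):
        raise ValueError("need one camera pose per estimate, and at least one view")
    points = [transform_point(T, p) for T, p in zip(camera_poses, estimates_cam)]
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(3)]


# View 1: camera frame coincides with the base frame.
T1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# View 2: camera rotated 90 degrees about z and shifted 1 m along x.
T2 = [[0, -1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

fused = fuse_views([T1, T2], [[1.0, 0.0, 0.5], [0.2, 0.0, 0.5]])
```

In this toy case the two views place the object at (1.0, 0.0, 0.5) and (1.0, 0.2, 0.5) in the base frame, so the fused estimate lies midway between them. A fuller version would also fuse orientation and could weight views by confidence.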

Paper Structure

This paper contains 26 sections, 3 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: An eye-on-hand camera captures multiple RGB images to estimate the object pose during the manipulation process.
  • Figure 2: In our pipeline, the Global Scheduling Policy serves as a high-level decision-making policy that schedules the Active Perception Module and the Manipulation Module. The Active Perception Module learns to perceive the environment and predict pose information with the help of a pre-trained segmentation model (SAM; Kirillov et al., 2023). The Manipulation Module completes the manipulation task through impedance control.
  • Figure 3: Category-level object pose estimation results for handles of different cabinets in the simulator (the first two rows) and the real world (the bottom two rows).
  • Figure 4: Performance of our method under different numbers of views.
  • Figure 5: Our method under different $\alpha$ for balancing A&E. The x-axis indicates the value of $\alpha$, the red curve corresponds to the pose estimation error, and the blue curve is the average moving distance during manipulation. The dots and bars are the mean and standard deviation, respectively, over 5 differently evaluated policies.
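
The trade-off that Figure 5 describes, pose error against moving distance as $\alpha$ varies, can be illustrated with a small sketch. The form below is an assumption, not the paper's reward: it treats $\alpha$ as a weight on moving distance in a linear cost, and the names `tradeoff_cost` and `best_candidate`, as well as all numbers, are hypothetical values chosen for illustration rather than reported results.

```python
from typing import List, Tuple

# (label, pose_error, moving_distance); made-up illustrative values.
Candidate = Tuple[str, float, float]


def tradeoff_cost(pose_error: float, moving_distance: float, alpha: float) -> float:
    """Linear cost: larger alpha penalizes camera motion more heavily (assumed form)."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    return pose_error + alpha * moving_distance


def best_candidate(candidates: List[Candidate], alpha: float) -> str:
    """Return the label of the candidate with the lowest trade-off cost."""
    return min(candidates, key=lambda c: tradeoff_cost(c[1], c[2], alpha))[0]


candidates = [
    ("1 view", 0.10, 0.2),
    ("2 views", 0.05, 0.5),
    ("3 views", 0.04, 0.9),
]

choice_accuracy_only = best_candidate(candidates, alpha=0.0)  # motion is free
choice_balanced = best_candidate(candidates, alpha=0.1)
choice_motion_heavy = best_candidate(candidates, alpha=1.0)
```

With these toy numbers, $\alpha = 0$ favors the most views, $\alpha = 1$ favors the fewest, and an intermediate value picks the middle option. That is the qualitative behavior one would expect when a single weight trades accuracy against efficiency.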