RoboPEPP: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training

Raktim Gautam Goswami, Prashanth Krishnamurthy, Yann LeCun, Farshad Khorrami

TL;DR

RoboPEPP tackles the challenge of estimating robot poses and unknown joint angles from monocular images by introducing embedding-predictive pre-training with joint masking to imbue the encoder with the robot's physical structure. A downstream Joint Net and Keypoint Net then predict joint angles and 2D keypoints, with pose recovered via EPnP using filtered correspondences; sim-to-real fine-tuning further boosts real-world performance. The approach yields state-of-the-art accuracy and occlusion robustness while maintaining real-time inference, demonstrating the value of fusing explicit physical priors into self-supervised encoder training. This method has practical implications for collaborative robotics and human-robot interaction in cluttered or partially occluded environments.

Abstract

Vision-based pose estimation of articulated robots with unknown joint angles has applications in collaborative robotics and human-robot interaction tasks. Current frameworks use neural network encoders to extract image features and downstream layers to predict joint angles and robot pose. While images of robots inherently contain rich information about the robot's physical structure, existing methods often fail to leverage it fully, which limits performance under occlusions and truncations. To address this, we introduce RoboPEPP, a method that fuses information about the robot's physical model into the encoder using a masking-based self-supervised embedding-predictive architecture. Specifically, we mask the robot's joints and pre-train an encoder-predictor model to infer the joints' embeddings from surrounding unmasked regions, enhancing the encoder's understanding of the robot's physical model. The pre-trained encoder-predictor pair, along with joint angle and keypoint prediction networks, is then fine-tuned for pose and joint angle estimation. Random masking of the input during fine-tuning and keypoint filtering during evaluation further improve robustness. Our method, evaluated on several datasets, achieves the best results in robot pose and joint angle estimation while being the least sensitive to occlusions and requiring the lowest execution time.
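To make the joint-masking idea concrete, the following is a minimal sketch of how image patches around the robot's joints might be selected for masking before an encoder-predictor is trained to infer their embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function name `joint_patch_mask`, the 224-pixel image, the 16-pixel patches, and the `radius` parameter are all hypothetical choices.

```python
import numpy as np

def joint_patch_mask(joints_px, image_size=224, patch_size=16, radius=1):
    """Return a boolean (grid, grid) array; True marks patches to mask.

    joints_px: (J, 2) array of (x, y) pixel coordinates of the robot's joints.
    radius: number of neighbouring patches masked around each joint patch
            (hypothetical parameter, not taken from the paper).
    """
    grid = image_size // patch_size
    mask = np.zeros((grid, grid), dtype=bool)
    for x, y in np.asarray(joints_px, dtype=float):
        # Joints that project outside the image (truncation) mask nothing.
        if not (0 <= x < image_size and 0 <= y < image_size):
            continue
        col, row = int(x // patch_size), int(y // patch_size)
        r0, r1 = max(row - radius, 0), min(row + radius + 1, grid)
        c0, c1 = max(col - radius, 0), min(col + radius + 1, grid)
        mask[r0:r1, c0:c1] = True
    return mask

# Two joints inside the image and one outside it.
joints = np.array([[40.0, 40.0], [200.0, 120.0], [300.0, 50.0]])
mask = joint_patch_mask(joints)
context_idx = np.flatnonzero(~mask)  # patches the encoder would see
target_idx = np.flatnonzero(mask)    # patches whose embeddings would be predicted
```

In this sketch the unmasked (context) patches would be fed to the encoder, while the predictor would be trained to regress the embeddings of the masked (target) patches. The target-embedding computation and the loss are omitted because the text above does not specify them.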

Paper Structure

This paper contains 28 sections, 5 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Comparison of an existing robot pose estimation method [hpe] with our RoboPEPP framework. RoboPEPP integrates joint masking-based pre-training (b.1) to enhance the encoder's grasp of the robot's physical model, combined with downstream networks, and keypoint filtering (b.2) to achieve high accuracy.
  • Figure 2: Overview of the RoboPEPP framework for robot pose and joint angle estimation. (a) Joint regions are masked to pre-train an encoder-predictor pair using an embedding predictive architecture. (b) The pre-trained encoder-predictor network is fine-tuned for robot pose estimation with Joint and Keypoint Prediction networks, using random masking during training to enhance occlusion robustness. During evaluation, keypoints are filtered, and a PnP algorithm estimates the robot’s pose from the filtered 2D-3D correspondences.
  • Figure 3: Joint Net: A global average pooling layer aggregates the patch embeddings, $v_1, \dots, v_M$, into $v_g$, which is then iteratively refined using an MLP to estimate the joint angles.
  • Figure 4: The examples show predicted heatmaps for Joint 5 and the End-Effector overlaid on the original image. The End-Effector, being positioned outside the field of view, produces noisy heatmaps with lower confidence (measured by peak values). Heatmap pixel values are normalized for better visualization. The green arrows highlight the peak values for Joint 5 for visual clarity.
  • Figure 5: Qualitative Comparison on Panda Photo (Example 1) and Occlusion (Example 2 and 3) datasets: Predicted poses and joint angles are used to generate a mesh overlaid on the original image, where closer alignment indicates greater accuracy. Highlighted rectangles indicate regions where other methods' meshes misalign, while RoboPEPP achieves high precision.
  • ...and 10 more figures
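
The keypoint filtering mentioned in the captions of Figures 2 and 4 can be sketched as follows, assuming that confidence is taken as the peak value of each predicted heatmap (as Figure 4 describes). The function name `filter_keypoints`, the `threshold` value, and the fallback that keeps the `min_points` most confident keypoints are hypothetical choices for illustration, not details confirmed by the source.

```python
import numpy as np

def filter_keypoints(heatmaps, points_3d, threshold=0.5, min_points=4):
    """Keep keypoints whose heatmap peak value exceeds `threshold`.

    heatmaps: (K, H, W) array, one predicted heatmap per keypoint.
    points_3d: (K, 3) array of the matching 3D keypoints in the robot frame.
    Returns (points_2d, points_3d_kept, kept_indices); points_2d holds the
    (x, y) pixel location of each kept heatmap peak.
    """
    heatmaps = np.asarray(heatmaps, dtype=float)
    points_3d = np.asarray(points_3d, dtype=float)
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    peak_idx = flat.argmax(axis=1)
    peak_val = flat[np.arange(k), peak_idx]  # confidence = peak value
    rows, cols = np.unravel_index(peak_idx, (h, w))
    points_2d = np.stack([cols, rows], axis=1).astype(float)
    keep = peak_val > threshold
    if keep.sum() < min_points:
        # Assumed fallback: keep the most confident keypoints so that a PnP
        # solver still receives enough 2D-3D correspondences.
        keep = np.zeros(k, dtype=bool)
        keep[np.argsort(peak_val)[-min_points:]] = True
    idx = np.flatnonzero(keep)
    return points_2d[idx], points_3d[idx], idx

# Synthetic example: four confident keypoints and one noisy heatmap, standing
# in for a keypoint that lies outside the field of view.
rng = np.random.default_rng(0)
heatmaps = 0.05 * rng.random((5, 32, 32))
for i, (r, c) in enumerate([(4, 7), (10, 20), (16, 3), (25, 25)]):
    heatmaps[i, r, c] = 0.9
points_3d = np.arange(15, dtype=float).reshape(5, 3)
pts2d, pts3d, kept = filter_keypoints(heatmaps, points_3d)
```

The surviving 2D-3D correspondences could then be passed, together with the camera intrinsics, to a PnP solver; OpenCV's `cv2.solvePnP` with the `cv2.SOLVEPNP_EPNP` flag is one option for the EPnP step named in the TL;DR.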