Real-time Holistic Robot Pose Estimation with Unknown States

Shikun Ban; Juling Fan; Xiaoxuan Ma; Wentao Zhu; Yu Qiao; Yizhou Wang

TL;DR

This work tackles monocular holistic robot pose estimation when internal joint states are unknown. It introduces a modular, end-to-end framework composed of DepthNet, JointNet, RotationNet, and KeypointNet that predicts camera-to-robot rotation, joint states, root depth, and root-relative 3D keypoints, all in a single feed-forward pass. DepthNet disentangles depth from camera intrinsics, KeypointNet provides pixel-aligned 3D keypoints, and differentiable forward kinematics fuse these estimates into accurate 3D poses; self-supervision further enhances sim-to-real generalization. The approach achieves state-of-the-art accuracy while delivering a $12\times$ speedup over iterative Render-and-Compare methods, enabling real-time holistic robot pose estimation for diverse morphologies and real-world scenarios.

Abstract

Estimating robot pose from RGB images is a crucial problem in computer vision and robotics. While previous methods have achieved promising performance, most of them presume full knowledge of the robot's internal states, e.g., ground-truth joint angles. This assumption does not always hold in practice: in real-world applications such as multi-robot collaboration or human-robot interaction, the robot joint states might not be shared or could be unreliable. Meanwhile, existing approaches that estimate robot pose without joint state priors carry a heavy computational burden and thus cannot support real-time applications. This work introduces an efficient framework for real-time robot pose estimation from RGB images without requiring known robot states. Our method estimates camera-to-robot rotation, robot state parameters, keypoint locations, and root depth, employing a neural network module for each task to facilitate learning and sim-to-real transfer. Notably, it performs inference in a single feed-forward pass without iterative optimization. Our approach offers a $12\times$ speedup with state-of-the-art accuracy, enabling real-time holistic robot pose estimation for the first time. Code and models are available at https://github.com/Oliverbansk/Holistic-Robot-Pose-Estimation.
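To make the fusion of the four predicted quantities concrete, here is a minimal sketch of how they could be combined into absolute 3D keypoints. It is not the authors' implementation. The toy planar two-link arm, the pinhole intrinsics `K`, the representation of root-relative keypoints as pixel coordinates plus a depth offset from the root, and the function names `forward_kinematics`, `to_camera_frame`, and `lift_root_relative` are all assumptions made for illustration.

```python
import numpy as np

def forward_kinematics(q, link_lengths=(0.3, 0.2)):
    """Toy planar 2-link arm (hypothetical): joint angles -> keypoints in the robot frame."""
    a1, a2 = q[0], q[0] + q[1]
    l1, l2 = link_lengths
    root = np.zeros(3)
    elbow = root + l1 * np.array([np.cos(a1), np.sin(a1), 0.0])
    tip = elbow + l2 * np.array([np.cos(a2), np.sin(a2), 0.0])
    return np.stack([root, elbow, tip])  # shape (3 keypoints, 3 coords)

def to_camera_frame(points_robot, R, t):
    """Rigid transform of each keypoint: P = R @ p + t."""
    return points_robot @ R.T + t

def lift_root_relative(uv, rel_depth, root_depth, K):
    """Back-project pixel keypoints using depth = root_depth + rel_depth (pinhole model)."""
    z = root_depth + rel_depth                      # absolute depth per keypoint
    ones = np.ones((uv.shape[0], 1))
    rays = np.hstack([uv, ones]) @ np.linalg.inv(K).T  # rays with unit z-component
    return rays * z[:, None]

# Assumed camera intrinsics and an assumed "true" pose for the toy arm.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
q = np.array([0.4, -0.7])                           # joint states
c, s = np.cos(0.3), np.sin(0.3)
R = np.array([[1.0, 0.0, 0.0],
              [0.0, c, -s],
              [0.0, s, c]])                         # camera-to-robot rotation
t = np.array([0.1, -0.05, 1.5])                     # camera-to-robot translation

# Route 1: joint states + rotation + translation through forward kinematics.
P_fk = to_camera_frame(forward_kinematics(q), R, t)

# Simulate what keypoint and depth predictors would output for this pose.
proj = P_fk @ K.T
uv = proj[:, :2] / proj[:, 2:3]                     # pixel-aligned keypoints
root_depth = P_fk[0, 2]                             # depth of the root keypoint
rel_depth = P_fk[:, 2] - root_depth                 # depth relative to the root

# Route 2: lift root-relative keypoints with the root depth.
P_lifted = lift_root_relative(uv, rel_depth, root_depth, K)
t_est = P_lifted[0]                                 # root sits at the robot origin in this toy arm
```

In this toy setup the two routes agree exactly, and the lifted root keypoint recovers the translation, because the simulated predictions are noise-free.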

Paper Structure (25 sections, 16 equations, 9 figures, 11 tables)

This paper contains 25 sections, 16 equations, 9 figures, 11 tables.

Introduction
Related Work
Hand-eye calibration
Image-based robot pose estimation
Method
Overview
Architecture
DepthNet
JointNet
RotationNet
KeypointNet
Training loss
Ground-truth supervision
Self-supervision
Experiments
...and 10 more sections

Figures (9)

Figure 1: The majority of previous robot pose estimation methods assume known robot joint states and focus solely on estimating the camera-to-robot pose, i.e. camera-to-robot rotation and translation. In contrast, the holistic robot pose estimation problem requires estimating both joint states and the camera-to-robot pose, given only an RGB image without known joint states. For holistic robot pose estimation, RoboPose (Labbé et al., 2021) uses costly test-time optimization (Render-and-Compare) to iteratively refine the predictions, whereas our feed-forward method achieves state-of-the-art accuracy with a $12\times$ speed boost.
Figure 2: Framework overview. The JointNet and the RotationNet regress joint state parameters $\mathbf{q}$ and camera-to-robot rotation $\mathbf{R}$, respectively. The KeypointNet estimates root-relative 3D keypoint locations $\mathbf{P}^{r}$. The DepthNet's estimate of root depth $d$ is combined with $\mathbf{P}^{r}$ to obtain absolute 3D keypoint locations $\mathbf{P}'$ and camera-to-robot translation $\mathbf{t}$. Joint state parameters $\mathbf{q}$, rotation $\mathbf{R}$, and translation $\mathbf{t}$ are used to compute 3D keypoint locations $\mathbf{P}$ via forward kinematics.
Figure 3: Comparison of ADD distributions on the real-world datasets between our approach and RoboPose (Labbé et al., 2021). The y-axis represents the accuracy of our estimation at different ADD thresholds.
Figure 4: Qualitative comparison between our method and RoboPose (Labbé et al., 2021) on both real and synthetic datasets.
Figure 5: Ablation studies on network modules and self-supervision.
...and 4 more figures
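
The Figure 3 caption refers to accuracy at different ADD thresholds. As a rough sketch of how such a curve could be computed, the snippet below assumes the common definition of ADD as the mean Euclidean distance between predicted and ground-truth 3D keypoints per image. The function names, the synthetic data, and the threshold values are hypothetical and are not taken from the paper.

```python
import numpy as np

def add_per_image(pred, gt):
    """pred, gt: arrays of shape (N images, K keypoints, 3). Returns (N,) mean keypoint distance."""
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)

def accuracy_at_thresholds(add, thresholds):
    """Fraction of images whose ADD is at or below each threshold."""
    add = np.asarray(add)[:, None]
    thr = np.asarray(thresholds)[None, :]
    return (add <= thr).mean(axis=0)

# Synthetic example: 4 images, 3 keypoints each, with a known per-image offset in metres.
gt = np.zeros((4, 3, 3))
offsets = np.array([0.01, 0.03, 0.06, 0.20])
pred = gt.copy()
pred[..., 0] += offsets[:, None]        # shift every keypoint along x by the image's offset

add = add_per_image(pred, gt)
thresholds = np.array([0.02, 0.05, 0.10])
acc = accuracy_at_thresholds(add, thresholds)
```

Plotting `acc` against `thresholds` gives one accuracy curve per method; the area under that curve is a common single-number summary.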
