GISR: Geometric Initialization and Silhouette-based Refinement for Single-View Robot Pose and Configuration Estimation

Ivan Bilić, Filip Marić, Fabio Bonsignorio, Ivan Petrović

TL;DR

The paper introduces GISR, a real-time framework that jointly estimates the 6D camera-to-robot pose and the robot configuration from a single RGB image. It combines a geometric initialization module (GIM), built on a differentiable EDM-based pipeline, with a silhouette-based refinement module (RM) that iteratively updates the pose and configuration using a fast silhouette renderer. Training optimizes both configuration and pose losses, and the RM learns to correct errors introduced by the initialization. The reported runtime is around $40$ ms, and the method offers a better speed-accuracy trade-off than dense RGB methods. Experiments on Panda-3Cam show strong generalization and performance competitive with state-of-the-art methods, including those that require ground-truth proprioception. The approach highlights the benefits of combining geometric priors with efficient silhouette-based refinement for online, dynamic robotics scenarios, and points toward an extension to robots with unknown kinematics.

Abstract

In autonomous robotics, measurement of the robot's internal state and perception of its environment, including interaction with other agents such as collaborative robots, are essential. Estimating the pose of the robot arm from a single view has the potential to replace classical eye-to-hand calibration approaches and is particularly attractive for online estimation and dynamic environments. In addition to its pose, recovering the robot configuration provides a complete spatial understanding of the observed robot that can be used to anticipate the actions of other agents in advanced robotics use cases. Furthermore, this additional redundancy enables the planning and execution of recovery protocols in case of sensor failures or external disturbances. We introduce GISR, a deep configuration and robot-to-camera pose estimation method that prioritizes real-time execution. GISR consists of two modules: (i) a geometric initialization module that efficiently computes an approximate robot pose and configuration, and (ii) a deep iterative silhouette-based refinement module that arrives at a final solution in just a few iterations. We evaluate GISR on publicly available data and show that it outperforms existing methods of the same class in terms of both speed and accuracy, and can compete with approaches that rely on ground-truth proprioception and recover only the pose.

Paper Structure (16 sections, 11 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: GISR takes an input RGB image of the robot (left) and outputs an estimate of both the camera-to-robot pose and the configuration of the robot. The corresponding skeleton is projected onto the input image and overlaid (right).
  • Figure 2: System overview. The geometric initialization module (GIM) takes an input RGB image and produces initial estimates of the robot pose and configuration. The refinement module (RM) uses these estimates to generate a corresponding silhouette image, which is fed to the refiner along with the segmented input image to predict an update. This render-and-update process can be repeated, but each iteration requires a forward pass of a deep model (including the update and rendering).
  • Figure 3: Qualitative results for pose and configuration estimation; input image (first row), segmented input image (second row), rendered silhouette of an initial estimate (third row) and a projection of the skeleton reflecting the final estimates (last row).
  • Figure 4: AUC score as a function of training data size for different initialization schemes; ($\blacktriangle$) no prior information, ($\bullet$) initializing scale using 2D keypoints and known robot DH parameters, and ($\bigstar$) initializing configuration, scale, and rotation, which amounts to full use of the GIM.
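
The render-and-update loop described in the Figure 2 caption can be illustrated with a small toy sketch. None of this is the authors' implementation: the state is reduced to two numbers (a center and a half-width), `render_silhouette` is a hypothetical 1D stand-in for the silhouette renderer, and `toy_refiner` is a hand-written stand-in for the learned refiner. The damping factor `step` is also an assumption, added so that several iterations are needed.

```python
import numpy as np


def render_silhouette(state, size=64):
    """Toy renderer (assumption): 1D binary mask covering
    [center - half_width, center + half_width]."""
    center, half_width = state
    xs = np.arange(size)
    return (np.abs(xs - center) <= half_width).astype(float)


def toy_refiner(observed_mask, rendered_mask):
    """Hypothetical stand-in for the learned refiner: compares the observed
    and rendered masks and predicts an additive update to the state."""
    xs = np.arange(observed_mask.size)

    def centroid_and_half_width(mask):
        area = mask.sum()
        centroid = (xs * mask).sum() / area
        return centroid, (area - 1) / 2.0

    c_obs, w_obs = centroid_and_half_width(observed_mask)
    c_ren, w_ren = centroid_and_half_width(rendered_mask)
    return np.array([c_obs - c_ren, w_obs - w_ren])


def refine(observed_mask, initial_state, num_iters=3, step=0.5):
    """Render-and-update: render the current estimate, predict an update
    from the (observed, rendered) pair, apply it, and repeat."""
    state = np.asarray(initial_state, dtype=float)
    for _ in range(num_iters):
        rendered = render_silhouette(state, observed_mask.size)
        state = state + step * toy_refiner(observed_mask, rendered)
    return state


if __name__ == "__main__":
    target = np.array([40.0, 10.0])    # "ground truth" used to make the observation
    observed = render_silhouette(target)
    initial = np.array([30.0, 6.0])    # plays the role of the initial estimate
    print(refine(observed, initial, num_iters=3))
```

In this toy setting each pass shrinks the gap between the estimate and the state that produced the observed mask. That mirrors the role the caption assigns to the RM, where every iteration costs one rendering step and one forward pass of the refiner.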