CtRNet-X: Camera-to-Robot Pose Estimation in Real-world Conditions Using a Single Camera

Jingpei Lu, Zekai Liang, Tristin Xie, Florian Richter, Shan Lin, Sainan Liu, Michael C. Yip

TL;DR

This work proposes a novel framework that estimates the robot pose when the manipulator is only partially visible. It leverages Vision-Language Models for fine-grained robot component detection and integrates that detection into a keypoint-based pose estimation network, which enables more robust performance in varied operational conditions.

Abstract

Camera-to-robot calibration is crucial for vision-based robot control, and making it accurate requires effort. Recent advances in markerless pose estimation have eliminated the need for time-consuming physical setups for camera-to-robot calibration. Although existing markerless methods achieve impressive accuracy without cumbersome setups, they assume that all robot joints are visible within the camera's field of view. In practice, robots usually move in and out of view, and some portion of the robot may stay out of frame for the whole manipulation task due to real-world constraints. The resulting lack of visual features causes these approaches to fail. To address this challenge and broaden applicability to vision-based robot control, we propose a novel framework that estimates the robot pose when the manipulator is only partially visible. Our approach leverages Vision-Language Models for fine-grained robot component detection and integrates this detection into a keypoint-based pose estimation network, which enables more robust performance in varied operational conditions. The framework is evaluated on both public robot datasets and self-collected partial-view datasets to demonstrate its robustness and generalizability. As a result, the method is effective for robot pose estimation in a wider range of real-world manipulation scenarios.

Paper Structure

This paper contains 14 sections, 7 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: In real-world robot manipulation scenarios, the camera does not always capture all the robot links, and the visibility of the links changes over time. Our method leverages the limited visual features available within the camera view and achieves state-of-the-art performance on robot pose estimation.
  • Figure 2: Sample images from the DROID robot learning dataset (khazatsky2024droid). Often only certain parts of the robot are visible in the camera view, and sometimes none are.
  • Figure 3: Model inference pipeline. CtRNet-X estimates the camera-to-robot transform given the images and the corresponding joint angles. The framework uses a set of structured prompts and a fine-tuned CLIP model to detect which robot parts are visible, and dynamically adjusts the keypoint selection accordingly. The keypoint detector outputs 2D keypoints, and the corresponding 3D keypoints are obtained from the robot forward kinematics. Finally, a PnP solver estimates the camera-to-robot transformation matrix from the selected keypoint correspondences.
  • Figure 4: Qualitative results of our method on the real-world manipulation dataset DROID (khazatsky2024droid). The first row shows the raw image frames, the second row shows the robot masks rendered from the pose estimated by the original CtRNet (orange), and the third row shows the robot masks rendered from the pose estimated by CtRNet-X (blue). CtRNet fails under these real-world conditions, whereas our method exhibits greater generalizability.
  • Figure 5: Qualitative results on the Panda manipulation dataset. The first row shows robot masks rendered using the ground-truth extrinsic calibration (green), and the second row shows robot masks rendered using the pose from CtRNet-X (blue).
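
The last two stages of the pipeline in the Figure 3 caption (visibility-driven keypoint selection, then PnP on the selected 2D-3D correspondences) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The part names, the keypoint-to-part mapping, the 3D keypoint coordinates, and the camera intrinsics are all hypothetical. The visibility dictionary stands in for the output of the fine-tuned CLIP detector. A plain Direct Linear Transform (DLT) solver is swapped in for the PnP solver, which the source does not name; it needs at least six non-coplanar points.

```python
import numpy as np

# Hypothetical mapping from each keypoint index to the robot part it lies on.
KEYPOINT_PART = ["base", "base", "arm", "arm", "arm", "arm",
                 "gripper", "gripper", "gripper"]


def select_visible(part_visible, keypoint_part=KEYPOINT_PART):
    """Return indices of keypoints whose robot part was detected as visible."""
    return [i for i, part in enumerate(keypoint_part)
            if part_visible.get(part, False)]


def project(points_3d, R, t, K):
    """Pinhole projection of robot-frame 3D points into pixel coordinates."""
    cam = points_3d @ R.T + t        # robot frame -> camera frame
    uv = cam[:, :2] / cam[:, 2:3]    # perspective division
    return uv @ K[:2, :2].T + K[:2, 2]


def solve_pnp_dlt(points_3d, pixels, K):
    """Estimate (R, t) from >= 6 non-coplanar 2D-3D correspondences via DLT."""
    if len(points_3d) < 6:
        raise ValueError("DLT needs at least 6 correspondences")
    ones = np.ones((len(pixels), 1))
    # Normalise pixels with the intrinsics so the unknown is [R | t] only.
    norm = (np.linalg.inv(K) @ np.hstack([pixels, ones]).T).T
    Xh = np.hstack([points_3d, ones])
    rows = []
    for (u, v, _), X in zip(norm, Xh):
        rows.append(np.concatenate([X, np.zeros(4), -u * X]))
        rows.append(np.concatenate([np.zeros(4), X, -v * X]))
    # The null vector of the stacked system is the pose up to scale and sign.
    P = np.linalg.svd(np.asarray(rows))[2][-1].reshape(3, 4)
    if np.linalg.det(P[:, :3]) < 0:
        P = -P                       # pick the sign that gives a proper rotation
    U, S, Vt = np.linalg.svd(P[:, :3])
    R = U @ Vt                       # closest rotation matrix to the 3x3 block
    t = P[:, 3] / S.mean()           # undo the arbitrary scale of the null vector
    return R, t


# Hypothetical camera intrinsics and ground-truth camera-to-robot pose.
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
t_true = np.array([0.1, -0.2, 1.5])

# Hypothetical 3D keypoints; in the pipeline these come from forward kinematics.
keypoints_3d = np.array([
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.05], [0.0, 0.1, 0.3],
    [0.1, 0.1, 0.5], [-0.1, 0.05, 0.6], [0.2, -0.1, 0.7],
    [0.15, 0.2, 0.8], [-0.05, -0.15, 0.9], [0.05, 0.25, 1.0],
])
# Synthetic stand-in for the keypoint detector's 2D output.
keypoints_2d = project(keypoints_3d, R_true, t_true, K)

# Stand-in for the part detector's output: the base is out of frame.
part_visible = {"base": False, "arm": True, "gripper": True}
visible = select_visible(part_visible)

R_est, t_est = solve_pnp_dlt(keypoints_3d[visible], keypoints_2d[visible], K)
reproj = project(keypoints_3d[visible], R_est, t_est, K) - keypoints_2d[visible]
print("visible keypoints:", visible)
print("max reprojection error (px):", np.abs(reproj).max())
```

Dropping keypoints on parts judged out of frame keeps unreliable detections out of the pose solve, at the cost of fewer correspondences. With noisy detections, a robust or iterative PnP solver would typically replace the plain DLT used here.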