3D Mapping Using a Lightweight and Low-Power Monocular Camera Embedded inside a Gripper of Limbed Climbing Robots

Taku Okawara, Ryo Nishibe, Mao Kasano, Kentaro Uno, Kazuya Yoshida

TL;DR

This work tackles scale ambiguity in monocular 3D mapping for space-roving limbed climbers by fusing monocular SLAM with limb forward kinematics in a factor-graph framework, estimating both the gripper poses $T_{0:t}$ and the global map scale $s$ in real time. The system embeds a lightweight monocular hand-eye camera directly in the gripper, aligned with the gripper axis, and uses scaled monocular-SLAM and limb-kinematics factors to produce metrically scaled terrain maps for autonomous grasping of convex surfaces. Validation includes physics-based simulation and real-world tests, demonstrating an average gripper-palm error of $1.8 \pm 0.7$ mm in simulation and successful autonomous grasps in hardware experiments, underscoring the approach's potential for energy-efficient perception in space missions. The method advances lightweight, power-efficient perception for limbed climbing robots by eliminating the need for RGB-D sensors while maintaining accurate, scale-correct 3D maps and autonomous manipulation capabilities.

Abstract

Limbed climbing robots are designed to explore challenging vertical walls, such as the skylights of the Moon and Mars. In such robots, the primary role of a hand-eye camera is to accurately estimate 3D positions of graspable points (i.e., convex terrain surfaces) thanks to its close-up views. While conventional climbing robots often employ RGB-D cameras as hand-eye cameras to facilitate straightforward 3D terrain mapping and graspable point detection, RGB-D cameras are large and consume considerable power. This work presents a 3D terrain mapping system designed for space exploration using limbed climbing robots equipped with a monocular hand-eye camera. Compared to RGB-D cameras, monocular cameras are lighter, more compact, and consume less power. Although monocular SLAM can be used to construct 3D maps, it suffers from scale ambiguity. To address this limitation, we propose a SLAM method that fuses monocular visual constraints with limb forward kinematics. The proposed method jointly estimates time-series gripper poses and the global metric scale of the 3D map based on factor graph optimization. We validate the proposed framework through both physics-based simulations and real-world experiments. The results demonstrate that our framework constructs a metrically scaled 3D terrain map in real time and enables autonomous grasping of convex terrain surfaces using a monocular hand-eye camera, without relying on RGB-D cameras. Our method contributes to scalable and energy-efficient perception for future space missions involving limbed climbing robots. See the video summary here: https://youtu.be/fMBrrVNKJfc
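To give a feel for why limb forward kinematics can resolve the scale ambiguity, the following is a minimal sketch and not the paper's factor-graph formulation. It assumes we already have matching gripper positions from an unscaled monocular SLAM trajectory and from forward kinematics (in metres), and it recovers a single global scale by closed-form least squares on the displacement lengths between consecutive poses. The function name `estimate_scale` and the synthetic trajectory are hypothetical, introduced only for illustration.

```python
import numpy as np


def estimate_scale(slam_positions, kin_positions):
    """Least-squares global scale s that maps unscaled SLAM motion to metric motion.

    Minimizes sum_i (s * |d_slam_i| - |d_kin_i|)^2 over consecutive displacements,
    which gives s = sum(|d_slam_i| * |d_kin_i|) / sum(|d_slam_i|^2).
    Using displacement lengths makes the estimate independent of how the
    SLAM frame is rotated or translated relative to the robot frame.
    """
    slam = np.asarray(slam_positions, dtype=float)
    kin = np.asarray(kin_positions, dtype=float)
    if slam.shape != kin.shape or slam.ndim != 2 or slam.shape[0] < 2:
        raise ValueError("need two matching (N, 3) position arrays with N >= 2")

    # Lengths of the motion between consecutive poses in each trajectory.
    d_slam = np.linalg.norm(np.diff(slam, axis=0), axis=1)
    d_kin = np.linalg.norm(np.diff(kin, axis=0), axis=1)

    denom = float(np.sum(d_slam ** 2))
    if denom < 1e-12:
        # Without camera motion the scale cannot be observed.
        raise ValueError("SLAM trajectory has no motion; scale is unobservable")
    return float(np.sum(d_slam * d_kin) / denom)


if __name__ == "__main__":
    # Synthetic example: metric gripper path from (assumed) forward kinematics.
    kin = np.array([
        [0.00, 0.00, 0.00],
        [0.01, 0.00, 0.00],
        [0.01, 0.02, 0.00],
        [0.03, 0.02, 0.01],
    ])
    true_scale = 0.05
    # The unscaled SLAM path is the same motion expressed in arbitrary units.
    slam = kin / true_scale
    print("estimated scale:", estimate_scale(slam, kin))
```

Multiplying the unscaled map points by the estimated scale would then yield metric coordinates. A joint factor-graph optimization, as described in the abstract, additionally refines the gripper poses and weights each constraint by its uncertainty, which this closed-form sketch does not attempt.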

Paper Structure

This paper contains 16 sections, 10 equations, 7 figures.

Figures (7)

  • Figure 1: (a) Limbed climbing robot exploring vertical walls such as the skylights of the Moon and Mars. (b) Although a monocular hand-eye camera enables efficient 3D mapping, (c) it inherently suffers from scale ambiguity—resulting in visually consistent 3D maps that lack metric scale information due to the limitations of monocular cameras.
  • Figure 2: Comparison between proposed and previous hand-eye camera mounting strategies. (a) In the proposed design, the monocular camera is embedded such that its optical axis aligns with the gripper axis, enabling accurate perception for grasping. (b) In previous designs (HubRobo [uno2021hubrobo], ReachBot [chen2024locomotion]), the RGB-D camera is mounted obliquely, resulting in a misalignment between the optical and gripper axes.
  • Figure 3: Overview of the proposed framework. To construct the scaled 3D map from the unscaled 3D map, we jointly estimate the 3D map's scale $s$ and the gripper poses $\bm{T}_{0:t}$ by fusing monocular camera-based constraints (scaled monocular SLAM factor) and limb forward kinematics-based constraints (limb kinematics factor) based on factor graph optimization. Finally, we control the limb and its gripper to grasp the graspable point (convex terrain) based on the scaled 3D map.
  • Figure 4: Testbed used for (a) real-world experiment and (b) simulation experiment.
  • Figure 5: Snapshots of our framework in the simulation experiment. The scaled 3D terrain surface was constructed by our SLAM method, and the gripper reached the graspable point extracted from the 3D map.
  • ...and 2 more figures