3D Mapping Using a Lightweight and Low-Power Monocular Camera Embedded inside a Gripper of Limbed Climbing Robots
Taku Okawara, Ryo Nishibe, Mao Kasano, Kentaro Uno, Kazuya Yoshida
TL;DR
This work tackles scale ambiguity in monocular 3D mapping for limbed climbing robots intended for space exploration. It fuses monocular SLAM with limb forward kinematics in a factor-graph framework, estimating both the gripper poses $T_{0:t}$ and the global map scale $s$ in real time. The system embeds a lightweight monocular hand-eye camera directly in the gripper, aligned with the gripper axis, and combines scaled monocular-SLAM factors with limb-kinematics factors to produce metrically scaled terrain maps for autonomous grasping of convex surfaces. Validation includes physics-based simulation and real-world tests, with an average gripper-palm error of $1.8 \pm 0.7$ mm in simulation and successful autonomous grasps in hardware experiments. The method removes the need for RGB-D sensors while still providing accurate, scale-correct 3D maps and autonomous manipulation, which supports lightweight, power-efficient perception for space missions.
Abstract
Limbed climbing robots are designed to explore challenging vertical walls, such as the skylights of the Moon and Mars. In such robots, the primary role of a hand-eye camera is to accurately estimate the 3D positions of graspable points (i.e., convex terrain surfaces) thanks to its close-up views. While conventional climbing robots often employ RGB-D cameras as hand-eye cameras to facilitate straightforward 3D terrain mapping and graspable point detection, RGB-D cameras are large and consume considerable power. This work presents a 3D terrain mapping system designed for space exploration using limbed climbing robots equipped with a monocular hand-eye camera. Compared to RGB-D cameras, monocular cameras are lighter, more compact, and consume less power. Although monocular SLAM can be used to construct 3D maps, it suffers from scale ambiguity. To address this limitation, we propose a SLAM method that fuses monocular visual constraints with limb forward kinematics. The proposed method jointly estimates time-series gripper poses and the global metric scale of the 3D map based on factor graph optimization. We validate the proposed framework through both physics-based simulations and real-world experiments. The results demonstrate that our framework constructs a metrically scaled 3D terrain map in real time and enables autonomous grasping of convex terrain surfaces using a monocular hand-eye camera, without relying on RGB-D cameras. Our method contributes to scalable and energy-efficient perception for future space missions involving limbed climbing robots. See the video summary here: https://youtu.be/fMBrrVNKJfc
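To illustrate why limb forward kinematics can resolve the scale ambiguity, the sketch below recovers a single global scale from two trajectories of the same gripper: one from monocular SLAM (arbitrary units) and one from forward kinematics (metres). It is a simplified closed-form least-squares stand-in, not the factor-graph optimization described above, which estimates poses and scale jointly. The function name `estimate_scale` and the sample trajectories are hypothetical, and the sketch assumes the two trajectories are time-synchronized.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def estimate_scale(slam_positions: Sequence[Vec3],
                   kin_positions: Sequence[Vec3]) -> float:
    """Estimate the metric scale s of a monocular SLAM trajectory.

    Minimizes sum_i (s * |d_slam_i| - |d_kin_i|)^2 over consecutive
    displacements, where d_kin_i comes from forward kinematics (metres)
    and d_slam_i from monocular SLAM (unknown units). Using displacement
    lengths makes the estimate independent of the frame in which each
    trajectory is expressed.
    """
    if len(slam_positions) != len(kin_positions):
        raise ValueError("trajectories must have the same length")

    num = 0.0  # sum of |d_slam_i| * |d_kin_i|
    den = 0.0  # sum of |d_slam_i|^2
    for i in range(1, len(slam_positions)):
        d_slam = math.dist(slam_positions[i], slam_positions[i - 1])
        d_kin = math.dist(kin_positions[i], kin_positions[i - 1])
        num += d_slam * d_kin
        den += d_slam * d_slam

    if den == 0.0:
        raise ValueError("SLAM trajectory has no motion; scale is unobservable")
    return num / den


# Hypothetical data: the SLAM map is 20x larger than the metric trajectory.
slam = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
kin = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (0.05, 0.10, 0.0)]
scale = estimate_scale(slam, kin)
```

Multiplying SLAM map points by `scale` would put them in metres under these assumptions. The check for zero SLAM motion reflects a general property of this kind of estimate: the scale cannot be observed until the gripper has actually moved.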
