CeRLP: A Cross-embodiment Robot Local Planning Framework for Visual Navigation

Haoyu Xi, Mingao Tan, Xinming Zhang, Siwei Cheng, Shanze Wang, Yin Gu, Xiaoyu Shen, Wei Zhang

Abstract

Visual navigation for cross-embodiment robots is challenging due to variations in robot and camera configurations, which can lead to the failure of navigation tasks. Previous approaches typically rely on collecting massive datasets across different robots, which is highly data-intensive, or fine-tuning models, which is time-consuming. Furthermore, both methods often lack explicit consideration of robot geometry. In this paper, we propose a Cross-embodiment Robot Local Planning (CeRLP) framework for general visual navigation, which abstracts visual information into a unified geometric formulation and applies to heterogeneous robots with varying physical dimensions, camera parameters, and camera types. CeRLP introduces a depth estimation scale correction method that utilizes offline pre-calibration to resolve the scale ambiguity of monocular depth estimation, thereby recovering precise metric depth images. Furthermore, CeRLP designs a visual-to-scan abstraction module that projects varying visual inputs into height-adaptive laser scans, making the policy robust to heterogeneous robots. Experiments in simulation environments demonstrate that CeRLP outperforms comparative methods, validating its robust obstacle avoidance capabilities as a local planner. Additionally, extensive real-world experiments verify the effectiveness of CeRLP in tasks such as point-to-point navigation and vision-language navigation, demonstrating its generalization across varying robot and camera configurations.
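The scale correction step can be illustrated with a small worked example. The Figure 2 caption says that offline calibration uses a nearby and a distant ArUco marker to obtain a scale factor and a disparity shift. The sketch below assumes the common affine-invariant model for relative depth networks, in which metric inverse depth is a linear function of the predicted relative disparity, $1/z = s\,d + t$. That model, the function names, and the sample numbers are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def fit_scale_shift(d_near, d_far, z_near, z_far):
    """Solve 1/z = s*d + t from two (relative disparity, metric depth) pairs.

    d_near, d_far: relative disparities predicted at the two markers (assumed inputs).
    z_near, z_far: known metric distances to the markers, in metres.
    """
    if np.isclose(d_near, d_far):
        raise ValueError("the two markers must yield different disparities")
    s = (1.0 / z_near - 1.0 / z_far) / (d_near - d_far)  # scale factor
    t = 1.0 / z_near - s * d_near                        # disparity shift
    return s, t


def to_metric_depth(rel_disp, s, t, eps=1e-6):
    """Apply the calibrated affine map, then invert to get depth in metres."""
    inv_depth = s * np.asarray(rel_disp, dtype=float) + t
    return 1.0 / np.maximum(inv_depth, eps)  # clamp to avoid division by zero


# Hypothetical calibration: markers at 0.5 m and 4.0 m.
s, t = fit_scale_shift(d_near=0.9, d_far=0.2, z_near=0.5, z_far=4.0)
depth = to_metric_depth(np.array([0.9, 0.2]), s, t)
```

Two markers are the minimum needed to determine both unknowns. With more marker observations, a least-squares fit over all pairs would be a natural extension.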

Paper Structure (33 sections, 23 equations, 11 figures, 3 tables, 2 algorithms)

Figures (11)

  • Figure 1: CeRLP addresses the challenge of visual navigation across heterogeneous robots with varying physical dimensions and camera configurations. By transforming diverse visual inputs into a unified virtual laser scan and modeling the robot as a generalized cuboid, the framework effectively achieves obstacle avoidance in unseen environments without any fine-tuning.
  • Figure 2: The overall framework of CeRLP. The system abstracts visual observations from heterogeneous robots into a unified geometric representation. Heterogeneous robots with differing intrinsic parameters, extrinsic parameters, and physical dimensions capture RGB images with their cameras. Scale calibration is an offline process that uses distant and nearby ArUco markers to calculate the scale factor and disparity shift, thereby achieving scale recovery. Depth estimation employs a pre-trained Depth Anything V2 model to predict relative depth images. Scale correction converts the relative depth image into a metric depth image during inference, and this metric depth image is then transformed into a height-adaptive virtual laser scan. The dimension-configurable policy uses this unified laser scan, the goal position $\mathbf{g}_{\text{t}}$, and robot state information to generate safe control commands, achieving zero-shot transfer across embodiments. The robot state information contains the physical dimensions $\mathbf{body}$, the current velocity $\mathbf{v}_{\text{t}}$, and the velocity and acceleration limits $\mathbf{L}_{\text{dyn}}$. Furthermore, this module can receive high-level commands, functioning as the lower-level obstacle avoidance planner for VLN.
  • Figure 3: Example of test environment in simulation.
  • Figure 4: Visualization of CeRLP trajectories in BARN test environments. In (a) and (b), the maximum linear velocity was $0.5 \text{ m/s}$. In (c), the maximum linear velocity was $1 \text{ m/s}$.
  • Figure 5: Differential-Drive Mobile Robots (DMR). DMR1-DMR6 were wheeled robots based on the TurtleBot 2 chassis, varying in physical dimension, camera type, and camera position. DMR7 was a Unitree Go2 quadruped robot equipped with a WHEELTEC C100 camera.
  • ...and 6 more figures
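
The visual-to-scan abstraction described in the Figure 2 caption can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes an undistorted pinhole camera mounted level (no pitch or roll) at a known height, keeps only points whose height above the ground could collide with the robot body, and takes the nearest such point per angular bin. The function name, parameters, and default values are hypothetical.

```python
import numpy as np


def depth_to_scan(depth, fx, fy, cx, cy, cam_height, robot_height,
                  min_height=0.05, n_beams=64, max_range=10.0):
    """Project a metric depth image into a planar virtual laser scan.

    Camera frame (assumed): x right, y down, z forward, camera level.
    Beam 0 is the rightmost bearing; the index increases towards the left.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx            # lateral offset, metres
    y = (v - cy) * z / fy            # downward offset, metres
    height = cam_height - y          # point height above the ground

    # Height-adaptive filter: drop the floor and anything above the robot.
    valid = (np.isfinite(z) & (z > 0)
             & (height > min_height) & (height < robot_height))
    rng = np.hypot(x, z)[valid]      # planar range to each point
    ang = np.arctan2(-x, z)[valid]   # bearing, left positive

    half_fov = np.arctan2(w / 2.0, fx)  # approximate horizontal half-FOV
    idx = ((ang + half_fov) / (2.0 * half_fov) * n_beams).astype(int)
    inside = (idx >= 0) & (idx < n_beams) & (rng < max_range)

    scan = np.full(n_beams, max_range)
    np.minimum.at(scan, idx[inside], rng[inside])  # nearest hit per beam
    return scan


# Hypothetical input: a flat wall 2 m ahead, seen by a 64x48 camera.
wall = np.full((48, 64), 2.0)
scan = depth_to_scan(wall, fx=50.0, fy=50.0, cx=32.0, cy=24.0,
                     cam_height=0.3, robot_height=0.5)
# A robot shorter than min_height has no collidable points, so the scan is empty.
empty = depth_to_scan(wall, fx=50.0, fy=50.0, cx=32.0, cy=24.0,
                      cam_height=0.3, robot_height=0.04)
```

Because the height band depends on `robot_height`, the same depth image yields different scans for robots of different heights, which is one way to read "height-adaptive" in the caption.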