MorphoNavi: Aerial-Ground Robot Navigation with Object Oriented Mapping in Digital Twin

Sausar Karaf, Mikhail Martynov, Oleg Sautenkov, Zhanibek Darush, Dzmitry Tsetserukou

TL;DR

This work tackles open-world navigation for aerial-ground robots using a single monocular camera, removing the need for depth sensors and extensive retraining. It combines monocular depth estimation with a geometric distance model and depth refinements from Depth Anything and Segment Anything to produce semantically rich maps that support high-level planning, visualized through a Unity-based digital twin. In simulated search-and-rescue experiments, the system achieved a 97.4% object-detection rate, a 13.6 cm mean position error, and roughly 7.34 s per image, demonstrating feasibility for cluttered environments with modest computation. The results suggest that the approach can reduce hardware and bandwidth requirements while enabling richer scene understanding and integration with vision-language modules for decision-making in autonomous aerial-ground navigation.

Abstract

This paper presents a novel mapping approach for a universal aerial-ground robotic system utilizing a single monocular camera. The proposed system is capable of detecting a diverse range of objects and estimating their positions without requiring fine-tuning for specific environments. The system's performance was evaluated through a simulated search-and-rescue scenario, where the MorphoGear robot successfully located a robotic dog while an operator monitored the process. This work contributes to the development of intelligent, multimodal robotic systems capable of operating in unstructured environments.
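
The summary above does not spell out the geometric distance model, so the following is only a minimal sketch of how an object position could be recovered from a single image, assuming a standard pinhole camera, a metric depth map (such as a monocular depth network might provide after scaling), and a binary segmentation mask. The function name `object_position` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and are not taken from the paper.

```python
import numpy as np


def object_position(depth_m, mask, fx, fy, cx, cy):
    """Back-project a segmented object to a 3D point in the camera frame.

    Assumptions (not from the paper): depth_m holds metric depth in metres,
    mask is a boolean array of the same shape, and the camera follows the
    pinhole model with focal lengths (fx, fy) and principal point (cx, cy).
    """
    vs, us = np.nonzero(mask)  # pixel rows (v) and columns (u) of the object
    if us.size == 0:
        raise ValueError("mask is empty")

    # The median depth is less sensitive to outliers at the mask boundary.
    z = float(np.median(depth_m[vs, us]))

    # Use the mask centroid as the object's image location.
    u = float(us.mean())
    v = float(vs.mean())

    # Pinhole back-projection: pixel coordinates plus depth to camera frame.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])


if __name__ == "__main__":
    # Synthetic example: a flat scene 2 m away with a small object mask.
    depth = np.full((480, 640), 2.0)
    mask = np.zeros((480, 640), dtype=bool)
    mask[230:251, 420:441] = True
    print(object_position(depth, mask, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
```

Taking the median depth over the mask rather than a single pixel is one simple way to damp noise in the depth estimate. A real pipeline would also need to transform the result from the camera frame into the map frame.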

Paper Structure

This paper contains 24 sections, 1 equation, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Experimental setup. The mission is to overcome obstacles and search for the hidden robot.
  • Figure 2: Aerial-Ground Vehicle MorphoGear.
  • Figure 3: Virtual simulation and visualization for MorphoGear.
  • Figure 4: The system architecture of the mapping pipeline.
  • Figure 5: Accuracy of position estimates.
  • ...and 1 more figure