Self-supervised cost of transport estimation for multimodal path planning

Vincent Gherold, Ioannis Mandralis, Eric Sihite, Adarsh Salagame, Alireza Ramezani, Morteza Gharib

TL;DR

This work introduces a self-supervised RGB-D pipeline to estimate pixel-wise cost of transport (COT) for multimodal robots, enabling energy-aware path planning. By projecting COT predictions into Bird's Eye View (BEV) maps and fusing local maps into a global traversability representation, the approach supports real-time navigation on constrained hardware such as the Nvidia Jetson Orin Nano. The method combines self-supervised label generation based on trajectory-derived COT, SAM-based augmentation, and an autoencoder confidence mechanism, and selects AsymFormer as the best-performing model for its combination of low MSE and efficient inference. In practice, the framework demonstrates energy-efficient routing via A* in real-world terrains, highlighting its potential to unlock multimodal robots’ navigation and exploration capabilities.

Abstract

Autonomous robots operating in real environments are often faced with decisions on how best to navigate their surroundings. In this work, we address a particular instance of this problem: how can a robot autonomously decide on the energetically optimal path to follow given a high-level objective and information about the surroundings? To tackle this problem, we developed a self-supervised learning method that allows the robot to estimate the cost of transport of its surroundings using only vision inputs. We apply our method to the multi-modal mobility morphobot (M4), a robot that can drive, fly, segway, and crawl through its environment. By deploying our system in the real world, we show that our method accurately assigns different costs of transport to different types of terrain (e.g., grass versus smooth road). We also highlight the low computational cost of our method, which is deployed on an Nvidia Jetson Orin Nano robotic compute unit. We believe that this work will allow multi-modal robotic platforms to unlock their full potential for navigation and exploration tasks.
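
For readers unfamiliar with the quantity being estimated, cost of transport is conventionally defined as the energy spent per unit weight per unit distance, COT = E / (m g d), which makes it dimensionless. The sketch below computes it from logged power samples along a trajectory segment. The function name, the fixed-rate sampling layout, and the example numbers (including the 6 kg mass) are illustrative assumptions; this is not the paper's label-generation code, which is not reproduced in this summary.

```python
from typing import Sequence

G = 9.81  # gravitational acceleration in m/s^2


def cost_of_transport(power_w: Sequence[float], dt_s: float,
                      mass_kg: float, distance_m: float) -> float:
    """Return the dimensionless COT = E / (m * g * d) for one trajectory segment.

    power_w    : electrical power samples in watts, taken every dt_s seconds
    mass_kg    : robot mass (hypothetical value in the example below)
    distance_m : distance travelled while the samples were logged
    """
    if mass_kg <= 0 or distance_m <= 0 or dt_s <= 0:
        raise ValueError("mass, distance and sampling period must be positive")
    # Rectangle-rule integration of power over time gives energy in joules.
    energy_j = sum(power_w) * dt_s
    return energy_j / (mass_kg * G * distance_m)


# Example: a constant 100 W draw for 10 s while covering 10 m with a 6 kg robot.
cot = cost_of_transport([100.0] * 10, dt_s=1.0, mass_kg=6.0, distance_m=10.0)
print(f"COT = {cot:.2f}")
```

A lower value means less energy per metre travelled, which is why a smooth road would be expected to score lower than grass for a wheeled mode.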

Paper Structure

This paper contains 18 sections, 8 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Close-up view of our robot platform, the M4 robot, capable of multiple modes of locomotion, including driving, flying, and walking. The robot is equipped with an RGB-D camera along with an embedded companion computer.
  • Figure 2: Results of our self-supervised COT. The left image shows the terrain that the robot will encounter, composed of road, grass, and vegetation. The middle image displays the trajectory taken by the robot, while the right image presents the resulting COT map, colored using a colormap ranging from 0.5 to 2. On this COT map, the A* algorithm has been applied to illustrate how COT influences the optimal path to minimize energy consumption along the trajectory. The red path represents the most efficient route, with a total aggregate COT of 817 over a distance of 59 meters. In contrast, the yellow path is a suboptimal solution with a COT of 839 and a distance of 56 meters. Despite being longer, the red path is more energy-efficient because it predominantly follows the road. A video of the mapping is available at https://www.youtube.com/watch?v=tnxdjiAG2Sc
  • Figure 3: Our RGBD COT model takes an RGB image and a depth image as inputs and outputs a pixel-wise COT image. The RGB and depth images are of size $3\times H\times W$ and $1\times H\times W$, respectively. The pixel-wise COT image is then projected into a local BEV map using the robot position and the depth image. A heuristic map merger combines all the local BEV maps into a global map.
  • Figure 4: The left image displays the camera's view, while the right one shows the rendered world. The elements have been colored for visualization purposes. The triangle mesh COT is represented in black, the point cloud in red, and the non-traversable elements from assumption (3) are colored in blue. The path length and the region size for the blue points have been determined using the maximum depth distance of the camera. In the right image, blue pixels would be assigned a high COT, black pixels a low COT, and red and white pixels an unknown COT.
  • Figure 5: Autoencoder architecture for labeling unlabeled data. It takes as input RGBD images in a $4\times H \times W$ tensor. The architecture is composed of two CNN (Convolutional Neural Network) autoencoders, one nested inside the other. $I$ denotes the input data, $L_1$ the first latent space, $L_2$ the second latent space, and $O$ the output space.
  • ...and 2 more figures
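
The planning behaviour described in the Figure 2 caption, where a longer route along the road has a lower aggregate COT than a shorter route across grass, can be illustrated with a minimal A* sketch on a toy COT grid. Everything here is an assumption made for illustration: the 4-connected grid, the rule that a step costs the COT of the cell being entered, the heuristic (Manhattan distance scaled by the smallest COT in the map), and the toy values 0.5 for road and 2.0 for grass. It is not the authors' planner.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col) index into the COT grid


def astar_cot(cot: List[List[float]], start: Cell,
              goal: Cell) -> Optional[Tuple[List[Cell], float]]:
    """A* on a 4-connected grid; entering a cell costs that cell's COT.

    Assumes all COT values are non-negative. Returns (path, aggregate COT),
    or None if the goal cannot be reached.
    """
    rows, cols = len(cot), len(cot[0])
    # Scaling by the cheapest cell keeps the heuristic admissible.
    min_cot = min(min(row) for row in cot)

    def h(c: Cell) -> float:
        return min_cot * (abs(c[0] - goal[0]) + abs(c[1] - goal[1]))

    best: Dict[Cell, float] = {start: 0.0}  # cheapest known cost to each cell
    parent: Dict[Cell, Cell] = {}
    frontier = [(h(start), start)]
    while frontier:
        _, cur = heapq.heappop(frontier)
        if cur == goal:
            total = best[goal]
            path = [cur]
            while cur in parent:  # walk back to the start
                cur = parent[cur]
                path.append(cur)
            return path[::-1], total
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            cost = best[cur] + cot[nxt[0]][nxt[1]]
            if cost < best.get(nxt, float("inf")):
                best[nxt] = cost
                parent[nxt] = cur
                heapq.heappush(frontier, (cost + h(nxt), nxt))
    return None


# Toy map: 0.5 = road (top row and both side columns), 2.0 = grass.
grid = [
    [0.5, 0.5, 0.5, 0.5, 0.5],
    [0.5, 2.0, 2.0, 2.0, 0.5],
    [0.5, 2.0, 2.0, 2.0, 0.5],
]
path, total = astar_cot(grid, start=(2, 0), goal=(2, 4))
print(len(path) - 1, "steps, aggregate COT", total)
```

In this toy map the straight route along the bottom row takes 4 steps but accumulates a COT of 6.5, while the planner returns the 8-step detour along the road with an aggregate COT of 4.0, mirroring the qualitative trade-off in the caption.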