A Hybrid Autoencoder for Robust Heightmap Generation from Fused Lidar and Depth Data for Humanoid Robot Locomotion

Dennis Bank, Joost Cordes, Thomas Seel, Simon F. G. Ehlers

TL;DR

The paper tackles terrain perception for humanoid locomotion in unstructured environments by replacing monolithic, single-sensor pipelines with a learning-based, modular perception-to-control framework. It introduces a hybrid Encoder-Decoder Structure (EDS) that fuses depth images, LiDAR scans processed via spherical projection, and IMU data into a robot-centric 2D heightmap, with a GRU core providing temporal consistency. Training proceeds in two stages: unsupervised autoencoder pretraining, followed by supervised end-to-end training. With a tuned heightmap grid (0.98 m × 0.70 m at 7 cm cells, centered 0.2 m ahead of the robot), the model reaches a reconstruction MAE of 2.19 cm. Multimodal fusion outperforms the depth-only and LiDAR-only baselines by 7.2% and 9.9%, respectively, and a 3.2 s temporal context reduces drift by about 30%. The resulting perception-to-control pipeline improves anticipatory gait, reduces falls by over 70%, and remains robust to moderate perceptual noise, offering a scalable path toward autonomous humanoid operation in complex, unstructured environments.
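To make the grid geometry quoted above concrete, the following is a minimal sketch of how a 0.98 m × 0.70 m grid with 7 cm cells, centered 0.2 m ahead of the robot, could be laid out in the robot frame. The class name `HeightmapGrid`, the choice of 0.98 m as the longitudinal (forward) extent, and the sampling at cell centers are assumptions made for illustration; the summary does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeightmapGrid:
    """Hypothetical robot-centric grid; x points forward, y to the left."""
    length_m: float = 0.98         # extent along x (assumed longitudinal)
    width_m: float = 0.70          # extent along y (assumed lateral)
    cell_m: float = 0.07           # cell size from the summary
    center_offset_m: float = 0.20  # grid center ahead of the robot

    def shape(self):
        # Number of cells along x and y.
        rows = round(self.length_m / self.cell_m)
        cols = round(self.width_m / self.cell_m)
        return rows, cols

    def cell_centers(self):
        # (x, y) of every cell center in the robot frame, row by row.
        rows, cols = self.shape()
        x0 = self.center_offset_m - self.length_m / 2.0
        y0 = -self.width_m / 2.0
        return [
            [(x0 + (i + 0.5) * self.cell_m, y0 + (j + 0.5) * self.cell_m)
             for j in range(cols)]
            for i in range(rows)
        ]


grid = HeightmapGrid()
rows, cols = grid.shape()
centers = grid.cell_centers()
print(rows, cols, rows * cols)
print(centers[0][0], centers[-1][-1])
```

Under these assumptions the grid has 14 × 10 cells, so the decoder would predict 140 height values per step, and the grid extends further in front of the robot than behind it.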

Abstract

Reliable terrain perception is a critical prerequisite for the deployment of humanoid robots in unstructured, human-centric environments. While traditional systems often rely on manually engineered, single-sensor pipelines, this paper presents a learning-based framework that uses an intermediate, robot-centric heightmap representation. A hybrid Encoder-Decoder Structure (EDS) is introduced, utilizing a Convolutional Neural Network (CNN) for spatial feature extraction fused with a Gated Recurrent Unit (GRU) core for temporal consistency. The architecture integrates multimodal data from an Intel RealSense depth camera, a LIVOX MID-360 LiDAR processed via efficient spherical projection, and an onboard IMU. Quantitative results demonstrate that multimodal fusion improves reconstruction accuracy by 7.2% over depth-only and 9.9% over LiDAR-only configurations. Furthermore, the integration of a 3.2 s temporal context reduces mapping drift.
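The spherical projection mentioned above turns an unordered LiDAR point cloud into a dense 2D range image that a CNN encoder can consume. The sketch below shows one common way to do this, not the authors' implementation: the function name `spherical_projection`, the image size, the elevation limits, and the maximum range are illustrative placeholders and are not taken from the paper or from the sensor's data sheet.

```python
import math


def spherical_projection(points, height=16, width=64,
                         elev_min_deg=-30.0, elev_max_deg=30.0,
                         max_range=10.0):
    """Project (x, y, z) points onto a height x width range image.

    Columns index azimuth over 360 degrees, rows index elevation
    (top row = highest elevation). Each pixel keeps the nearest
    return; pixels without a return are set to 0.0.
    """
    elev_min = math.radians(elev_min_deg)
    elev_max = math.radians(elev_max_deg)
    image = [[math.inf] * width for _ in range(height)]

    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0 or r > max_range:
            continue  # skip degenerate and out-of-range points
        azimuth = math.atan2(y, x)    # in [-pi, pi]
        elevation = math.asin(z / r)  # in [-pi/2, pi/2]
        if not (elev_min <= elevation <= elev_max):
            continue  # outside the assumed vertical field of view
        col = int((azimuth + math.pi) / (2.0 * math.pi) * width) % width
        row = int((elev_max - elevation) / (elev_max - elev_min) * height)
        row = min(row, height - 1)
        image[row][col] = min(image[row][col], r)

    # Replace untouched pixels with 0.0 to mark "no return".
    return [[0.0 if math.isinf(v) else v for v in row] for row in image]


cloud = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 0.0, -1.0)]
range_image = spherical_projection(cloud)
print(range_image[8][32])
```

In this toy cloud the first two points fall into the same pixel and the nearer one wins, while the third point lies straight below the sensor and is discarded because it is outside the assumed elevation limits.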

Paper Structure (13 sections, 11 figures)

This paper contains 13 sections, 11 figures.

Figures (11)

  • Figure 1: Training environment in Isaac Lab. Thousands of robots train to walk in different environments in parallel.
  • Figure 2: Different heightmap configurations were investigated. The distance between the points, the number of points in the longitudinal and lateral directions, and the position of the heightmap relative to the robot were optimized.
  • Figure 3: Overview of the heightmap and the different sensor modalities. Substantial parts of the heightmap are not directly covered by the depth camera or the LiDAR and need to be reconstructed using information from the past, when they were in view.
  • Figure 4: EDS used to predict the heightmap. It consists of pretrained encoders that compress the information from the depth camera and LiDAR. Furthermore, it takes the current robot state as well as the previous heightmap to predict the current heightmap.
  • Figure 5: Evaluation of the pretrained encoder for the depth camera. The original image can be reconstructed well, showing only small errors at the edges of objects.
  • ...and 6 more figures