A Hybrid Autoencoder for Robust Heightmap Generation from Fused Lidar and Depth Data for Humanoid Robot Locomotion
Dennis Bank, Joost Cordes, Thomas Seel, Simon F. G. Ehlers
TL;DR
The paper tackles terrain perception for humanoid locomotion in unstructured environments by replacing monolithic single-sensor pipelines with a learning-based, modular perception-to-control framework. It introduces a hybrid Encoder-Decoder Structure (EDS) that fuses depth images, LiDAR point clouds (converted to images via spherical projection), and IMU data into a robot-centric 2D heightmap, with a GRU core providing temporal consistency. Training proceeds in two stages: unsupervised autoencoder pretraining, followed by supervised end-to-end training. With a heightmap grid of 0.98 m × 0.70 m at 7 cm cells, centered 0.2 m ahead of the robot, the model reaches a reconstruction MAE of 2.19 cm. Multimodal fusion outperforms the depth-only and LiDAR-only baselines by 7.2% and 9.9%, respectively, and a 3.2 s temporal context reduces drift by about 30%. The resulting perception-to-control pipeline improves anticipatory gait, reduces falls by over 70%, and remains robust to moderate perceptual noise, offering a scalable path toward autonomous humanoid operation in complex, unstructured environments.
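To make the grid geometry concrete, the following is a minimal sketch of how a 0.98 m × 0.70 m grid with 7 cm cells, centered 0.2 m ahead of the robot, could be indexed. It is not the authors' implementation. The names (`cell_index`, `ROWS`, `COLS`) are hypothetical, and the assignment of the 0.98 m side to the forward (x) axis is an assumption that the summary does not state.

```python
import math

CELL = 0.07       # cell size in metres (from the text)
LENGTH_X = 0.98   # grid extent along x; ASSUMED to be the forward axis
WIDTH_Y = 0.70    # grid extent along y; ASSUMED to be the lateral axis
CENTER_X = 0.20   # grid centre lies 0.2 m ahead of the robot (from the text)

ROWS = round(LENGTH_X / CELL)  # 14 cells along x
COLS = round(WIDTH_Y / CELL)   # 10 cells along y

X_MIN = CENTER_X - LENGTH_X / 2  # rear edge of the grid in the robot frame
Y_MIN = -WIDTH_Y / 2             # one lateral edge of the grid


def cell_index(x, y):
    """Map a point (x, y) in the robot frame to (row, col), or None if outside."""
    row = math.floor((x - X_MIN) / CELL)
    col = math.floor((y - Y_MIN) / CELL)
    if 0 <= row < ROWS and 0 <= col < COLS:
        return row, col
    return None
```

Under these assumptions the grid holds 14 × 10 = 140 height values and spans roughly x ∈ [−0.29, 0.69] m, so most of its area lies in front of the robot.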
Abstract
Reliable terrain perception is a critical prerequisite for the deployment of humanoid robots in unstructured, human-centric environments. While traditional systems often rely on manually engineered, single-sensor pipelines, this paper presents a learning-based framework that uses an intermediate, robot-centric heightmap representation. A hybrid Encoder-Decoder Structure (EDS) is introduced, utilizing a Convolutional Neural Network (CNN) for spatial feature extraction fused with a Gated Recurrent Unit (GRU) core for temporal consistency. The architecture integrates multimodal data from an Intel RealSense depth camera, a LIVOX MID-360 LiDAR processed via efficient spherical projection, and an onboard IMU. Quantitative results demonstrate that multimodal fusion improves reconstruction accuracy by 7.2% over depth-only and 9.9% over LiDAR-only configurations. Furthermore, the integration of a 3.2 s temporal context reduces mapping drift.
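The spherical projection mentioned in the abstract can be illustrated with a generic sketch. It maps each LiDAR point to a pixel of a range image by its azimuth and elevation, so that a CNN can consume the scan as an image. This is the common textbook formulation, not necessarily the one used in the paper. The function name, the image size (64 × 16), and the vertical field of view (−7° to 52°) are assumptions chosen for illustration.

```python
import math


def spherical_projection(points, width=64, height=16,
                         fov_up_deg=52.0, fov_down_deg=-7.0):
    """Project (x, y, z) points into a height x width range image.

    All default parameters are ASSUMED values, not taken from the paper.
    Pixels without a return keep the value 0.0.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[0.0] * width for _ in range(height)]

    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # skip degenerate points at the sensor origin
        yaw = math.atan2(y, x)     # azimuth in [-pi, pi]
        pitch = math.asin(z / r)   # elevation in [-pi/2, pi/2]
        if not (fov_down <= pitch <= fov_up):
            continue  # outside the assumed vertical field of view
        # Normalise the angles to pixel coordinates and clamp to the image.
        u = min(int(0.5 * (1.0 - yaw / math.pi) * width), width - 1)
        v = min(int((fov_up - pitch) / fov * height), height - 1)
        # Keep the closest return when several points share a pixel.
        if image[v][u] == 0.0 or r < image[v][u]:
            image[v][u] = r
    return image
```

Keeping the nearest return per pixel is one simple way to resolve collisions. A dense 2D image of this kind lets the LiDAR branch share the same convolutional machinery as the depth-camera branch.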
