HumanoidPano: Hybrid Spherical Panoramic-LiDAR Cross-Modal Perception for Humanoid Robots

Qiang Zhang, Zhang Zhang, Wei Cui, Jingkai Sun, Jiahang Cao, Yijie Guo, Gang Han, Wen Zhao, Jiaxu Wang, Chenghao Sun, Lingfeng Zhang, Hao Cheng, Yujie Chen, Lin Wang, Jian Tang, Renjing Xu

TL;DR

The paper tackles robust environmental perception for humanoid robots with self-occlusion and limited FOV. It introduces HumanoidPano, a three-stage, geometry-aware framework that fuses panoramic vision and LiDAR via Spherical Geometry-aware Constraints, Spatial Deformable Attention, and Panoramic Augmentation to produce real-time BEV semantic maps. It achieves state-of-the-art results on the 360BEV-Matterport benchmark and is validated on a full humanoid platform with a 360° sensing setup and a 10 Hz processing pipeline. The work demonstrates that aligning perception algorithms with humanoid morphology enables robust navigation in complex environments.

Abstract

The perceptual system design for humanoid robots poses unique challenges due to inherent structural constraints that cause severe self-occlusion and limited field-of-view (FOV). We present HumanoidPano, a novel hybrid cross-modal perception framework that synergistically integrates panoramic vision and LiDAR sensing to overcome these limitations. Unlike conventional robot perception systems that rely on monocular cameras or standard multi-sensor configurations, our method establishes geometrically aware modality alignment through a spherical vision transformer, enabling seamless fusion of 360° visual context with LiDAR's precise depth measurements. First, Spherical Geometry-aware Constraints (SGC) leverage panoramic camera ray properties to guide distortion-regularized sampling offsets for geometric alignment. Second, Spatial Deformable Attention (SDA) aggregates hierarchical 3D features via spherical offsets, enabling efficient 360°-to-BEV fusion with geometrically complete object representations. Third, Panoramic Augmentation (AUG) combines cross-view transformations and semantic alignment to enhance BEV-panoramic feature consistency during data augmentation. Extensive evaluations demonstrate state-of-the-art performance on the 360BEV-Matterport benchmark. Real-world deployment on humanoid platforms validates the system's capability to generate accurate BEV segmentation maps through panoramic-LiDAR co-perception, directly enabling downstream navigation tasks in complex environments. Our work establishes a new paradigm for embodied perception in humanoid robotics.
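
To make the phrase "panoramic camera ray properties" concrete, the following is a minimal sketch of how a pixel in an equirectangular panorama maps to a unit ray direction, and how horizontal stretching grows toward the poles. It is not the authors' SGC implementation. The function names `pixel_to_ray` and `horizontal_stretch` and the axis convention (x forward, y left, z up) are assumptions made for illustration.

```python
import math


def pixel_to_ray(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit ray direction (x, y, z).

    Assumed convention (not taken from the paper): longitude runs from -pi
    (left edge) to +pi (right edge), latitude from +pi/2 (top) to -pi/2
    (bottom); x points forward, y left, z up.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.cos(lon)
    y = -math.cos(lat) * math.sin(lon)  # image right corresponds to negative y
    z = math.sin(lat)
    return (x, y, z)


def horizontal_stretch(v, height):
    """Horizontal stretch factor of an equirectangular row: 1 / cos(latitude).

    It equals 1 at the equator and grows toward the poles, which is the kind
    of distortion a geometry-aware sampling scheme has to account for.
    """
    lat = (0.5 - (v + 0.5) / height) * math.pi
    return 1.0 / math.cos(lat)
```

For an image 8 pixels wide and 4 pixels high, the image center `(3.5, 1.5)` maps to the forward ray `(1, 0, 0)`, and rows near the top or bottom have a stretch factor above 1.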

Paper Structure

This paper contains 18 sections, 12 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: The humanoid robot autonomously navigates complex environments using HumanoidPano, which fuses panoramic vision and LiDAR to generate real-time BEV semantic maps. The figure depicts motion trajectory while visualizing real-time visual sensor data and perception results in the upper-left corner, alongside detailed illustrations of the HumanoidPano universal perception module.
  • Figure 2: HumanoidPano Framework Overview. The system addresses panoramic image distortion using Spherical Geometric Constraints (SGC) to guide 3D adaptive sampling offsets. By encoding camera ray properties into offset prediction, SGC aligns panoramic features with depth measurements, reducing projection artifacts. Spatial Deformable Attention (SDA) adaptively aggregates geometrically consistent object representations from panoramic images.
  • Figure 3: Visualization of LiDAR Depth Extraction and Projection Process. First, LiDAR captures raw point clouds, which are then projected onto panoramic images using the camera’s intrinsic and extrinsic parameters. This projected depth information is fed into the network, enabling the retrieval of pixel-wise semantics corresponding to the depth values.
  • Figure 4: Hardware design of our universal sensor module, integrating an Insta360 X4 panoramic camera and Livox Mid-360 LiDAR with precise spatial alignment. The compact, lightweight assembly minimizes self-occlusion while maximizing 360° coverage. Its modular design supports flexible deployment across humanoid platforms for real-time panoramic-LiDAR fusion in navigation and manipulation tasks.
  • Figure 5: The data distribution of 360BEV-Matterport visualized by category [teng2024360bev]. We directly adopted the well-established indoor classification scheme and visual schematics from 360BEV-Matterport for dataset presentation. These categories enable humanoid robots to effectively perform indoor navigation tasks.
  • ...and 4 more figures
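
The LiDAR-to-panorama projection described in the Figure 3 caption can be sketched as follows. This is a simplified illustration, not the paper's pipeline. It assumes an ideal equirectangular camera model, so the only intrinsics are the image width and height, and a rigid LiDAR-to-camera extrinsic given as a rotation matrix and a translation vector. The function name `project_points_to_panorama` and the axis convention (x forward, y left, z up) are hypothetical.

```python
import math


def project_points_to_panorama(points, rotation, translation, width, height):
    """Project LiDAR points into a sparse equirectangular depth map.

    points:      iterable of (x, y, z) in the LiDAR frame.
    rotation:    3x3 nested list, LiDAR -> camera rotation (assumed extrinsic).
    translation: 3-tuple, LiDAR -> camera translation (assumed extrinsic).
    Returns a dict {(row, col): depth} that keeps the nearest point per pixel.
    """
    depth = {}
    for p in points:
        # Rigid transform into the camera frame.
        c = [
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ]
        r = math.sqrt(c[0] ** 2 + c[1] ** 2 + c[2] ** 2)
        if r < 1e-6:
            continue  # skip points at the camera origin
        # Spherical angles of the ray (image right corresponds to negative y).
        lon = math.atan2(-c[1], c[0])
        lat = math.asin(max(-1.0, min(1.0, c[2] / r)))
        # Angles -> pixel indices; wrap columns, clamp rows at the bottom edge.
        col = int((lon / (2.0 * math.pi) + 0.5) * width) % width
        row = min(int((0.5 - lat / math.pi) * height), height - 1)
        key = (row, col)
        if key not in depth or r < depth[key]:
            depth[key] = r
    return depth
```

Keeping the nearest return per pixel is one simple way to resolve several points landing on the same pixel. A real system would also need to handle time synchronization and calibration error, which this sketch ignores.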