KD360-VoxelBEV: LiDAR and 360-degree Camera Cross Modality Knowledge Distillation for Bird's-Eye-View Segmentation

Wenke E, Yixin Sun, Jiaxu Liu, Hubert P. H. Shum, Amir Atapour-Abarghouei, Toby P. Breckon

TL;DR

The paper tackles BEV segmentation from a single 360° camera by enabling training-time guidance from a LiDAR–camera Teacher. It introduces a unified LiDAR image representation, a voxel-aligned view transformer, a Soft-Gated Fusion Module, and an Auxiliary Module to distill rich multimodal knowledge into a lightweight camera-only Student. Key contributions include the first cross-modality distillation framework for 360° BEV, a voxel-aligned projection that preserves geometry, and comprehensive evaluations on Dur360BEV and KITTI-360 showing improved accuracy and real-time performance. The approach reduces sensor complexity while delivering practical, deployment-friendly BEV segmentation for autonomous driving.

Abstract

We present the first cross-modality distillation framework specifically tailored for single-panoramic-camera Bird's-Eye-View (BEV) segmentation. Our approach leverages a novel LiDAR image representation fused from range, intensity and ambient channels, together with a voxel-aligned view transformer that preserves spatial fidelity while enabling efficient BEV processing. During training, a high-capacity LiDAR and camera fusion Teacher network extracts both rich spatial and semantic features for cross-modality knowledge distillation into a lightweight Student network that relies solely on a single 360-degree panoramic camera image. Extensive experiments on the Dur360BEV dataset demonstrate that our Teacher model significantly outperforms existing camera-based BEV segmentation methods, achieving a 25.6% IoU improvement. Meanwhile, the distilled Student network attains competitive performance with an 8.5% IoU gain and state-of-the-art inference speed of 31.2 FPS. Moreover, evaluations on KITTI-360 (two fisheye cameras) confirm that our distillation framework generalises to diverse camera setups, underscoring its feasibility and robustness. This approach reduces sensor complexity and deployment costs while providing a practical solution for efficient, low-cost BEV segmentation in real-world autonomous driving.
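
To make the distillation idea concrete, the sketch below shows one common formulation of channel-wise feature distillation: each channel of a feature map is turned into a probability distribution over spatial positions, and the Student is penalised by the KL divergence from the Teacher's distribution. This is an illustrative assumption, not the paper's exact loss; the function names, the temperature parameter, and the flat-list feature layout are all hypothetical.

```python
import math


def spatial_log_softmax(channel, temperature):
    """Log-softmax over all spatial positions of one flattened channel."""
    scaled = [value / temperature for value in channel]
    peak = max(scaled)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in scaled))
    return [v - log_norm for v in scaled]


def channel_wise_kd_loss(teacher, student, temperature=1.0):
    """KL(teacher || student) per channel, averaged over channels.

    teacher, student: lists of C channels, each a flat list of H*W activations.
    The T^2 scaling is a conventional choice and an assumption here.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student need the same number of channels")
    total = 0.0
    for t_channel, s_channel in zip(teacher, student):
        log_p = spatial_log_softmax(t_channel, temperature)
        log_q = spatial_log_softmax(s_channel, temperature)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return (temperature ** 2) * total / len(teacher)


# Toy usage: one channel with four spatial positions.
teacher_features = [[0.0, 1.0, 2.0, 3.0]]
student_features = [[3.0, 2.0, 1.0, 0.0]]
loss = channel_wise_kd_loss(teacher_features, student_features, temperature=2.0)
```

Because the softmax runs over spatial positions, the loss ignores a constant offset within a channel and focuses on where each channel responds most strongly, which is one reason this style of loss is often used for dense prediction tasks.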

Paper Structure

This paper contains 35 sections, 9 equations, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: Overview of cross-modality channel-wise knowledge distillation from fused LiDAR–camera features (Teacher) to a single 360-degree camera model (Student) for enhanced feature representation and scene understanding.
  • Figure 2: Illustration of the Dur360BEV dataset [dur360bev2025]. (a) LiDAR data in equirectangular representation [Top: range image; Middle: intensity image; Bottom: ambient image]. (b) Dual-fisheye spherical image. (c) Equirectangular-projected 360-degree image.
  • Figure 3: Sparse Voxel Pulling Module (View Transformer). 3D voxels derived from LiDAR point cloud are mapped to localised equirectangular LiDAR features, then bilinearly interpolated to form 3D BEV features.
  • Figure 4: Overview of the proposed KD360-VoxelBEV architecture. Teacher network (green): equipped with the SGFM (blue dashed block), which integrates LiDAR range, intensity, and ambient cues with 360-degree camera features to produce enriched BEV representations. AM (pink): fuses the Student and pre-trained LiDAR branch during training to reduce the feature gap between Teacher and Student, providing additional reliable guidance. Student network (blue): a camera-only BEV segmentation model that benefits from cross-modal distillation, achieving robust BEV predictions from a single 360-degree input image. Distillation (red dashed block): highlights the regions where multi-channel dense feature distillation is applied, specifically between Teacher–Student and Student–Auxiliary pairs. At inference, only the Student network is employed, ensuring lightweight and deployment-friendly BEV segmentation.
  • Figure 5: Illustration of distillation and auxiliary details. Feature maps from the decoder are used for channel-wise distillation, which is applied between Teacher and Student as well as between Auxiliary and Student, as indicated by the red dashed arrows.
  • ...and 6 more figures
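
The voxel-pulling step described in the Figure 3 caption can be sketched roughly as follows: each 3D voxel centre is converted to azimuth and elevation, mapped to a continuous pixel position in an equirectangular feature map, and the feature there is read by bilinear interpolation. This is a minimal sketch under stated assumptions, not the authors' implementation. The angle conventions, the full ±90° vertical field of view, the single-channel feature map, and all function names are hypothetical.

```python
import math


def voxel_to_equirect(x, y, z, width, height,
                      elev_min=-math.pi / 2, elev_max=math.pi / 2):
    """Map a 3D point to continuous (u, v) coordinates in an equirectangular image.

    Assumed convention: column grows with azimuth, row 0 is the highest elevation.
    """
    azimuth = math.atan2(y, x)                   # in [-pi, pi]
    elevation = math.atan2(z, math.hypot(x, y))  # in [-pi/2, pi/2]
    u = (azimuth + math.pi) / (2.0 * math.pi) * width
    v = (elev_max - elevation) / (elev_max - elev_min) * (height - 1)
    return u, v


def bilinear_sample(image, u, v):
    """Bilinearly sample an H x W list of lists; wrap horizontally, clamp vertically."""
    height, width = len(image), len(image[0])
    v = min(max(v, 0.0), height - 1.0)
    u0, v0 = math.floor(u), math.floor(v)
    du, dv = u - u0, v - v0
    col0, col1 = u0 % width, (u0 + 1) % width    # 360-degree wrap-around
    row0, row1 = v0, min(v0 + 1, height - 1)
    top = (1.0 - du) * image[row0][col0] + du * image[row0][col1]
    bottom = (1.0 - du) * image[row1][col0] + du * image[row1][col1]
    return (1.0 - dv) * top + dv * bottom


def pull_voxel_features(image, voxel_centres):
    """Return one interpolated feature per voxel centre (x, y, z)."""
    height, width = len(image), len(image[0])
    return [
        bilinear_sample(image, *voxel_to_equirect(x, y, z, width, height))
        for x, y, z in voxel_centres
    ]


# Toy usage: a 2 x 4 feature map and two voxel centres.
feature_map = [[0.0, 1.0, 2.0, 3.0],
               [10.0, 11.0, 12.0, 13.0]]
pulled = pull_voxel_features(feature_map, [(1, 0, 0), (0, 0, 1)])
```

Wrapping the column index horizontally reflects the fact that the left and right edges of a 360-degree equirectangular image are adjacent in the scene, so a voxel near the seam still interpolates between true neighbours.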