Normal Transformer: Extracting Surface Geometry from LiDAR Points Enhanced by Visual Semantics
Ancheng Lin, Jun Li, Yusheng Xiang, Wei Bian, Mukesh Prasad
TL;DR
The paper tackles surface normal estimation from sparse, noisy outdoor LiDAR scans by fusing 3D geometry with visual semantics from camera images. It introduces the Hybrid Geometric Transformer (HGT), a multi-modal, transformer-based architecture that aggregates per-point image, geometric, and positional features and processes them through three self-attention blocks to predict unit normal vectors. A practical training strategy combines batching with synthetic Unity3D data to make learning feasible in large outdoor scenes. Qualitative and quantitative results on synthetic data and the KITTI dataset, including improved 3D reconstruction, show that HGT outperforms baselines, is robust to noise, and generalizes well to real-world urban scenes, offering a practical path to better geometry understanding in autonomous driving. The learning objective is the mean squared error between estimated and ground-truth normals, $L = \frac{1}{N}\sum_{j=1}^N \|{\boldsymbol{n}}^{est}_{(j)} - {\boldsymbol{n}}^{gt}_{(j)}\|_2^2$ subject to $\|{\boldsymbol{n}}^{est}_{(j)}\|_2 = 1$, and the model’s attention maps reveal effective cross-region semantic fusion.
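The loss stated above can be illustrated with a minimal Python sketch. It assumes the unit-length constraint is enforced by normalizing each raw prediction before comparison; the summary does not say how HGT enforces it, so this is an assumption, and the names `normalize` and `normal_loss` are hypothetical helpers rather than part of the authors' code.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length (assumed way to satisfy ||n_est||_2 = 1)."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [c / norm for c in v]


def normal_loss(raw_predictions, gt_normals):
    """Mean over N points of the squared L2 distance between the
    unit-normalized prediction and the ground-truth normal."""
    if not raw_predictions or len(raw_predictions) != len(gt_normals):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for raw, gt in zip(raw_predictions, gt_normals):
        est = normalize(raw)
        # Squared Euclidean distance for one point.
        total += sum((e - g) ** 2 for e, g in zip(est, gt))
    return total / len(raw_predictions)


# Two points: the first prediction matches its target after normalization,
# the second is perpendicular to its target.
raw = [[0.0, 0.0, 2.0], [1.0, 0.0, 0.0]]
gt = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
loss = normal_loss(raw, gt)
```

When both vectors have unit length, the squared distance equals $2(1 - \cos\theta)$, where $\theta$ is the angle between them, so minimizing this loss is equivalent to maximizing the average cosine similarity. In the example, the first point contributes 0 and the second contributes 2, giving a mean of 1.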
Abstract
High-quality surface normals can help improve geometry estimation in problems faced by autonomous vehicles, such as collision avoidance and occlusion inference. While a considerable volume of literature focuses on densely scanned indoor scenarios, normal estimation during autonomous driving remains an intricate problem due to the sparse, non-uniform, and noisy nature of real-world LiDAR scans. In this paper, we introduce a multi-modal technique that leverages 3D point clouds and 2D colour images obtained from LiDAR and camera sensors for surface normal estimation. We present the Hybrid Geometric Transformer (HGT), a novel transformer-based neural network architecture that proficiently fuses visual semantic and 3D geometric information. Furthermore, we develop an effective learning strategy for the multi-modal data. Experimental results demonstrate that our information fusion approach is more effective than existing methods. We also verify that the proposed model can learn from a simulated 3D environment that mimics a traffic scene. The learned geometric knowledge is transferable and can be applied to real-world 3D scenes in the KITTI dataset. Downstream tasks built upon the normal vectors estimated on the KITTI dataset show that the proposed estimator has an advantage over existing methods.
