Normal Transformer: Extracting Surface Geometry from LiDAR Points Enhanced by Visual Semantics

Ancheng Lin, Jun Li, Yusheng Xiang, Wei Bian, Mukesh Prasad

TL;DR

The paper tackles surface normal estimation from sparse, noisy outdoor LiDAR scans by fusing 3D geometry with visual semantics from images. It introduces the Hybrid Geometric Transformer (HGT), a multi-modal, transformer-based architecture that aggregates per-point image, geometric, and positional features and processes them through three self-attention blocks to predict unit normal vectors. A practical training strategy combines batching with synthetic Unity3D data to enable learning in large outdoor scenes, and the method is demonstrated qualitatively and quantitatively on synthetic data and the KITTI dataset, including 3D reconstruction improvements. The findings show that HGT outperforms baselines, is robust to noise, and generalizes well to real-world urban scenes, offering a practical path to improved geometry understanding in autonomous driving contexts. The learning objective is the mean squared error between estimated and ground-truth normals, $L = \frac{1}{N}\sum_{j=1}^N \|{\boldsymbol{n}}^{est}_{(j)} - {\boldsymbol{n}}^{gt}_{(j)}\|_2^2$, subject to the unit-norm constraint $\|{\boldsymbol{n}}^{est}_{(j)}\|_2 = 1$, and the model’s attention maps reveal effective cross-region semantic fusion.
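
The learning objective stated above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `normalize_rows` and `normal_loss` are hypothetical, and enforcing the unit-norm constraint by dividing each raw prediction by its length is an assumption, since the summary does not say how the constraint is imposed.

```python
import numpy as np

def normalize_rows(v, eps=1e-12):
    """Scale each row of v to unit L2 norm (assumed way to satisfy ||n_est|| = 1)."""
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    return v / np.maximum(norms, eps)

def normal_loss(raw_pred, n_gt):
    """Mean over N points of the squared L2 distance between unit predictions and ground truth."""
    n_est = normalize_rows(np.asarray(raw_pred, dtype=float))
    n_gt = np.asarray(n_gt, dtype=float)
    return float(np.mean(np.sum((n_est - n_gt) ** 2, axis=1)))

# Two toy points: the first prediction is parallel to its ground truth,
# the second is perpendicular to it.
raw = np.array([[0.0, 0.0, 2.0],
                [3.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 1.0],
               [0.0, 1.0, 0.0]])
loss = normal_loss(raw, gt)
print(loss)
```

Because both vectors have unit length, each squared distance equals $2 - 2\,{\boldsymbol{n}}^{est}_{(j)} \cdot {\boldsymbol{n}}^{gt}_{(j)}$, so minimizing $L$ is equivalent to maximizing the average cosine similarity between estimated and ground-truth normals.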

Abstract

High-quality surface normals can help improve geometry estimation in problems faced by autonomous vehicles, such as collision avoidance and occlusion inference. While a considerable volume of literature focuses on densely scanned indoor scenarios, normal estimation during autonomous driving remains an intricate problem due to the sparse, non-uniform, and noisy nature of real-world LiDAR scans. In this paper, we introduce a multi-modal technique that leverages 3D point clouds and 2D colour images obtained from LiDAR and camera sensors for surface normal estimation. We present the Hybrid Geometric Transformer (HGT), a novel transformer-based neural network architecture that proficiently fuses visual semantic and 3D geometric information. Furthermore, we developed an effective learning strategy for the multi-modal data. Experimental results demonstrate the superior effectiveness of our information fusion approach compared to existing methods. It has also been verified that the proposed model can learn from a simulated 3D environment that mimics a traffic scene. The learned geometric knowledge is transferable and can be applied to real-world 3D scenes in the KITTI dataset. Further tasks built upon the estimated normal vectors in the KITTI dataset show that the proposed estimator has an advantage over existing methods.

Paper Structure (22 sections, 5 equations, 13 figures, 2 tables)

Figures (13)

  • Figure 1: Semantic information in related regions can help estimate surface normals. The figure shows normal estimation in two regions, R1 (far away) and R2 (near). R1 is more challenging because its 3D points are sparse. The estimation can be assisted by visual information in nearby or related regions (light green boxes). This figure is best viewed in color.
  • Figure 3: Mechanism of representing geometric information in an attention block. i) Tokens from multi-modal inputs are transformed into an association matrix $\mathbf{A}$, where each row determines how a specific position collects information from the others. See the model section for details. ii) When estimating the normal for a single position (shown in blue), another position (shown in red) has a high value in the corresponding row of $\mathbf{A}$. iii) Although the blue position has sparse neighbours, its normal vector can be accurately estimated with auxiliary information from the red position, which has sufficient neighbours.
  • Figure 4: The pipeline for estimating surface normals from an image and LiDAR points.
  • Figure 5: Architecture of U-Net (Ronneberger et al., 2015).
  • Figure 6: Transformer encoder with three self-attention blocks.
  • ...and 8 more figures
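
The association matrix $\mathbf{A}$ described in the Figure 3 caption can be illustrated with a single self-attention step. The sketch below assumes standard scaled dot-product attention with one head; the token count, feature width, random projection weights, and the names `softmax_rows` and `self_attention` are all hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Return the association matrix A and the fused tokens A @ V.

    Row i of A holds the weights with which position i collects
    information from every position (including itself).
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    a = softmax_rows(q @ k.T / np.sqrt(k.shape[1]))
    return a, a @ v

rng = np.random.default_rng(0)
n_tokens, d_model = 6, 8                      # hypothetical sizes
tokens = rng.normal(size=(n_tokens, d_model))  # stand-in for fused per-point features
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

A, fused = self_attention(tokens, w_q, w_k, w_v)
# Position that contributes most to position 0, excluding position 0 itself.
strongest = int(np.argmax(np.where(np.arange(n_tokens) == 0, -np.inf, A[0])))
print(A.shape, fused.shape, strongest)
```

In the caption's terms, a large entry in the blue position's row at the red position's column means the red position's features dominate the weighted sum used to estimate the blue position's normal.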