AlignMiF: Geometry-Aligned Multimodal Implicit Field for LiDAR-Camera Joint Synthesis

Tao Tang, Guangrun Wang, Yixing Lao, Peng Chen, Jie Liu, Liang Lin, Kaicheng Yu, Xiaodan Liang

TL;DR

This paper addresses the misalignment between LiDAR and camera modalities in multimodal NeRF fusion, which prevents joint LiDAR-camera synthesis from outperforming unimodal baselines. It introduces AlignMiF, which consists of two modules: Geometry-Aware Alignment (GAA) and Shared Geometry Initialization (SGI). Together they align the coarse geometry across modalities while preserving modality-specific details, by decomposing the hash encoding and initializing from a pre-trained field. Across KITTI-360, Waymo, and synthetic AIODrive data, AlignMiF outperforms both single-modality baselines and prior multimodal fusion methods. On KITTI-360 and Waymo respectively, it improves image PSNR by +2.01 and +3.11 over recent implicit fusion methods and reduces LiDAR Chamfer Distance by 13.8% and 14.2% relative to single-modality performance. The work offers a principled approach to multimodal implicit-field fusion built on geometric alignment, with practical implications for more accurate joint scene synthesis and potential downstream benefits in perception tasks such as fusion-based detection.

Abstract

Neural implicit fields have been a de facto standard in novel view synthesis. Recently, some methods have explored fusing multiple modalities within a single field, aiming to share implicit features from different modalities to enhance reconstruction performance. However, these modalities often exhibit misaligned behaviors: optimizing for one modality, such as LiDAR, can adversely affect another, like camera performance, and vice versa. In this work, we conduct comprehensive analyses on the multimodal implicit field of LiDAR-camera joint synthesis, revealing that the underlying issue lies in the misalignment of different sensors. Furthermore, we introduce AlignMiF, a geometrically aligned multimodal implicit field with two proposed modules: Geometry-Aware Alignment (GAA) and Shared Geometry Initialization (SGI). These modules effectively align the coarse geometry across different modalities, significantly enhancing the fusion process between LiDAR and camera data. Through extensive experiments across various datasets and scenes, we demonstrate the effectiveness of our approach in facilitating better interaction between LiDAR and camera modalities within a unified neural field. Specifically, our proposed AlignMiF achieves remarkable improvement over recent implicit fusion methods (+2.01 and +3.11 image PSNR on the KITTI-360 and Waymo datasets) and consistently surpasses single-modality performance (13.8% and 14.2% reduction in LiDAR Chamfer Distance on the respective datasets).
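
The two metrics quoted above, image PSNR and LiDAR Chamfer Distance, can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function names `psnr` and `chamfer_distance` are hypothetical, pixel values are assumed to lie in [0, 1], and the Chamfer Distance shown is one common variant (mean nearest-neighbour Euclidean distance, summed over both directions). The paper may use a different convention, such as squared distances.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumes intensities in [0, max_val]; higher is better.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between two point sets; lower is better.

    Assumed variant: mean nearest-neighbour Euclidean distance from A to B
    plus the same from B to A. Brute force, O(len(A) * len(B)).
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src, dst):
        # Average distance from each source point to its closest target point.
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


if __name__ == "__main__":
    # Toy values chosen only for illustration, not taken from the paper.
    rendered = [0.5, 0.5, 0.5, 0.5]
    reference = [0.6, 0.4, 0.5, 0.5]
    print(f"PSNR: {psnr(rendered, reference):.2f} dB")

    synthesized = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
    print(f"Chamfer Distance: {chamfer_distance(synthesized, ground_truth):.3f}")
```

Note that the reported PSNR gains are absolute differences in dB, whereas the Chamfer Distance gains are relative reductions against the single-modality result.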

Paper Structure (21 sections, 5 equations, 15 figures, 10 tables)

This paper contains 21 sections, 5 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: The misalignment issue in multimodal implicit field. For implicit neural fusion, there is a trade-off between the modalities due to the misalignment, making it challenging to improve both modalities simultaneously. Conversely, our method addresses the misalignment issue and achieves boosted multimodal performance. The metrics are PSNR and Chamfer Distance (C-D).
  • Figure 2: Analysis of misalignment from raw sensor inputs. (a) Original image, (b) Image with projected points from the associated LiDAR frame, (c) LiDAR points of all scene frames. As highlighted in the red box, the observations obtained from LiDAR and camera sensors for the same pole are distinct (zoom in for a better view).
  • Figure 3: Analysis of misalignment from bird's eye view hash grid features. We show the first 4 levels of the hash features on the x-y plane. The camera is front-facing along the trajectory and brighter or more saturated colors represent higher feature values.
  • Figure 4: The illustration of our AlignMiF framework. The proposed Geometry-Aware Alignment (GAA) of the decomposed hash encoding and the Shared Geometry Initialization (SGI) are incorporated together to tackle the misalignment issue.
  • Figure 5: Analysis of misalignment from the density values and qualitative analysis of our proposed GAA.
  • ...and 10 more figures