
Exploring Multi-modal Neural Scene Representations With Applications on Thermal Imaging

Mert Özer, Maximilian Weiherer, Martin Hundhausen, Bernhard Egger

TL;DR

This work systematically examines how to fuse a second modality with RGB in Neural Radiance Fields (NeRFs), using thermal imaging as a challenging test case. It introduces the ThermalMix dataset and compares four fusion strategies built on a shared NeRF backbone (Instant-NGP). RGB-X, a single multi-modal representation with a dedicated second-modality branch, delivers the strongest thermal reconstructions while keeping RGB results compelling, and the analysis extends to near-infrared (NIR) images and depth maps. The study provides practical guidance for building general multi-modal neural scene representations and offers a public benchmark with cross-modality-calibrated RGB and thermal images. Overall, the findings suggest RGB-X as a flexible and effective way to integrate diverse modalities into neural scene representations, with potential applications in surveillance, agriculture, and medical imaging.

Abstract

Neural Radiance Fields (NeRFs) have quickly become the de facto standard for novel view synthesis when trained on a set of RGB images. In this paper, we conduct a comprehensive evaluation of neural scene representations, such as NeRFs, in the context of multi-modal learning. Specifically, we present four strategies for incorporating a second modality, other than RGB, into NeRFs: (1) training from scratch independently on both modalities; (2) pre-training on RGB and fine-tuning on the second modality; (3) adding a second branch; and (4) adding a separate component to predict (color) values of the additional modality. We chose thermal imaging as the second modality because it differs strongly from RGB in terms of radiosity, which makes it challenging to integrate into neural scene representations. To evaluate the proposed strategies, we captured a new, publicly available multi-view dataset, ThermalMix, consisting of six common objects and about 360 RGB and thermal images in total. We perform cross-modality calibration prior to data capture, leading to high-quality alignment between RGB and thermal images. Our findings reveal that adding a second branch to NeRF performs best for novel view synthesis on thermal images while also yielding compelling results on RGB. Finally, we show that our analysis generalizes to other modalities, including near-infrared images and depth maps. Project page: https://mert-o.github.io/ThermalNeRF/.
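
As an illustration of strategy (3), the following is a minimal sketch of a field with a shared trunk and two output branches, one for RGB and one for the second modality. It is not the authors' implementation. The class name TwoBranchField, the layer sizes, the activations, the use of the view direction in both branches, and the single density shared by both modalities are all assumptions made for illustration. A toy linear layer with random weights stands in for the Instant-NGP backbone.

```python
import math
import random


def make_linear(n_in, n_out, rng):
    """Random (n_out x n_in) weights and zero bias; a stand-in for trained weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    """Apply a (weights, bias) layer to the vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


class TwoBranchField:
    """Hypothetical two-branch field: shared trunk, one head per modality."""

    def __init__(self, feat_dim=16, seed=0):
        rng = random.Random(seed)
        # Shared trunk: 3D position -> features (toy replacement for the backbone).
        self.trunk = make_linear(3, feat_dim, rng)
        # Assumption: a single density is shared by both modalities.
        self.density_head = make_linear(feat_dim, 1, rng)
        # Branch 1: features + view direction -> RGB.
        self.rgb_head = make_linear(feat_dim + 3, 3, rng)
        # Branch 2: features + view direction -> one value of the second modality.
        self.x_head = make_linear(feat_dim + 3, 1, rng)

    def query(self, position, view_dir):
        feats = relu(linear(self.trunk, list(position)))
        raw_density = linear(self.density_head, feats)[0]
        density = math.log1p(math.exp(raw_density))  # softplus keeps density positive
        head_input = feats + list(view_dir)
        rgb = sigmoid(linear(self.rgb_head, head_input))
        x_value = sigmoid(linear(self.x_head, head_input))[0]
        return density, rgb, x_value


field = TwoBranchField(seed=0)
density, rgb, x_value = field.query([0.1, 0.2, 0.3], [0.0, 0.0, 1.0])
print(density, rgb, x_value)
```

Because both heads read the same trunk features, a single query returns values for both modalities at once, which is what makes the result one multi-modal representation rather than two separate models.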

Paper Structure (17 sections, 6 equations, 14 figures, 4 tables)

Figures (14)

  • Figure 1: Overview of the four strategies that we compare within this work. In the first strategy (TS), we train a NeRF-like base model (Instant-NGP [Mueller2022] in our case) from scratch, separately for RGB and the second modality. In the second strategy (FT), we first pre-train our base model on RGB data and then fine-tune it on images of the second modality. The third strategy (RGB-X) adds a second branch, whereas the fourth (SC) adds an extra network to predict values of the additional modality. Note that RGB-X and SC yield a single, multi-modal scene representation, whereas TS and FT always result in two separate models, one for each modality.
  • Figure 1: Demonstration of how challenging it is to compute reliable camera poses from thermal images. We visualize feature correspondences between two views on RGB (top row) and thermal images (bottom row; found and matched using COLMAP [Schoenberger2016]).
  • Figure 2: Overview of our newly-captured dataset containing high-quality aligned RGB and thermal images of six common objects. Face, Hand, and Panel are forward-facing scenes consisting of around 40 images each. Lion, Pan, and Laptop are 360-degree scenes, where each scene has around 80 images.
  • Figure 2: Comparison of the reconstructed geometry in RGB (first row) and thermal images (second and third row), shown for TS and RGB-$t$. The first column shows novel renderings, the second column visualizes accumulated densities for each pixel along its respective ray, and the third column depicts estimated depth maps. Thermal-derived geometry clearly benefits from RGB densities: compare the depth maps from TS trained solely on thermal images (second row) with those produced by RGB-$t$ (third row), which, unlike TS, incorporates RGB information.
  • Figure 3: Reconstructions of a (left-out) thermal image from multi-modal neural scene representations trained on RGB and thermal data, arising from the four strategies that we compare. For each view, we also report PSNR and SSIM (higher is better). Closest view denotes the nearest image in the training set.
  • ...and 9 more figures
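
The captions above refer to accumulated densities along a ray, estimated depth maps, and PSNR. The sketch below shows how these quantities are commonly computed with standard NeRF volume rendering and the usual PSNR definition. It is not taken from the paper's code: the function names composite_ray and psnr, the sample values, and the assumption that images are scaled to [0, 1] are all illustrative.

```python
import math


def composite_ray(densities, t_values, deltas):
    """Standard NeRF compositing for one ray.

    densities: density at each sample, t_values: sample distances along the ray,
    deltas: spacing assigned to each sample.
    Returns (accumulated opacity, expected depth).
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    accumulated = 0.0
    depth = 0.0
    for sigma, t, delta in zip(densities, t_values, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        accumulated += weight
        depth += weight * t
        transmittance *= 1.0 - alpha
    return accumulated, depth


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally long pixel lists."""
    mse = sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical ray: empty space followed by a dense surface at distance 2.0.
acc, depth = composite_ray([0.0, 50.0, 50.0], [1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(acc, depth)
print(psnr([0.0, 0.5, 1.0], [0.1, 0.5, 0.9]))
```

With these weights, a ray that passes only through empty space accumulates an opacity of zero, while a ray that hits a dense surface accumulates an opacity close to one and reports a depth close to the distance of that surface.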