3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion

Minchong Chen, Xiaoyun Yuan, Junzhe Wan, Jianing Zhang, Jun Zhang

TL;DR

3M-TI addresses high-quality mobile thermal imaging with a calibration-free, cross-modal diffusion framework that fuses uncalibrated RGB references with low-resolution thermal input in a latent space. It introduces a cross-modal self-attention module that learns cross-modal correspondences within a VAE latent representation, trains with misalignment transformations that simulate real-world parallax and temporal desynchronization, and uses a one-step diffusion process with LoRA fine-tuning. The approach achieves state-of-the-art perceptual quality while preserving fidelity, and it improves downstream tasks such as open-vocabulary detection and semantic segmentation on mobile-scale data. Validation on a real smartphone system supports the method's robustness for mobile thermal perception in safety-critical scenarios.

Abstract

The miniaturization of thermal sensors for mobile platforms inherently limits their spatial resolution and textural fidelity, leading to blurry and less informative images. Existing thermal super-resolution (SR) methods can be grouped into single-image and RGB-guided approaches: the former struggles to recover fine structures from limited information, while the latter relies on accurate and laborious cross-camera calibration, which hinders practical deployment and robustness. Here, we propose 3M-TI, a calibration-free Multi-camera cross-Modality diffusion framework for Mobile Thermal Imaging. At its core, 3M-TI integrates a cross-modal self-attention module (CSM) into the diffusion UNet, replacing the original self-attention layers to adaptively align thermal and RGB features throughout the denoising process, without requiring explicit camera calibration. This design enables the diffusion network to leverage its generative prior to enhance spatial resolution, structural fidelity, and texture detail in the super-resolved thermal images. Extensive evaluations on real-world mobile thermal cameras and public benchmarks validate our superior performance, achieving state-of-the-art results in both visual quality and quantitative metrics. More importantly, the thermal images enhanced by 3M-TI lead to substantial gains in critical downstream tasks like object detection and segmentation, underscoring its practical value for robust mobile thermal perception systems. More materials: https://github.com/work-submit/3MTI.
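The cross-modal self-attention idea can be illustrated with a minimal NumPy sketch. It assumes one plausible reading of the Figure 2(b) caption: the "rearrangement" before attention concatenates thermal and RGB latent tokens along the sequence axis, and the rearrangement after attention splits them back. The function names, the single-head attention, and the weight shapes are all hypothetical; the paper's actual layer may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    # Plain single-head self-attention over tokens of shape (B, N, C).
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def cross_modal_self_attention(thermal, rgb, w_q, w_k, w_v):
    # thermal, rgb: latent feature maps of shape (B, H, W, C).
    b, h, w, c = thermal.shape
    # Assumed "rearrangement in": flatten both maps to tokens and
    # concatenate them so every token can attend to both modalities.
    tokens = np.concatenate(
        [thermal.reshape(b, h * w, c), rgb.reshape(b, h * w, c)], axis=1
    )
    out = self_attention(tokens, w_q, w_k, w_v)
    # Assumed "rearrangement out": split back into per-modality maps.
    thermal_out = out[:, : h * w].reshape(b, h, w, c)
    rgb_out = out[:, h * w :].reshape(b, h, w, c)
    return thermal_out, rgb_out

rng = np.random.default_rng(0)
thermal = rng.normal(size=(1, 4, 4, 8))
rgb = rng.normal(size=(1, 4, 4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
t_out, r_out = cross_modal_self_attention(thermal, rgb, w_q, w_k, w_v)
```

In this sketch the attention weights depend only on feature similarity, not on pixel position, so a thermal token can draw on a matching RGB token wherever it lies in the reference image. That is the intuition behind dropping explicit calibration. A real diffusion UNet also carries positional information and multiple heads, which this toy version omits.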

Paper Structure

This paper contains 16 sections, 1 equation, 7 figures, 4 tables.

Figures (7)

  • Figure 1: A smartphone-based mobile imaging system integrating calibration-free and synchronization-free RGB and thermal cameras. The proposed 3M-TI method delivers superior thermal image quality compared with state-of-the-art restoration approaches.
  • Figure 2: Overview of the 3M-TI architecture. (a) 3M-TI framework. The core of 3M-TI is a one-step diffusion-based model equipped with a cross-modal self-attention module (CSM) and a misalignment augmentation strategy. LoRA fine-tuning is applied to both the UNet and the VAE decoder. (b) Cross-modal self-attention module (CSM). Two rearrangement layers are inserted before and after the original self-attention layers to capture cross-modal correspondences. (c) Misalignment augmentation. A data augmentation strategy designed to enhance model robustness against camera parallax and temporal misalignment between RGB and thermal inputs.
  • Figure 3: Qualitative comparison on our test set (zoom in for details). 3M-TI achieves the most faithful and visually consistent results, exhibiting sharp structures and accurate thermal patterns that best align with the GT.
  • Figure 4: Qualitative comparison on our real-world smartphone dataset (zoom in for details). 3M-TI exhibits remarkable generalization capability, producing sharp and faithful thermal details that closely align with RGB images.
  • Figure 5: Visualization of detection results. Green bounding boxes indicate correct detections; red bounding boxes indicate incorrect detections.
  • ...and 2 more figures
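
The misalignment augmentation named in the Figure 2(c) caption can be sketched as follows. The captions do not specify the transformations, so this example assumes the simplest stand-in: a random integer translation of the RGB reference relative to the thermal image, with edge pixels replicated to fill the gap. The names `random_shift` and `augment_pair` and the `max_shift` parameter are hypothetical.

```python
import numpy as np

def random_shift(image, max_shift, rng):
    # Translate an (H, W, C) image by a random integer offset (dy, dx),
    # each drawn from [-max_shift, max_shift]; borders replicate edge pixels.
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    h, w = image.shape[:2]
    pad = ((max_shift, max_shift), (max_shift, max_shift), (0, 0))
    padded = np.pad(image, pad, mode="edge")
    top, left = max_shift - dy, max_shift - dx
    return padded[top : top + h, left : left + w], (dy, dx)

def augment_pair(thermal, rgb, max_shift, rng):
    # Keep the thermal input fixed and misalign only the RGB reference,
    # so the model cannot rely on pixel-aligned guidance (assumed setup).
    shifted_rgb, offset = random_shift(rgb, max_shift, rng)
    return thermal, shifted_rgb, offset

rng = np.random.default_rng(0)
thermal = rng.random((32, 32, 1))
rgb = rng.random((32, 32, 3))
thermal_aug, rgb_aug, offset = augment_pair(thermal, rgb, max_shift=4, rng=rng)
```

A pure translation only approximates parallax, which in practice varies with scene depth, and it does not model scene motion between unsynchronized captures. Richer warps such as affine or perspective transforms would be natural extensions under the same interface.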