SEAR: Simple and Efficient Adaptation of Visual Geometric Transformers for RGB+Thermal 3D Reconstruction

Vsevolod Skorokhodov, Chenghao Xu, Shuo Sun, Olga Fink, Malcolm Mielle

Abstract

Foundational feed-forward visual geometry models enable accurate and efficient camera pose estimation and scene reconstruction by learning strong scene priors from massive RGB datasets. However, their effectiveness drops when they are applied to mixed sensing modalities, such as RGB-thermal (RGB-T) images. We observe that while a visual geometry grounded transformer pretrained on RGB data generalizes well to thermal-only reconstruction, it struggles to align the RGB and thermal modalities when they are processed jointly. To address this, we propose SEAR, a simple yet efficient fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs. Despite being trained on a relatively small RGB-T dataset, our approach outperforms state-of-the-art methods for 3D reconstruction and camera pose estimation, with significant improvements on all metrics (e.g., over 29% in AUC@30), and delivers higher detail and consistency between modalities with negligible inference-time overhead compared to the original pretrained model. Notably, SEAR enables reliable multimodal pose estimation and reconstruction even under challenging conditions, such as low lighting and dense smoke. We validate our architecture through extensive ablation studies that demonstrate how the model aligns the two modalities. Additionally, we introduce a new dataset featuring RGB and thermal sequences captured at different times, viewpoints, and illumination conditions, providing a robust benchmark for future work in multimodal 3D scene reconstruction. Code and models are publicly available at https://www.github.com/Schindler-EPFL-Lab/SEAR.

Paper Structure

This paper contains 30 sections, 1 equation, 16 figures, 17 tables.

Figures (16)

  • Figure 1: SEAR architecture. RGB and thermal images are first tokenized using DINOv2. For each modality, camera-specific tokens are concatenated with the corresponding DINO tokens. The combined tokens are then processed by an Alternating-Attention (AA) module with LoRA adapters. Finally, the refined tokens are passed to separate prediction heads for camera parameter estimation and depth estimation. Trainable parameters are highlighted with a flame symbol.
  • Figure 2: The SEAR dataset includes 9 scenes, each with paired RGB-thermal images captured along two distinct trajectories. Ground-truth poses (red/blue for each trajectory) are estimated via VGGT on all RGB images. The top 6 scenes feature trajectories under similar lighting, while the bottom 3 have large lighting variations (some RGB images are almost completely black).
  • Figure 3: Qualitative results comparing RGB/thermal reconstructions (camera poses in red/blue). Rows 1, 3, and 5 show results for 4 methods; rows 2, 4, and 6 zoom in on notable reconstruction details. Our method (SEAR) achieves higher accuracy, consistency, and level of detail than the other methods.
  • Figure 4: Reconstructions from the SmokeSeer3D dataset (dense smoke, top) and our new dataset's scenes (lighting changes, middle and bottom rows). The first column shows cases where RGB images (right) are unreliable for localization, so thermal images (left) are used. Our method recovers the scene even with smoke (SmokeSeer3D) and aligns RGB and thermal camera poses in different lighting conditions (our dataset).
  • Figure 5: AUC, RRA, and RTA (error thresholds of $30^\circ$, $15^\circ$, and $5^\circ$) across varying thermal-to-RGB image ratios. The filled area spans the $0.25$- to $0.75$-quantiles estimated by bootstrapping scenes $2000$ times.
  • ...and 11 more figures
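
The quantile band described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes the bootstrapped statistic is the mean of a per-scene metric; the caption does not state which statistic is resampled, so this is a guess and not the authors' procedure. The function name `bootstrap_quantile_band` and the example scores are hypothetical and invented purely for illustration.

```python
import random
import statistics


def bootstrap_quantile_band(scene_scores, n_resamples=2000,
                            lower=0.25, upper=0.75, seed=0):
    """Return the (lower, upper) quantiles of the bootstrapped mean over scenes.

    Assumption: the resampled statistic is the mean of the per-scene scores.
    """
    rng = random.Random(seed)
    n = len(scene_scores)
    means = []
    for _ in range(n_resamples):
        # Resample scenes with replacement, keeping the sample size fixed.
        resample = [scene_scores[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q * (len(means) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(means) - 1)
        return means[lo] + (pos - lo) * (means[hi] - means[lo])

    return quantile(lower), quantile(upper)


# Hypothetical per-scene scores for one thermal-to-RGB ratio (made-up values).
scores = [0.61, 0.72, 0.58, 0.80, 0.67, 0.75, 0.49, 0.70, 0.66]
low, high = bootstrap_quantile_band(scores)
print(f"0.25-quantile: {low:.3f}, 0.75-quantile: {high:.3f}")
```

Repeating this for each thermal-to-RGB ratio and each metric would give the lower and upper edges of a filled area like the one in the figure. Resampling whole scenes, not individual images, keeps the images of one scene together, which matches the caption's wording "bootstrapping scenes".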