No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency

Cho-Ying Wu; Zixun Huang; Xinyu Huang; Liu Ren

TL;DR

This paper presents the first study of cross-sensor view synthesis across different modalities. The proposed confidence-aware densification and self-matching filtering improve view synthesis, and the synthesized views are then consolidated in 3D Gaussian Splatting (3DGS).

Abstract

We present the first study of cross-sensor view synthesis across different modalities. We examine a practical and fundamental, yet widely overlooked, problem: obtaining aligned RGB-X data. Most prior RGB-X work assumes that such aligned pairs already exist and focuses on modality fusion, but in practice producing them requires substantial engineering effort in calibration. We propose a match-densify-consolidate method. First, we perform RGB-X image matching, followed by guided point densification. With the proposed confidence-aware densification and self-matching filtering, we attain better view synthesis, and we then consolidate the synthesized views in 3D Gaussian Splatting (3DGS). Our method uses no 3D priors for the X-sensor and assumes only a nearly no-cost COLMAP run on the RGB images. We aim to remove the cumbersome calibration required for various RGB-X sensors and to advance cross-sensor learning with a scalable solution that breaks through the bottleneck in large-scale real-world RGB-X data collection.
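
To make the matching step more concrete, the following is a minimal sketch of how matched X-sensor intensities could be scattered onto the RGB pixel grid to form a semi-dense X-image with a per-pixel confidence map. It is not the authors' implementation: the function name `accumulate_matches`, its arguments, the single scalar confidence per match, and the rule that the most confident match wins when several land on one pixel are all assumptions made for illustration.

```python
import numpy as np


def accumulate_matches(rgb_xy, x_values, conf, height, width):
    """Scatter matched X-sensor intensities onto the RGB pixel grid.

    rgb_xy   : (N, 2) array of (col, row) match locations in the RGB image
    x_values : (N,) X-sensor intensities sampled at the matched X-image points
    conf     : (N,) matcher confidence per match (hypothetical scalar score)
    Returns a semi-dense X-image, a confidence map, and a validity mask.
    """
    rgb_xy = np.asarray(rgb_xy, dtype=np.float64)
    x_values = np.asarray(x_values, dtype=np.float32)
    conf = np.asarray(conf, dtype=np.float32)

    x_img = np.zeros((height, width), dtype=np.float32)
    c_map = np.zeros((height, width), dtype=np.float32)
    mask = np.zeros((height, width), dtype=bool)

    # Round match locations to the nearest pixel and drop out-of-bounds ones.
    cols = np.rint(rgb_xy[:, 0]).astype(int)
    rows = np.rint(rgb_xy[:, 1]).astype(int)
    inside = (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)
    cols, rows = cols[inside], rows[inside]
    vals, conf = x_values[inside], conf[inside]

    # Assumed tie-break: if several matches hit one pixel, keep the most
    # confident. Sort by descending confidence, then take the first
    # occurrence of each flat pixel index.
    order = np.argsort(-conf, kind="stable")
    flat = (rows * width + cols)[order]
    _, first = np.unique(flat, return_index=True)
    keep = order[first]

    x_img[rows[keep], cols[keep]] = vals[keep]
    c_map[rows[keep], cols[keep]] = conf[keep]
    mask[rows[keep], cols[keep]] = True
    return x_img, c_map, mask
```

The validity mask separates pixels that received a match from the empty pixels that a later densification step would have to fill.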

Paper Structure (13 sections, 7 equations, 7 figures, 7 tables)

This paper contains 13 sections, 7 equations, 7 figures, 7 tables.

Introduction
Related Work
RGB-X Task and Data Curation
Cross-Modal Image Matching
Methods
RGB-X Matching
Confidence-Aware Densification and Fusion
Self-Matching Filtering and 3D Consolidation
Experiments
RGB-Thermal
RGB-NIR
RGB-SAR
Conclusion

Figures (7)

Figure 1: Problem Setup. Given unpaired RGB-X images from sensors, the task is to synthesize X-images that are pixel-wise aligned with the RGB views for multi-modal applications. Traditional 3D approaches rely on complete 3D priors (including depth and the poses/intrinsics of both modalities) to align and render cross-sensor images. In contrast, our scalable framework removes these dependencies: it enables RGB-guided X-image synthesis without 3D priors for X, replacing per-sensor calibration and metric depth acquisition.
Figure 2: Homography warping assumes 3D planar structures and causes visible misalignment (statue areas) when the scene contains distinct fore-/background layers.
Figure 3: Method Overview. Our approach consists of three stages. In the first stage, we perform cross-modality feature matching to establish correspondences between RGB and X-images. The matched points are sampled and accumulated onto RGB views to produce semi-dense X-images $\mathcal{X}_m$ along with multi-level confidence maps $C_{m}$. In the second stage, we conduct RGB-guided densification to obtain dense X-images from RGB, with the semi-dense X-images as cues. Our Confidence-Aware Densification and Fusion module integrates the confidence maps from the image matching stage to guide the densification to concentrate on higher-confidence points and produce robust $\mathcal{X}_d$. In the final stage, our proposed self-matching mechanism further filters inconsistent patches, and the results are fed back into the densification stage for refinement. We then train RGB-X 3DGS using COLMAP-calibrated RGB views to consolidate both modalities into a unified 3D RGB-X radiance field, which improves multi-view consistency and further enables cross-sensor view synthesis.
Figure 4: Visual Results on METU-VisTIR-Cloudy. Our results show much clearer, sharper, and smoother surfaces in the rendered views.
Figure 5: Comparison on Temporal Consistency for Image Generation. StyleBooth [Han_2025_ICCV] generation of thermal images cannot guarantee temporal consistency due to inherent ambiguity, while our densification creates more consistent multi-view results. NIR is closer to the visible spectrum and thus easier to keep consistent, but the specialized method Pix2Next [jin2025pix2next] still cannot ensure the correct intensity. Image translation from the original domain suffers from inaccurate transformations, whereas our match-densify-consolidate approach uses information from the target domain as anchors for densification and achieves better results.
...and 2 more figures
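
As background for the planar assumption noted in the Figure 2 caption, the standard plane-induced homography from multi-view geometry is written out below. The symbols are not taken from the paper and are introduced only for illustration: $K$ and $K'$ are the intrinsics of the two cameras, $(R, t)$ maps points from the first camera frame to the second via $X' = RX + t$, and the scene plane is $n^\top X = d$ in the first camera frame, with unit normal $n$ and distance $d$.

$$
x' \sim H\,x, \qquad H = K' \left( R + \frac{t\, n^\top}{d} \right) K^{-1}
$$

The derivation uses $n^\top X / d = 1$, which holds only for points on the plane, so that $X' = RX + t\,(n^\top X / d)$. A point on a different depth layer does not satisfy this identity, and a single $H$ fitted to the background therefore maps foreground points to the wrong location, which is consistent with the misalignment the caption describes.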
