PIS3R: Very Large Parallax Image Stitching via Deep 3D Reconstruction

Muhua Zhu, Xinhao Jin, Chengbo Wang, Yongcong Zhang, Yifei Xue, Tie Ji, Yizhen Lao

TL;DR

PIS3R addresses the failure of traditional image stitching under very large parallax by leveraging a deep 3D reconstruction framework (VGGT) to recover camera parameters and dense scene geometry, followed by per-pixel point-cloud reprojection to generate an initial, geometry-consistent stitch. A point-conditioned diffusion model (RDDM) refines the reprojected image to fill holes and suppress artifacts while preserving 3D structure, enabling direct use in downstream 3D vision tasks like SfM and SLAM. The approach is validated on synthetic, real-world, and large-parallax datasets, with ablations showing the superiority of VGGT for reconstruction and RDDM for refinement; running-time analyses indicate practical performance. Overall, PIS3R offers a significant step forward for very large parallax stitching by ensuring geometric fidelity alongside visual quality, expanding applicability to 3D reconstruction pipelines and related applications.

Abstract

Image stitching aims to align two images taken from different viewpoints into one seamless, wider image. However, when the 3D scene contains depth variations and the camera baseline is significant, noticeable parallax occurs, meaning that the relative positions of scene elements differ substantially between views. Most existing stitching methods struggle to handle such images with large parallax effectively. To address this challenge, in this paper, we propose an image stitching solution called PIS3R that is robust to very large parallax, based on the novel concept of deep 3D reconstruction. First, we apply the Visual Geometry Grounded Transformer (VGGT) to two input images with very large parallax to obtain both intrinsic and extrinsic camera parameters, as well as a dense 3D scene reconstruction. Subsequently, we reproject the reconstructed dense point cloud onto a designated reference view using the recovered camera parameters, achieving pixel-wise alignment and generating an initial stitched image. Finally, to further address potential artifacts such as holes or noise in the initial stitching, we propose a point-conditioned image diffusion module to obtain the refined result. Compared with existing methods, our solution tolerates very large parallax and also provides results that fully preserve the geometric integrity of all pixels in the 3D photogrammetric context, enabling direct applicability to downstream 3D vision tasks such as SfM. Experimental results demonstrate that the proposed algorithm provides accurate stitching results for images with very large parallax, and outperforms the existing methods qualitatively and quantitatively.
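
The reprojection step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a pinhole camera with intrinsics `K`, a world-to-camera pose `(R, t)`, nearest-pixel splatting, and a simple nearest-point depth test for occlusions, none of which are specified in the abstract. The function name `reproject_points` and all parameter names are hypothetical. The returned hole mask marks pixels that received no point, which is the kind of gap a refinement stage would later need to fill.

```python
import numpy as np

def reproject_points(points_w, colors, K, R, t, height, width):
    """Project colored world-space points into a pinhole camera (illustrative only).

    points_w: (N, 3) world coordinates; colors: (N, 3) uint8 values.
    K: (3, 3) intrinsics; R (3, 3) and t (3,) map world to camera coordinates.
    Returns (image, hole_mask); hole_mask is True where no point landed.
    """
    # World -> camera coordinates (assumed convention: x_c = R x_w + t).
    pts_c = points_w @ R.T + t
    z = pts_c[:, 2]

    # Keep only points in front of the camera.
    front = z > 1e-6
    pts_c, z, colors = pts_c[front], z[front], colors[front]

    # Perspective projection to integer pixel coordinates.
    uvw = pts_c @ K.T
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)

    # Discard points that fall outside the image.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]

    # Depth test: for each pixel, keep the point with the smallest depth.
    pix = v * width + u
    order = np.lexsort((z, pix))            # sort by pixel, then by depth
    _, first = np.unique(pix[order], return_index=True)
    nearest = order[first]                  # nearest point for each hit pixel

    flat_image = np.zeros((height * width, 3), dtype=np.uint8)
    flat_holes = np.ones(height * width, dtype=bool)
    flat_image[pix[nearest]] = colors[nearest]
    flat_holes[pix[nearest]] = False
    return flat_image.reshape(height, width, 3), flat_holes.reshape(height, width)
```

Sorting by pixel index and then by depth makes the occlusion handling deterministic, which a plain fancy-index assignment with repeated pixel indices would not guarantee.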

Paper Structure

This paper contains 27 sections, 16 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: The relationship between the quality of the stitched image and parallax variation. Our PIS3R method maintains significantly higher stability than UDIS++ across all three metrics (PSNR, SSIM, and LPIPS). As the parallax level $P_L$ (Sec. B, Appendix) increases, stitched image quality from UDIS++ degrades substantially, whereas PIS3R demonstrates superior robustness to parallax variations. Moreover, PIS3R delivers comparable visual quality to state-of-the-art methods under pure rotation and slight parallax.
  • Figure 2: Overview of the PIS3R pipeline. Given a sparse image set, we first construct a dense point cloud representation $P_{w}$ using a feed-forward deep 3D reconstruction model. Subsequently, the point cloud is reprojected onto the input camera poses $p_i$ to generate preliminary stitching images $\hat{S}_i$. While these images preserve the majority of the scene's structural information, they also introduce substantial noise artifacts. To address this limitation, we train a point-conditioned denoising diffusion model to restore image fidelity, ultimately producing refined stitching images $S_i$.
  • Figure 3: Visual Comparison of 3D Reconstruction Results between COLMAP and VGGT. Reconstruction is performed under three distinct camera pose scenarios: pure rotation, slight parallax, and very large parallax. It can be observed that COLMAP produces significantly sparser point clouds across all scenarios, capturing limited structural information. In contrast, VGGT generates dense point clouds that preserve the majority of scene details.
  • Figure 4: Qualitative comparison of different image restoration diffusion models. We evaluated the performance of several state-of-the-art image restoration diffusion models in refining reprojected images, including DiffIR (Xia et al., 2023), DDRM (Kawar et al., 2022), and RDDM (Liu et al., 2024).
  • Figure 5: The results of MetaShape reconstruction using stitched images. When used as inputs for 3D reconstruction in MetaShape, images stitched with APAP and ELA resulted in a large number of surface holes. Images stitched with UDIS++ and OBJ-GSP exhibited significant distortions of the geometric structure. In contrast, reconstructions generated from our stitched images consistently exhibited high geometric consistency and completeness within MetaShape. This enhanced structural integrity suggests that PIS3R is more suitable for downstream 3D reconstruction tasks.
  • ...and 5 more figures