SIR-DIFF: Sparse Image Sets Restoration with Multi-View Diffusion Model

Yucheng Mao, Boyang Wang, Nilesh Kulkarni, Jeong Joon Park

TL;DR

SIR-Diff introduces a multi-view diffusion framework that jointly restores degraded images from multiple views of the same scene while enforcing 3D consistency. By extending latent diffusion with a Spatial-3D Integrated UNet and a 3D self-attention transformer, the model effectively fuses information across views and preserves geometric coherence. Across motion deblurring, sparse-view super-resolution, and downstream 3D tasks, SIR-Diff outperforms single-view and video-based baselines, and demonstrates strong zero-shot generalization to unseen multi-view datasets. The work highlights the practical impact of robust multi-view restoration for 3D reconstruction, pose estimation, and correspondence learning, suggesting future directions in cross-image attention and explicit 3D-consistency guarantees.
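The cross-view fusion described above can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes that the 3D self-attention treats the latent tokens of all views as one joint sequence, and it uses identity query/key/value projections and a single head for brevity. The function names (`self_attention`, `joint_view_attention`, `per_view_attention`) and the toy `views` data are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (assumption)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        # Scaled dot-product score of this query against every token.
        weights = softmax(
            [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        )
        # Weighted average of the value vectors (here, the tokens themselves).
        out.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return out


def joint_view_attention(views):
    """Attend over the tokens of all views at once, then split back per view."""
    tokens_per_view = len(views[0])
    flat = [tok for view in views for tok in view]
    attended = self_attention(flat)
    return [
        attended[i:i + tokens_per_view]
        for i in range(0, len(flat), tokens_per_view)
    ]


def per_view_attention(views):
    """Single-view baseline: each view attends only to its own tokens."""
    return [self_attention(view) for view in views]


# Two toy "views", each with two 2-dimensional latent tokens.
views = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.5, -0.5]],
]
joint = joint_view_attention(views)
single = per_view_attention(views)
```

In the joint variant, the output for the first view changes whenever the second view changes, which is the mechanism by which complementary information can flow between degraded views. In the per-view variant, each output depends only on its own view.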

Abstract

The computer vision community has developed numerous techniques for digitally restoring true scene information from single-view degraded photographs, an important yet extremely ill-posed task. In this work, we tackle image restoration from a different perspective by jointly denoising multiple photographs of the same scene. Our core hypothesis is that degraded images capturing a shared scene contain complementary information that, when combined, better constrains the restoration problem. To this end, we implement a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Our experiments show that our multi-view approach outperforms existing single-view image methods and even video-based methods on image deblurring and super-resolution tasks. Critically, our model is trained to output 3D-consistent images, making it a promising tool for applications requiring robust multi-view integration, such as 3D reconstruction or pose estimation.

Paper Structure

This paper contains 41 sections, 2 equations, 19 figures, 13 tables.

Figures (19)

  • Figure 1: Sparse View Image Restoration. Our diffusion model takes multi-view images and jointly enhances their visual quality while maintaining 3D consistency. (Top) Four motion-blurred input images are processed by our method, resulting in sharp outputs that significantly outperform single-view restoration methods as shown in the corresponding purple boxes. (Bottom) Our method can consistently restore multi-view images (4 out of 50 shown), leading to accurate 3D reconstructions.
  • Figure 2: Method Overview. Our model is a UNet-based latent diffusion model that takes as input a degraded sparse image set and outputs a set of restored, consistent images. The core of our approach is a diffusion model that jointly denoises image latents across multiple views by relying on our spatial-3D ResNet and 3D Self-Attn transformer. We use pre-trained encoder-decoders from SD2.1 [rombach2022high] and train our denoising UNet on those encoded latents. We show view-consistent results for the tasks of deblurring and super-resolution.
  • Figure 3: Qualitative Comparisons on Motion Deblurring and Super-Resolution. Motion deblurring results are shown on the left and super-resolution results on the right. We refer readers to the supplementary material for more visualizations. Zoom in for the best view.
  • Figure 4: Feature Matching Downstream Application. Low-resolution inputs and single-image SR models like OSEDiff [wu2024one] yield fewer correspondence points than SIR-Diff, which restores multi-view images consistently. SIR-Diff recovers the highest number of correspondences ($310$) among the compared restoration methods.
  • Figure 5: BAD-GS [zhao2024bad] with SIR-Diff. We show the effect of using SIR-Diff to help BAD-GS recover from catastrophic failure. The BAD-GS output alone (Top) has a very high LPIPS$\downarrow$ of 0.59, whereas the BAD-GS output with image restoration using SIR-Diff (Bottom) has an LPIPS$\downarrow$ of 0.22, a significantly higher-quality 3D reconstruction.
  • ...and 14 more figures