SIR-DIFF: Sparse Image Sets Restoration with Multi-View Diffusion Model
Yucheng Mao, Boyang Wang, Nilesh Kulkarni, Jeong Joon Park
TL;DR
SIR-Diff is a multi-view diffusion framework that jointly restores degraded images of the same scene while encouraging 3D consistency. It extends latent diffusion with a Spatial-3D Integrated UNet and a 3D self-attention transformer, so that information is fused across views and geometric coherence is preserved. On motion deblurring, sparse-view super-resolution, and downstream 3D tasks, SIR-Diff outperforms single-view and video-based baselines and generalizes zero-shot to unseen multi-view datasets. The results indicate that robust multi-view restoration benefits 3D reconstruction, pose estimation, and correspondence learning, and they point to cross-image attention and explicit 3D-consistency guarantees as directions for future work.
Abstract
The computer vision community has developed numerous techniques for digitally restoring true scene information from single-view degraded photographs, an important yet extremely ill-posed task. In this work, we tackle image restoration from a different perspective by jointly denoising multiple photographs of the same scene. Our core hypothesis is that degraded images capturing a shared scene contain complementary information that, when combined, better constrains the restoration problem. To this end, we implement a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Our experiments show that our multi-view approach outperforms existing single-view image and even video-based methods on image deblurring and super-resolution tasks. Critically, our model is trained to output 3D-consistent images, making it a promising tool for applications requiring robust multi-view integration, such as 3D reconstruction or pose estimation.
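To make the idea of extracting information from multi-view relationships concrete, the following is a minimal toy sketch, not the authors' architecture. It contrasts attention applied to each view separately with attention applied to the tokens of all views at once. The function names (`self_attention`, `per_view_attention`, `cross_view_attention`), the single head, and the identity query/key/value projections are all assumptions made for illustration; a real model would use learned projections on latent feature maps.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (assumption).

    tokens: list of feature vectors, each a list of floats of equal length.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in tokens:
        # Scaled dot-product score of this query against every key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def per_view_attention(views):
    """Baseline: each view attends only to its own tokens."""
    return [self_attention(view) for view in views]


def cross_view_attention(views):
    """Joint variant: tokens of all views are flattened into one sequence,
    so every token can attend to every token of every view.
    Assumes all views have the same number of tokens."""
    n = len(views[0])
    flat = [tok for view in views for tok in view]
    mixed = self_attention(flat)
    # Split the mixed sequence back into per-view chunks.
    return [mixed[i * n:(i + 1) * n] for i in range(len(views))]


if __name__ == "__main__":
    # Two views, two tokens each, feature dimension 2 (made-up numbers).
    views = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]
    print(per_view_attention(views)[0])
    print(cross_view_attention(views)[0])
```

In the per-view baseline, changing one view leaves the output for the other views untouched. In the joint variant every output token is a weighted mix over all views, which is the kind of coupling that lets a degraded view borrow detail from the others.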
