Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework

Yi-Ting Chen, Ting-Hsuan Liao, Pengsheng Guo, Alexander Schwing, Jia-Bin Huang

TL;DR

This work addresses the challenge of recovering high-resolution, geometrically consistent 3D scenes from low-resolution inputs by coupling diffusion-based 2D super-resolution with a 3D Gaussian splatting (3DGS) representation. The proposed 3DSR framework uses a diffusion prior to generate high-resolution (HR) views, then uses the 3DGS representation to enforce cross-view coherence, iteratively updating the latent representations to maintain 3D consistency. Evaluations on LLFF and MipNeRF360 show superior perceptual quality and improved 3D consistency (measured by MEt3R and FID) compared with image super-resolution (ISR), video super-resolution (VSR), and diffusion-based baselines, without fine-tuning diffusion models on video data. 3DSR yields sharper textures, fewer cross-view artifacts, and structurally faithful reconstructions, making it suitable for realistic novel view synthesis.

Abstract

We propose 3D Super Resolution (3DSR), a novel 3D Gaussian-splatting-based super-resolution framework that leverages off-the-shelf diffusion-based 2D super-resolution models. 3DSR encourages 3D consistency across views via the use of an explicit 3D Gaussian-splatting-based scene representation. This makes the proposed 3DSR different from prior work, such as image upsampling or the use of video super-resolution, which either don't consider 3D consistency or aim to incorporate 3D consistency implicitly. Notably, our method enhances visual quality without additional fine-tuning, ensuring spatial coherence within the reconstructed scene. We evaluate 3DSR on MipNeRF360 and LLFF data, demonstrating that it produces high-resolution results that are visually compelling, while maintaining structural consistency in 3D reconstructions.

Paper Structure

This paper contains 13 sections, 9 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Qualitative comparison on the LLFF dataset with a downsampling factor of $\times8$ and upsampling of $\times4$. We train 3DGS with low-resolution (LR) inputs and with 3DSR (ours) and render test views. The results demonstrate that our method yields fewer artifacts from 3D inconsistencies and better preserves overall structural integrity compared to the baselines.
  • Figure 2: Motivation: Diffusion-based super-resolution (SR) methods enhance details in high-resolution (HR) images but fail to maintain 3D consistency across views. Given a set of low-resolution (LR) images (a), we apply StableSR [wang2024exploiting] to generate super-resolved images (b). However, the SR results introduce hallucinated details, as seen in the distorted steel strings of the bike, which appear misaligned and inconsistent. When rendered with 3DGS in (c), these inconsistencies lead to blurring and incorrect geometry. In the second row, the SR output (b) significantly deviates from the ground truth (d), demonstrating the instability of diffusion-based SR. Across multiple views, the inconsistencies in hallucinated textures further degrade 3DGS rendering, resulting in incorrect and incoherent textures.
  • Figure 3: Overview of a single sampling step. (a) The diffusion-based super-resolution method takes the latent representation $x_t^i$ of the $i$-th image at time step $t$ and the encoded low-resolution (LR) image latent $E^i$ as inputs to predict the clean latent $\hat{x}_0^i$. (b) The predicted clean latent $\hat{x}_0^i$ is then decoded into a high-resolution (HR) image $H^i$, constrained by the LR image latents to ensure consistency. (c) The super-resolved images $H^i$ are subsequently used as inputs for a 3D Gaussian Splatting model pretrained on LR images. By leveraging the 3D representation, the rendered images $R^i$ are encouraged to exhibit improved 3D consistency, facilitating spatial fusion across different views. (d) After obtaining the rendered images $R^i$ from the 3D Gaussian Splatting model trained on SR images, they are encoded into the latent space as a 3D-consistent clean latent $\ddot{x}_0^i$. (e) The 3D-consistent latent $\ddot{x}_0^i$ is then input into the diffusion model along with the latent at time step $t$, $x_t^i$, to perform a denoising step, yielding the updated latent $x_{t-1}^i$.
  • Figure 4: Qualitative comparison on the LLFF dataset with a downsampling factor of $\times8$ and upsampling of $\times4$. We train 3DGS using different super-resolved (SR) images and render novel views. The results demonstrate that our method yields fewer artifacts from 3D inconsistencies and better preserves overall structural integrity compared to the baselines.
  • Figure 5: Qualitative comparison on the MipNeRF360 dataset with a downsampling factor of $\times16$ and upsampling of $\times4$. We train 3DGS using different super-resolved (SR) images and render novel views. To demonstrate the effectiveness of our method across diverse real-world scenarios, we present results on both indoor and outdoor scenes. Our results demonstrate that our approach preserves structural integrity and fine textures better than the baselines.
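
The data flow of one sampling step, as described in the Figure 3 caption, can be sketched as a toy NumPy loop. This is only an illustrative sketch under strong assumptions, not the authors' implementation: every function name below is hypothetical, the diffusion network, encoder, and decoder are replaced by trivial stand-ins, the 3DGS fit-and-render stage is replaced by averaging across views (as if every view saw the same flat content), and the denoising update in step (e) is assumed to be a deterministic DDIM-style step, which the captions do not specify.

```python
import numpy as np

def predict_clean_latent(x_t, lr_latent, alpha_bar_t):
    # (a) Hypothetical stand-in for the diffusion SR network:
    # blends a rescaled noisy latent with the LR conditioning latent E^i.
    return 0.5 * (x_t / np.sqrt(alpha_bar_t)) + 0.5 * lr_latent

def decode(latent):
    # (b) Hypothetical stand-in for the latent decoder (identity here).
    return latent.copy()

def encode(image):
    # (d) Hypothetical stand-in for the latent encoder (identity here).
    return image.copy()

def fit_and_render(hr_images):
    # (c) Toy replacement for fitting 3DGS on the SR images and rendering:
    # fuse all views into one shared estimate and return it for every view.
    fused = hr_images.mean(axis=0, keepdims=True)
    return np.repeat(fused, hr_images.shape[0], axis=0)

def ddim_step(x_t, x0, alpha_bar_t, alpha_bar_prev):
    # (e) Assumed deterministic DDIM-style update using a given clean latent.
    eps = (x_t - np.sqrt(alpha_bar_t) * x0) / np.sqrt(1.0 - alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0 + np.sqrt(1.0 - alpha_bar_prev) * eps

def sampling_step(x_t, lr_latents, alpha_bar_t, alpha_bar_prev):
    x0_hat = predict_clean_latent(x_t, lr_latents, alpha_bar_t)  # (a)
    hr_images = decode(x0_hat)                                   # (b)
    rendered = fit_and_render(hr_images)                         # (c)
    x0_3d = encode(rendered)                                     # (d)
    x_prev = ddim_step(x_t, x0_3d, alpha_bar_t, alpha_bar_prev)  # (e)
    return x_prev, x0_3d

rng = np.random.default_rng(0)
num_views, shape = 4, (8, 8)
# Toy LR latents: the same content seen from every view, plus small noise.
lr_latents = np.ones((num_views, *shape)) + 0.1 * rng.normal(size=(num_views, *shape))
x = rng.normal(size=(num_views, *shape))   # initial noisy latents x_T^i
alpha_bars = np.linspace(0.05, 0.999, 10)  # toy schedule, noisy to nearly clean

for t in range(len(alpha_bars) - 1):
    x, x0_3d = sampling_step(x, lr_latents, alpha_bars[t], alpha_bars[t + 1])
```

In this toy setting the fused clean latent is identical across views by construction. A real 3DGS stage would instead render each view from its own camera pose, so consistency would come from the shared scene geometry rather than from averaging.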