SRSplat: Feed-Forward Super-Resolution Gaussian Splatting from Sparse Multi-View Images

Xinyuan Hu, Changyue Shi, Chuxiao Yang, Minghao Chen, Jiajun Ding, Tao Wei, Chen Wei, Zhou Yu, Min Tan

TL;DR

SRSplat tackles high-quality 3D reconstruction from sparse, low-resolution (LR) views by combining external scene knowledge with internal texture cues. It builds a scene-specific reference gallery using multimodal large language models and diffusion priors, fuses LR and reference features with the Reference-Guided Feature Enhancement (RGFE) module, and recovers texture detail through Texture-Aware Density Control (TADC). A depth-guided Gaussian decoder predicts the Gaussian primitives, and a texture richness perceptron drives adaptive density adjustment. On RealEstate10K, ACID, and DTU, SRSplat achieves state-of-the-art results with robust cross-dataset and cross-resolution generalization, while keeping inference feed-forward and real-time for practical deployment.

Abstract

Feed-forward 3D reconstruction from sparse, low-resolution (LR) images is a crucial capability for real-world applications, such as autonomous driving and embodied AI. However, existing methods often fail to recover fine texture details. This limitation stems from the inherent lack of high-frequency information in LR inputs. To address this, we propose SRSplat, a feed-forward framework that reconstructs high-resolution 3D scenes from only a few LR views. Our main insight is to compensate for the deficiency of texture information by jointly leveraging external high-quality reference images and internal texture cues. We first construct a scene-specific reference gallery, generated for each scene using Multimodal Large Language Models (MLLMs) and diffusion models. To integrate this external information, we introduce the Reference-Guided Feature Enhancement (RGFE) module, which aligns and fuses features from the LR input images and their reference twin images. Subsequently, we train a decoder to predict the Gaussian primitives using the multi-view fused features obtained from RGFE. To further refine the predicted Gaussian primitives, we introduce Texture-Aware Density Control (TADC), which adaptively adjusts Gaussian density based on the internal texture richness of the LR inputs. Extensive experiments demonstrate that SRSplat outperforms existing methods on various datasets, including RealEstate10K, ACID, and DTU, and exhibits strong cross-dataset and cross-resolution generalization capabilities.

Paper Structure

This paper contains 14 sections, 12 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: In this work, we propose SRSplat, a novel feed-forward framework that reconstructs high-quality 3D scenes with only sparse and LR input views. SRSplat demonstrates superior performance and is capable of handling low-resolution, sparse-view inputs and real-time reconstruction, thereby offering greater functionality and practicality in realistic applications.
  • Figure 2: Framework of SRSplat. Our method takes LR images and their corresponding reference images as inputs. The RGFE module first extracts multi-scale features and then fuses them. After the Gaussian primitives are decoded, TADC adaptively adjusts Gaussian density according to the texture richness produced by a texture richness perceptron.
  • Figure 3: Pipeline of reference gallery generation. Given LR input images for each scene, the MLLM produces semantic descriptions. Subsequently, the diffusion model uses these descriptions to generate reference images tailored to the scene.
  • Figure 4: Reference gallery examples. Reference twin images share details similar to the LR images.
  • Figure 5: Error maps show the intensity of inconsistency between the rendered image and the ground truth. We observe that regions with high texture richness are often under-optimized. Therefore, we propose TADC to dynamically control the density of Gaussians according to texture richness.
  • ...and 3 more figures
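
The idea behind TADC in the Figure 2 and Figure 5 captions (more Gaussians where texture is rich) can be illustrated with a small sketch. This is not the authors' implementation: the paper uses a learned texture richness perceptron, whereas the code below substitutes a hand-crafted proxy (local standard deviation in a 3x3 window). The function names `texture_richness` and `gaussians_per_pixel`, and the parameters `win`, `base`, and `extra`, are hypothetical and chosen only for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def texture_richness(gray: np.ndarray, win: int = 3) -> np.ndarray:
    """Proxy richness map in [0, 1]: local standard deviation per pixel.

    Assumption: this stands in for the learned texture richness
    perceptron described in the paper; it is not the paper's model.
    """
    pad = win // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    # windows has shape (H, W, win, win): one patch per output pixel
    windows = sliding_window_view(padded, (win, win))
    std = windows.std(axis=(-2, -1))
    peak = std.max()
    return std / peak if peak > 0 else np.zeros_like(std)


def gaussians_per_pixel(richness: np.ndarray, base: int = 1, extra: int = 3) -> np.ndarray:
    """Allocate `base` Gaussians everywhere, plus up to `extra` in rich regions."""
    return base + np.rint(extra * richness).astype(int)


# Toy LR image: flat left half, checkerboard texture on the right half
img = np.zeros((8, 8))
img[:, 4:] = np.indices((8, 4)).sum(axis=0) % 2

counts = gaussians_per_pixel(texture_richness(img))
print(counts)
```

In this toy example the flat region keeps the base allocation while the checkerboard region receives extra Gaussians, which mirrors the observation in Figure 5 that texture-rich regions tend to be under-optimized when density is uniform.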