SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting

Xiang Feng, Xiangbo Wang, Tieshi Zhong, Chengkai Wang, Yiting Zhao, Tianxiang Xu, Zhenzhong Kuang, Feiwei Qin, Xuefei Yin, Yanming Zhu

TL;DR

This work reformulates 3DSR as a direct feed-forward mapping from sparse LR views to HR 3DGS representations, enabling the model to autonomously learn 3D-specific high-frequency geometry and appearance from large-scale, multi-scene data.

Abstract

3D super-resolution (3DSR) aims to reconstruct high-resolution (HR) 3D scenes from low-resolution (LR) multi-view images. Existing methods rely on dense LR inputs and per-scene optimization, which restricts the high-frequency priors for constructing HR 3D Gaussian Splatting (3DGS) to those inherited from pretrained 2D super-resolution (2DSR) models. This severely limits reconstruction fidelity, cross-scene generalization, and real-time usability. We propose to reformulate 3DSR as a direct feed-forward mapping from sparse LR views to HR 3DGS representations, enabling the model to autonomously learn 3D-specific high-frequency geometry and appearance from large-scale, multi-scene data. This fundamentally changes how 3DSR acquires high-frequency knowledge and enables robust generalization to unseen scenes. Specifically, we introduce SR3R, a feed-forward framework that directly predicts HR 3DGS representations from sparse LR views via the learned mapping network. To further enhance reconstruction fidelity, we introduce Gaussian offset learning and feature refinement, which stabilize reconstruction and sharpen high-frequency details. SR3R is plug-and-play and can be paired with any feed-forward 3DGS reconstruction backbone: the backbone provides an LR 3DGS scaffold, and SR3R upscales it to an HR 3DGS. Extensive experiments across three 3D benchmarks demonstrate that SR3R surpasses state-of-the-art (SOTA) 3DSR methods and achieves strong zero-shot generalization, even outperforming SOTA per-scene optimization methods on unseen scenes.

Paper Structure (22 sections, 9 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: We reformulate 3DGS-based 3DSR as a feed-forward mapping problem from sparse LR views to an HR 3DGS representation. (a) Unlike existing methods that rely on dense multi-view inputs and per-scene 3DGS self-optimization, our method directly predicts HR 3DGS with a learned network from as few as two LR views. (b) This reformulation fundamentally changes how 3DSR acquires high-frequency knowledge. Instead of inheriting the limited priors embedded in 2DSR models, our SR3R learns a generalized cross-scene mapping function from large-scale multi-scene data, enabling the network to autonomously acquire the 3D-specific high-frequency structures required for accurate HR 3DGS reconstruction. The bottom row illustrates that our SR3R produces significantly sharper and more faithful reconstructions.
  • Figure 2: Overview of the SR3R framework. Given two LR input views, a feed-forward 3DGS backbone produces an LR 3DGS, which is then densified via Gaussian Shuffle Split to form a structural scaffold. The LR views are upsampled and processed by our mapping network: a ViT encoder with feature refinement integrates LR 3DGS-aware cues, and a ViT decoder performs cross-view fusion. The Gaussian offset learning module then predicts residual offsets to the dense scaffold, yielding the final HR 3DGS for high-fidelity rendering.
  • Figure 3: Qualitative comparison with SOTA feed-forward 3DGS reconstruction methods on Re10k (top three) and ACID (bottom three) datasets. SR3R delivers significantly sharper details and more stable geometry than DepthSplat, NoPoSplat, and their upsampled variants, consistently improving reconstruction quality across different 3DGS backbones under sparse LR inputs.
  • Figure 4: Qualitative ablation results of SR3R components. Each component of SR3R progressively improves reconstruction quality, with upsampling reducing coarse blur, cross-attention improving feature alignment, Gaussian offset learning enhancing local geometry, and PTv3 yielding the sharpest and most consistent results.
  • Figure S1: Detailed Gaussian Offset Learning pipeline. Each Gaussian center is projected to the image plane to query local ViT features. The queried token is fused with a geometry-aware position embedding and processed by PTv3 blocks for spatial reasoning. A lightweight Gaussian Head predicts residual offsets to refine the initial 3DGS template.
  • ...and 4 more figures
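
The Figure S1 caption describes projecting each Gaussian center to the image plane, querying local features there, and predicting a residual offset. The following is a minimal NumPy sketch of that idea, not the authors' implementation. It assumes a pinhole camera, bilinear interpolation for the feature query, and a fixed linear map standing in for the PTv3 blocks and Gaussian Head. All names (`project_points`, `sample_features`, `head_weights`) and all numeric values are hypothetical.

```python
import numpy as np

def project_points(centers, K, R, t):
    """Project world-space points (N, 3) to pixel coordinates (N, 2) with a pinhole model."""
    cam = centers @ R.T + t            # world -> camera coordinates
    depth = cam[:, 2]
    uvw = cam @ K.T                    # apply camera intrinsics
    uv = uvw[:, :2] / depth[:, None]   # perspective divide
    return uv, depth

def sample_features(feature_map, uv):
    """Bilinearly sample an (H, W, C) feature map at continuous pixel coordinates (N, 2)."""
    H, W, _ = feature_map.shape
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = feature_map[v0, u0] * (1 - du) + feature_map[v0, u1] * du
    bottom = feature_map[v1, u0] * (1 - du) + feature_map[v1, u1] * du
    return top * (1 - dv) + bottom * dv

# Toy camera: identity pose, focal length 100 px, principal point (32, 24).
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)

# Two Gaussian centers from a (hypothetical) dense scaffold.
centers = np.array([[0.0, 0.0, 2.0],
                    [0.1, -0.05, 2.0]])

# Toy feature map whose two channels store the pixel's (u, v) coordinates.
H, W = 48, 64
vv, uu = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
feature_map = np.stack([uu, vv], axis=-1).astype(float)

uv, depth = project_points(centers, K, R, t)
feats = sample_features(feature_map, uv)

# Stand-in for the learned head: a fixed linear map from features to 3D offsets.
head_weights = np.array([[1e-4, 0.0, 0.0],
                         [0.0, 1e-4, 0.0]])
offsets = feats @ head_weights
refined_centers = centers + offsets    # residual update of the scaffold centers
```

Predicting residual offsets rather than absolute positions keeps the refined Gaussians close to the scaffold, which is consistent with the caption's description of refining an initial 3DGS template.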