MVGSR: Multi-View Consistent 3D Gaussian Super-Resolution via Epipolar Guidance

Kaizhe Zhang, Shinan Chen, Qian Zhao, Weizhan Zhang, Caixia Yan, Yudeng Xin

TL;DR

3D Gaussian Splatting (3DGS) struggles to render high-resolution views from low-resolution inputs. The authors propose MVGSR, a multi-view SR framework that uses camera-pose-based auxiliary view selection and an epipolar-constrained multi-view attention mechanism to fuse information across views for high-frequency detail and geometric consistency. The SR network combines multi-view features with a single-image prior and uses a sub-pixel, anti-aliased loss to supervise 3DGS rendering. Experiments on NeRF Synthetic, Tanks & Temples, and Mip-NeRF 360 demonstrate state-of-the-art performance on object-centric and scene-level 3DGS SR benchmarks, with improved cross-view consistency and detail fidelity. MVGSR enables high-resolution novel view synthesis (HRNVS) on arbitrarily organized multi-view data without strict temporal continuity or view ordering, which is a practical benefit for real-world multi-view capture.

Abstract

Scenes reconstructed by 3D Gaussian Splatting (3DGS) trained on low-resolution (LR) images are unsuitable for high-resolution (HR) rendering. Consequently, a 3DGS super-resolution (SR) method is needed to bridge LR inputs and HR rendering. Early 3DGS SR methods rely on single-image SR networks, which lack cross-view consistency and fail to fuse complementary information across views. More recent video-based SR approaches attempt to address this limitation but require strictly sequential frames, limiting their applicability to unstructured multi-view datasets. In this work, we introduce Multi-View Consistent 3D Gaussian Splatting Super-Resolution (MVGSR), a framework that focuses on integrating multi-view information for 3DGS rendering with high-frequency details and enhanced consistency. We first propose an Auxiliary View Selection Method based on camera poses, making our method adaptable for arbitrarily organized multi-view datasets without the need for temporal continuity or data reordering. Furthermore, we introduce, for the first time, an epipolar-constrained multi-view attention mechanism into 3DGS SR, which serves as the core of our proposed multi-view SR network. This design enables the model to selectively aggregate consistent information from auxiliary views, enhancing the geometric consistency and detail fidelity of 3DGS representations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both object-centric and scene-level 3DGS SR benchmarks.
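
The abstract says only that auxiliary views are selected from camera poses; it does not give the criterion. The sketch below is therefore an illustration under stated assumptions, not the authors' method: it assumes COLMAP-style world-to-camera poses (R, t) and ranks candidate views by a hypothetical score that adds camera-center distance to a weighted viewing-direction angle. The function names, the score, and the parameters k and w_angle are all invented for this example.

```python
import math

def camera_center_and_forward(R, t):
    """For a world-to-camera pose (x_cam = R @ x_world + t), return the
    camera center C = -R^T t and the forward (z) axis in world coordinates."""
    C = [-sum(R[k][i] * t[k] for k in range(3)) for i in range(3)]
    forward = list(R[2])  # third row of R is the camera z-axis in world frame
    return C, forward

def select_auxiliary_views(poses, target, k=2, w_angle=1.0):
    """Return indices of the k views nearest to `target` under a
    hypothetical score: center distance + w_angle * viewing angle (radians).
    The input order of `poses` is irrelevant, so no temporal ordering is needed."""
    C_t, f_t = camera_center_and_forward(*poses[target])
    scored = []
    for i, (R, t) in enumerate(poses):
        if i == target:
            continue
        C_i, f_i = camera_center_and_forward(R, t)
        dist = math.dist(C_t, C_i)
        cos = sum(a * b for a, b in zip(f_t, f_i))
        angle = math.acos(max(-1.0, min(1.0, cos)))  # clamp for numerical safety
        scored.append((dist + w_angle * angle, i))
    scored.sort()
    return [i for _, i in scored[:k]]

# Usage: four cameras with identical orientation, listed in arbitrary order.
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
poses = [(I, [0, 0, 0]), (I, [-1, 0, 0]), (I, [-5, 0, 0]), (I, [-0.5, 0, 0])]
print(select_auxiliary_views(poses, target=0, k=2))  # [3, 1]
```

Because the ranking depends only on pose geometry, shuffling the input images changes nothing but the returned indices, which is the property the abstract highlights for unstructured multi-view datasets.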

Paper Structure

This paper contains 15 sections, 9 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: An overview of the proposed MVGSR pipeline. Given a set of LR images and their corresponding camera poses estimated via COLMAP, we first select auxiliary views based on camera pose. The selected auxiliary views, together with the target LR image, are fed into a multi-view SR network. It employs an epipolar-constrained multi-view attention mechanism to extract consistent and complementary high-frequency details from the auxiliary views. The resulting super-resolved images, together with the original LR images, are then used to jointly train the 3DGS.
  • Figure 2: The architecture of the Multi-View SR Network. The network consists of the MVFE Module, the SIP Module, and the MSFF Module. The MVFE takes the LR target and auxiliary images as input and extracts multi-view features. It consists of 3 RET blocks at different scales, each integrated with an EST module that employs epipolar-constrained multi-view attention. The SIP module supplies a single-image deep prior, and the target image is restored by fully fusing this single-image feature with the multi-view feature.
  • Figure 3: Epipolar-Constrained Multi-View Attention
  • Figure 4: Qualitative comparisons on the NeRF Synthetic dataset (×4). MVGSR produces more visually appealing results, successfully capturing high-frequency details and textures. Best viewed on screen.
  • Figure 5: Qualitative comparisons on the Tanks & Temples dataset for the 240×135 → 960×540 (×4) task. MVGSR consistently restores coherent structures and intricate details. Best viewed on screen.
  • ...and 2 more figures
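
The captions for Figures 1 to 3 refer to an epipolar-constrained multi-view attention mechanism without stating its formula. As background, the standard two-view epipolar constraint and one common way to turn it into an attention mask can be written as follows. This is a sketch, not the paper's formulation: the relative pose (R, t) mapping target-camera to auxiliary-camera coordinates, the intrinsics K and K', the pixel threshold τ, and the hard-mask form M are assumptions introduced here.

```latex
% Fundamental matrix from relative pose (X' = R X + t) and intrinsics K, K'
F = K'^{-\top} \, [\mathbf{t}]_{\times} \, R \, K^{-1}

% A target pixel x (homogeneous) and its true match x' in the auxiliary view satisfy
\mathbf{x}'^{\top} F \, \mathbf{x} = 0

% Epipolar line of x in the auxiliary view, and point-to-line distance of x' = (u', v', 1)
\mathbf{l}' = F\mathbf{x} = (a, b, c)^{\top}, \qquad
d(\mathbf{x}', \mathbf{l}') = \frac{|a u' + b v' + c|}{\sqrt{a^{2} + b^{2}}}

% Hypothetical hard mask restricting attention to a band around the epipolar line
M(\mathbf{x}, \mathbf{x}') =
\begin{cases}
1, & d(\mathbf{x}', \mathbf{l}') \le \tau \\
0, & \text{otherwise}
\end{cases}
```

Under this reading, a query at target pixel x would attend only to auxiliary-view keys with M = 1, which is how an epipolar constraint can filter out geometrically inconsistent matches.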