Sequence Matters: Harnessing Video Models in 3D Super-Resolution

Hyun-kyu Ko, Dongheok Park, Youngin Park, Byeonghyeon Lee, Juhee Han, Eunbyung Park

TL;DR

The paper tackles 3D super-resolution from low-resolution (LR) multi-view inputs by repurposing video super-resolution (VSR) models. It shows that artifacts and misalignment undermine VSR performance when the models are fed LR renders from 3DGS, and it sidesteps the problem with a simple sequence-ordering scheme: a greedy ordering algorithm plus an adaptive-length subsequence strategy with multi-thresholding, which removes the need for VSR fine-tuning. Combined with a sub-pixel regularization objective in a 3DGS pipeline, the method achieves state-of-the-art results on the NeRF-synthetic and Mip-NeRF 360 datasets while remaining robust to artifacts and non-smooth camera paths. By avoiding VSR fine-tuning, the approach reduces computational overhead and makes VSR-driven 3D super-resolution more practical for real-world multi-view scenarios.

Abstract

3D super-resolution aims to reconstruct high-fidelity 3D models from low-resolution (LR) multi-view images. Early studies primarily focused on single-image super-resolution (SISR) models to upsample LR images into high-resolution (HR) images. However, these methods often lack view consistency because they operate independently on each image. Although various post-processing techniques have been extensively explored to mitigate these inconsistencies, they have yet to fully resolve the issues. In this paper, we perform a comprehensive study of 3D super-resolution by leveraging video super-resolution (VSR) models. By utilizing VSR models, we ensure a higher degree of spatial consistency and can reference surrounding spatial information, leading to more accurate and detailed reconstructions. Our findings reveal that VSR models can perform remarkably well even on sequences that lack precise spatial alignment. Given this observation, we propose a simple yet practical approach to align LR images without fine-tuning the VSR models or generating a 'smooth' trajectory from 3D models trained on LR images. The experimental results show that these surprisingly simple algorithms achieve state-of-the-art results on 3D super-resolution tasks on standard benchmark datasets, such as the NeRF-synthetic and Mip-NeRF 360 datasets. Project page: https://ko-lani.github.io/Sequence-Matters

Paper Structure

This paper contains 31 sections, 7 equations, 10 figures, 13 tables, 2 algorithms.

Figures (10)

  • Figure 1: Illustration of stripy or blob-like artifacts generated in VSR outputs of LR videos rendered from 3DGS. 'VSR-Render' shows the VSR outputs of the LR rendered videos, while 'VSR-GT' displays the VSR outputs of the ground truth (GT) LR videos.
  • Figure 2: Overview of the proposed method. Given LR multi-view images, we generate subsequences starting from each image using a simple greedy algorithm, and these subsequences are bounded by multiple thresholds. Finally, we train a 3DGS model for 3D reconstruction using the upsampled HR images.
  • Figure 3: Illustration of subsequence generation. (a) is an unordered multi-view image dataset. (b) is the result of using the simple greedy algorithm. (c) highlights misalignments incurred by the algorithm, and we propose to split the sequence into subsequences based on a pose difference threshold (red dotted line) between consecutive frames.
  • Figure 4: An example result from the simple greedy algorithm applied to the NeRF-synthetic dataset (Lego). Two neighboring images highlighted in red demonstrate abrupt transitions caused by misalignments.
  • Figure 5: Qualitative results on the NeRF-synthetic dataset. The PSNR values against GT are embedded in each image patch. Our method shows better results than the existing baselines, especially in high-frequency details.
  • ...and 5 more figures
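
The ordering and splitting steps described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each view is summarized only by a 3D camera position, and the Euclidean distance between positions stands in for the pose difference. The function names greedy_order and split_by_threshold, the toy positions, and the threshold value are all hypothetical.

```python
import math


def greedy_order(positions, start=0):
    """Order views by repeatedly hopping to the nearest unvisited camera.

    Assumption: `positions` holds one (x, y, z) camera position per view,
    and Euclidean distance is used as the pose difference.
    """
    remaining = set(range(len(positions)))
    remaining.remove(start)
    order = [start]
    while remaining:
        last = positions[order[-1]]
        # Ties are broken by the smaller view index to keep the result deterministic.
        nxt = min(remaining, key=lambda i: (math.dist(last, positions[i]), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def split_by_threshold(order, positions, threshold):
    """Cut the ordered sequence wherever consecutive views are too far apart."""
    if not order:
        return []
    subsequences = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if math.dist(positions[prev], positions[cur]) > threshold:
            subsequences.append([cur])  # abrupt transition: start a new subsequence
        else:
            subsequences[-1].append(cur)
    return subsequences


# Toy example: five cameras on a line, forming two clusters.
views = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (1.0, 0.0, 0.0),
         (6.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
order = greedy_order(views, start=0)
print(order)                                  # [0, 2, 4, 1, 3]
print(split_by_threshold(order, views, 1.5))  # [[0, 2, 4], [1, 3]]
```

In the toy example the greedy pass is forced into one large jump (from view 4 to view 1) once the nearby views are exhausted, which mirrors the abrupt transitions highlighted in Figure 4; the threshold then cuts the sequence at that jump. Calling greedy_order with a different start value gives a sequence beginning at another view, in the spirit of generating subsequences starting from each image.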