Continuous Space-Time Video Super-Resolution with 3D Fourier Fields

Alexander Becker, Julius Erbach, Dominik Narnhofer, Konrad Schindler

TL;DR

The paper addresses the challenge of continuous space-time video super-resolution by introducing Video Fourier Field (VFF), a finite sum of 3D sinusoids that models a video as $\hat{V}(x,y,t) = \sum_{i=1}^N a_i \sin(\bm{\omega}_i \cdot (x,y,t) + \phi_i)$. A neural encoder with a large spatio-temporal receptive field predicts voxel-wise amplitudes and phases, while a shared set of frequencies enables coherent, warp-free reconstruction. The method supports sampling at arbitrary spatio-temporal coordinates and incorporates a Gaussian PSF for anti-aliasing, yielding competitive or state-of-the-art PSNR/SSIM across multiple benchmarks and tasks (AVSR, VFI, and general C-STVSR) with improved temporal consistency and efficiency. This unified, continuous representation reduces reliance on explicit motion warping, enables long-range temporal context, and demonstrates practical impact for high-quality, flexible video enhancement at arbitrary scales. $\hat{V}(x,y,t)$ can be sampled efficiently at any resolution, and the approach scales with model size and context to further improve results.
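As a rough illustration of the formula above, the following NumPy sketch evaluates a finite sum of 3D sinusoids at arbitrary $(x,y,t)$ coordinates. It is not the authors' implementation: the function name `eval_fourier_field`, the random coefficients, and the use of a single global set of amplitudes and phases (rather than voxel-wise values predicted by an encoder) are assumptions made only to keep the example small.

```python
import numpy as np

def eval_fourier_field(coords, freqs, amps, phases):
    """Evaluate sum_i a_i * sin(w_i . (x, y, t) + phi_i) at each query point.

    coords: (M, 3) array of (x, y, t) query coordinates
    freqs:  (N, 3) array of frequency vectors w_i
    amps:   (N,)   array of amplitudes a_i
    phases: (N,)   array of phase shifts phi_i
    """
    arg = coords @ freqs.T + phases   # (M, N): w_i . p + phi_i for every pair
    return np.sin(arg) @ amps         # (M,): weighted sum over the N sinusoids

# Hypothetical coefficients; in the paper they come from a neural encoder.
rng = np.random.default_rng(0)
freqs = rng.normal(size=(8, 3))
amps = rng.normal(size=8)
phases = rng.uniform(0.0, 2.0 * np.pi, size=8)

# Sample the same field on an arbitrary grid: 4 x 3 pixels, 5 time steps.
xs = np.linspace(0.0, 1.0, 4)
ys = np.linspace(0.0, 1.0, 3)
ts = np.linspace(0.0, 1.0, 5)
grid = np.stack(np.meshgrid(xs, ys, ts, indexing="ij"), axis=-1).reshape(-1, 3)
values = eval_fourier_field(grid, freqs, amps, phases).reshape(4, 3, 5)
```

Because the field is a closed-form function of the coordinates, changing the output resolution only means changing `xs`, `ys`, and `ts`; the coefficients stay the same.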

Abstract

We introduce a novel formulation for continuous space-time video super-resolution. Instead of decoupling the representation of a video sequence into separate spatial and temporal components and relying on brittle, explicit frame warping for motion compensation, we encode video as a continuous, spatio-temporally coherent 3D Video Fourier Field (VFF). That representation offers three key advantages: (1) it enables cheap, flexible sampling at arbitrary locations in space and time; (2) it is able to simultaneously capture fine spatial detail and smooth temporal dynamics; and (3) it offers the possibility to include an analytical, Gaussian point spread function in the sampling to ensure aliasing-free reconstruction at arbitrary scale. The coefficients of the proposed, Fourier-like sinusoidal basis are predicted with a neural encoder with a large spatio-temporal receptive field, conditioned on the low-resolution input video. Through extensive experiments, we show that our joint modeling substantially improves both spatial and temporal super-resolution and sets a new state of the art for multiple benchmarks: across a wide range of upscaling factors, it delivers sharper and temporally more consistent reconstructions than existing baselines, while being computationally more efficient. Project page: https://v3vsr.github.io.
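The appeal of an analytical Gaussian point spread function for a sinusoidal basis can be sketched as follows. Convolving a sinusoid with a Gaussian leaves its frequency and phase unchanged and only scales its amplitude by $\exp(-\tfrac{1}{2}\sum_k \sigma_k^2 \omega_k^2)$, so anti-aliasing reduces to damping each coefficient. The paper's exact sampling equation is not reproduced in this summary; the axis-aligned Gaussian with per-axis standard deviations `sigma` and the function names below are assumptions for illustration.

```python
import numpy as np

def gaussian_attenuation(freqs, sigma):
    """Amplitude factor exp(-0.5 * sum_k (sigma_k * w_k)^2) per frequency vector.

    freqs: (N, 3) frequency vectors; sigma: (3,) per-axis Gaussian std (assumed).
    """
    return np.exp(-0.5 * np.sum((freqs * sigma) ** 2, axis=-1))

def eval_blurred_field(coords, freqs, amps, phases, sigma):
    """Evaluate the Gaussian-blurred sum of sinusoids in closed form."""
    damped = amps * gaussian_attenuation(freqs, sigma)  # (N,) damped amplitudes
    return np.sin(coords @ freqs.T + phases) @ damped   # (M,) blurred samples

# Hypothetical example: one low and one high spatial frequency along x.
freqs = np.array([[2.0, 0.0, 0.0], [40.0, 0.0, 0.0]])
sigma = np.array([0.1, 0.1, 0.0])  # blur in space only, none in time
factors = gaussian_attenuation(freqs, sigma)
```

In this example the high-frequency term is suppressed far more strongly than the low-frequency one, which is the behaviour wanted when sampling at a coarse output resolution; setting `sigma` to zero recovers the unblurred field.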

Paper Structure

This paper contains 19 sections, 4 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Performance vs. computation time. V3 outperforms other VSR models by about 2$\,$dB PSNR, while being significantly faster. PSNR is measured on the Adobe240 test set, for $\times 4$ spatial and $\times 8$ temporal SR. For compute see Sec. \ref{sec:compute}.
  • Figure 2: Overview of V3. A backbone encoder predicts a voxel grid of local phase shifts and weighting coefficients for a set of 3D Fourier basis functions. Their sum describes, within a local interval, the continuous function $\hat{V}(x,y,t)$ that we call the Video Fourier Field. The function can be sampled at different spatio-temporal resolutions (Eq. \ref{eq:psf}) to obtain an output video.
  • Figure 3: Qualitative comparison of C-STVSR methods ($\times$4 spatial, $\times$8 temporal). V3 recovers legible text as well as thin stripe patterns.
  • Figure 4: Frame interpolation ($\times8$, center frames). V3 faithfully recovers high frequency content.
  • Figure 5: Temporal consistency. The red rectangle corresponds to a vertical image column over time. V3 faithfully reconstructs complex, non-linear image motion and reduces block artifacts.
  • ...and 6 more figures