BF-STVSR: B-Splines and Fourier-Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution

Eunjin Kim, Hyeonjin Kim, Kyong Hwan Jin, Jaejun Yoo

TL;DR

This work tackles Continuous Spatial-Temporal Video Super-Resolution (C-STVSR) by addressing spectral bias and the limitations of coordinate-based INR encodings and pre-trained optical-flow networks. It introduces BF-STVSR, a flow-free framework comprising two axis-specific modules: a Temporal B-spline Mapper for smooth motion interpolation and a Spatial Fourier Mapper for capturing dominant spatial frequencies. Together they support interpolation at an arbitrary time $t \in [0,1]$ and scale $s$ without external RAFT guidance. The method achieves state-of-the-art PSNR, SSIM, and video-quality scores across standard benchmarks while reducing computational cost: motion is learned from encoded features and applied through forward warping, with no optical-flow supervision. These results suggest that axis-specific, frequency-aware representations can robustly model spatio-temporal video structure for continuous interpolation at practical cost.

Abstract

While prior methods in Continuous Spatial-Temporal Video Super-Resolution (C-STVSR) employ Implicit Neural Representation (INR) for continuous encoding, they often struggle to capture the complexity of video data, relying on simple coordinate concatenation and pre-trained optical flow networks for motion representation. Interestingly, we find that adding position encoding, contrary to common observations, does not improve performance and can even degrade it. This issue becomes particularly pronounced when position encoding is combined with pre-trained optical flow networks, which can limit the model's flexibility. To address these issues, we propose BF-STVSR, a C-STVSR framework with two key modules tailored to better represent the spatial and temporal characteristics of video: 1) a B-spline Mapper for smooth temporal interpolation, and 2) a Fourier Mapper for capturing dominant spatial frequencies. Our approach achieves state-of-the-art results on various metrics, including PSNR and SSIM, showing enhanced spatial details and natural temporal consistency. Our code is available at https://github.com/Eunjnnn/bfstvsr.
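
To make the idea of smooth temporal interpolation with B-splines concrete, here is a minimal sketch in plain Python. It is not the authors' B-spline Mapper: the function names, the uniform knot layout, and the scalar coefficients are assumptions chosen for illustration, whereas in the paper the coefficients are predicted by a network from encoded features. The sketch only shows how a cubic B-spline basis turns a few coefficients into a smooth value at an arbitrary time $t \in [0,1]$.

```python
def cubic_bspline_basis(u):
    """Uniform cubic B-spline kernel at offset u (non-zero only for |u| < 2)."""
    a = abs(u)
    if a < 1.0:
        return 2.0 / 3.0 - a * a + 0.5 * a ** 3
    if a < 2.0:
        return (2.0 - a) ** 3 / 6.0
    return 0.0


def bspline_interpolate(coeffs, knots, t):
    """Blend coefficients with cubic B-spline weights at query time t.

    coeffs and knots are hypothetical inputs: one coefficient per
    uniformly spaced knot. A learned model would predict the coefficients.
    """
    spacing = knots[1] - knots[0]
    return sum(
        c * cubic_bspline_basis((t - k) / spacing)
        for c, k in zip(coeffs, knots)
    )


# Four knots around the interval [0, 1] between the two input frames.
knots = [-1.0, 0.0, 1.0, 2.0]
# Cubic B-splines reproduce linear sequences, so coefficients equal to the
# knot positions give back t itself (up to floating-point rounding).
value = bspline_interpolate(knots, knots, 0.25)
```

Because the basis weights sum to one on $[0,1]$ and vary smoothly with $t$, the interpolated value changes smoothly between frames, which is the property the abstract attributes to the temporal module.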

Paper Structure (22 sections, 6 equations, 6 figures, 4 tables)

This paper contains 22 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Illustration of BF-STVSR and results. (a) BF-STVSR captures high-frequency spatial features with the Fourier Mapper and interpolates temporal information smoothly via the B-spline Mapper. (b) We visualize the changes of the interpolated frames over time $t$ for a selected x-axis (yellow vertical line in (a)).
  • Figure 2: Schematic overview of our BF-STVSR. (a) First, two input frames are encoded as low-resolution feature maps. Based on these features, the Fourier Mapper predicts the dominant frequency information, while the B-spline Mapper predicts a smoothly interpolated motion representation, which is then processed into motion vectors at an arbitrary time $t$. The frequency information is temporally propagated by warping it with the predicted motion vectors. Finally, the warped feature is decoded to generate a high-resolution interpolated RGB frame. (b) The Fourier Mapper estimates the dominant frequencies and their amplitudes to capture fine-detail information from the given frames. (c) The B-spline Mapper estimates B-spline coefficients to model inherent motion, which smoothly interpolates motion features temporally.
  • Figure 3: Qualitative comparison on arbitrary-scale temporal interpolation. "Overlap" refers to the averaged image of the two input frames ($t=0,1$), and the following images are interpolated results at $t \in [0,1]$. (a) shows the interpolated results at the in-distribution temporal scale ($\times 8$), used during training. (b) shows the interpolated results at an out-of-distribution temporal scale ($\times 6$), not seen during training.
  • Figure 4: Qualitative comparison on the large out-of-distribution scale with a spatial scale of $\times 4$ and a temporal scale of $\times 12$. Three interpolation results at $t = 0.25, 0.5, 0.75$ are shown with residual intensity maps compared to the ground truth frames.
  • Figure 5: Comparison of computational cost (left) and inference time (right) at a spatial resolution of $1280\times 720$ across different temporal scales. All frames are spatially interpolated by a factor of $\times 4$.
  • ...and 1 more figures
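
The Figure 2(b) caption describes the Fourier Mapper as estimating dominant frequencies and their amplitudes. A minimal sketch of that kind of amplitude-and-frequency encoding is given below. It is an illustration under stated assumptions, not the paper's formulation: the function name, the use of a 2D coordinate offset, and the cosine/sine pairing with a factor of $\pi$ are all hypothetical choices, and in the actual model the amplitudes and frequencies would be predicted from encoded features rather than passed in by hand.

```python
import math


def fourier_features(amplitudes, frequencies, offset):
    """Encode a 2D coordinate offset with amplitude-scaled sinusoids.

    amplitudes:  one scalar per frequency (hypothetical; a network would
                 estimate these from the input frames).
    frequencies: one (fx, fy) pair per amplitude (also hypothetical).
    offset:      (dx, dy) from a low-resolution feature location to the
                 queried high-resolution coordinate.
    """
    dx, dy = offset
    feats = []
    for a, (fx, fy) in zip(amplitudes, frequencies):
        phase = math.pi * (fx * dx + fy * dy)
        feats.append(a * math.cos(phase))
        feats.append(a * math.sin(phase))
    return feats


# Two assumed frequency components evaluated at a small sub-pixel offset.
feats = fourier_features([1.5, 0.5], [(3.0, 1.0), (0.0, 2.0)], (0.1, -0.2))
```

Larger frequency values make the features oscillate faster across the offset, which is how such an encoding can carry fine spatial detail that a raw coordinate input tends to miss.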