Enhancing Video Super-Resolution via Implicit Resampling-based Alignment

Kai Xu, Ziwei Yu, Xin Wang, Michael Bi Mi, Angela Yao

TL;DR

Video super-resolution benefits from accurate temporal alignment, but conventional resampling during alignment tends to blur high-frequency details. The paper introduces implicit resampling-based alignment using a coordinate network with sinusoidal positional encoding and window-based cross-attention to encode sub-pixel information without imposing low-pass constraints. This approach generalizes across feature scales and alignment configurations and yields state-of-the-art or competitive results on synthetic and real VSR datasets with minimal computational overhead. The findings emphasize that preserving the reference spectrum during alignment and avoiding coordinate quantization errors can substantially improve VSR performance, with broader implications for robust, scalable video restoration.

Abstract

In video super-resolution, it is common to use a frame-wise alignment to support the propagation of information over time. The role of alignment is well-studied for low-level enhancement in video, but existing works overlook a critical step -- resampling. We show through extensive experiments that for alignment to be effective, the resampling should preserve the reference frequency spectrum while minimizing spatial distortions. However, most existing works simply use a default choice of bilinear interpolation for resampling even though bilinear interpolation has a smoothing effect and hinders super-resolution. From these observations, we propose an implicit resampling-based alignment. The sampling positions are encoded by a sinusoidal positional encoding, while the value is estimated with a coordinate network and a window-based cross-attention. We show that bilinear interpolation inherently attenuates high-frequency information while an MLP-based coordinate network can approximate more frequencies. Experiments on synthetic and real-world datasets show that alignment with our proposed implicit resampling enhances the performance of state-of-the-art frameworks with minimal impact on both compute and parameters.
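
The claim that bilinear interpolation attenuates high frequencies can be checked with a small 1-D sketch. It is not the paper's analysis or code; it assumes plain linear interpolation between two neighboring samples at a fractional shift d, and the function names are hypothetical. Under that assumption the resampler is a two-tap filter whose gain drops as frequency rises, reaching zero at the Nyquist frequency for a half-pixel shift.

```python
import cmath
import math


def linear_resample_gain(freq_cycles_per_sample: float, frac_shift: float) -> float:
    """Magnitude response of 1-D linear interpolation at a fractional shift.

    Assumed model: y[n] = (1 - d) * x[n] + d * x[n + 1],
    so H(w) = (1 - d) + d * exp(j * w) with w = 2 * pi * f.
    """
    omega = 2.0 * math.pi * freq_cycles_per_sample
    return abs((1.0 - frac_shift) + frac_shift * cmath.exp(1j * omega))


def linear_resample(signal, frac_shift):
    """Resample a 1-D signal at positions n + frac_shift by linear interpolation."""
    return [
        (1.0 - frac_shift) * left + frac_shift * right
        for left, right in zip(signal, signal[1:])
    ]


if __name__ == "__main__":
    # Gain at a half-sample shift for increasing frequencies (0.5 is Nyquist).
    for freq in (0.0, 0.125, 0.25, 0.5):
        print(freq, round(linear_resample_gain(freq, 0.5), 4))

    # A Nyquist-rate pattern is flattened by a half-sample shift.
    print(linear_resample([1.0, -1.0, 1.0, -1.0], 0.5))
```

With an integer shift (d = 0) the gain is 1 at every frequency, so the loss comes only from sub-pixel sampling positions, which is the case alignment has to handle.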

Paper Structure

This paper contains 17 sections, 12 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Comparisons with super-resolved outcomes employing nearest-neighbor interpolation, bilinear and bicubic resampling. The red arrow highlights smoothing effects for bilinear and bicubic interpolation, while the blue arrow highlights the ragged edge.
  • Figure 1: Comparisons on feature alignment. Implicit Resampling-based Alignment (IA) outperforms all three state-of-the-art alignment methods.
  • Figure 2: (a) Motion estimation provides a transformation that maps the reference frame $\mathbf{X}_r$ to the current frame $\mathbf{X}_t$. Compensation performs resampling on $\mathbf{X}_r$ to obtain the aligned value $\mathbf{X}_a[x,y]$ at each pixel location. (b) The estimated motion offsets are decomposed into integral offsets and decimal offsets. The integral offsets are used for window queries, and the decimal offsets are used for the positional encoding of the query pixel $X_t$. The features, together with the positional encodings, are modeled with coordinate networks, and the aligned pixel $X_a$ is obtained by a cross-attention mechanism.
  • Figure 4: Qualitative comparisons on REDS4 dataset. IA-CNN provides more details on the wall and more uniform patterns on the window.
  • Figure 5: Qualitative comparisons on REDS4 and Vid4. IA-RT provides sharper results and more fine-grained patterns.
  • ...and 1 more figure
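
The offset decomposition described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the floor convention for the integral part, the number of frequency bands, and the power-of-two frequency schedule are all assumptions, and the function names are hypothetical.

```python
import math


def split_offset(offset: float):
    """Split a motion offset into an integral part and a decimal part.

    Assumption: the integral part is the floor, so the decimal part
    always lies in [0, 1), including for negative offsets.
    """
    integral = math.floor(offset)
    return integral, offset - integral


def sinusoidal_encoding(decimal: float, num_bands: int = 4):
    """Encode a decimal offset as sin/cos pairs at doubling frequencies.

    Assumption: band k uses the angle 2**k * pi * decimal; the paper's
    exact frequency schedule is not given in this summary.
    """
    encoding = []
    for k in range(num_bands):
        angle = (2.0 ** k) * math.pi * decimal
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding


if __name__ == "__main__":
    dx_int, dx_dec = split_offset(2.75)
    # The integral part would select the window to query in the reference frame;
    # the encoded decimal part would accompany the query pixel.
    print(dx_int, dx_dec, len(sinusoidal_encoding(dx_dec)))
```

Keeping the decimal part as a continuous input, rather than rounding it away, is what lets a coordinate network see sub-pixel positions without the coordinate quantization error mentioned in the TL;DR.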