RealViformer: Investigating Attention for Real-World Video Super-Resolution

Yuehan Zhang, Angela Yao

TL;DR

RealViformer investigates attention in real-world video super-resolution (RWVSR) by comparing covariance-based spatial and channel attention under real degradations. It demonstrates that channel attention is more robust to artifact-laden queries but tends to increase output-channel covariance, which can hinder learning; this is mitigated by the CAF and ICA (Improved Channel Attention) modules, which incorporate squeeze-and-excite and covariance-based rescaling. The model uses a unidirectional recurrent Transformer with CAF for temporal fusion and ICA for enhanced channel processing, achieving state-of-the-art results with fewer parameters and faster runtimes on multiple real-world and synthetic datasets. This work provides practical guidance on attention design for RWVSR and introduces design patterns to control channel redundancy, offering a path toward more reliable real-world video enhancement systems.

Abstract

In real-world video super-resolution (VSR), videos suffer from in-the-wild degradations and artifacts. VSR methods, especially recurrent ones, tend to propagate artifacts over time in the real-world setting and are more vulnerable than image super-resolution methods. This paper investigates the influence of artifacts on commonly used covariance-based attention mechanisms in VSR. Comparing the widely used spatial attention, which computes covariance over space, with channel attention, we observe that the latter is less sensitive to artifacts. However, channel attention leads to feature redundancy, as evidenced by the higher covariance among output channels. We therefore explore simple techniques, such as the squeeze-excite mechanism and covariance-based rescaling, to counter the effects of high channel covariance. Based on our findings, we propose RealViformer, a channel-attention-based real-world VSR framework that surpasses the state of the art on two real-world VSR datasets with fewer parameters and faster runtimes. The source code is available at https://github.com/Yuehan717/RealViformer.
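
The distinction between the two attention types in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the single-head form, the absence of learned projections, and the scaling factors (sqrt(C) and sqrt(HW)) are all assumptions made for illustration. Spatial attention builds an HW x HW map over pixels, while channel attention builds a C x C map over channels. The helper `mean_abs_offdiag_cov` is one plausible way to quantify "covariance among output channels"; the paper's exact metric may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_attention(q, k, v):
    """q, k, v: arrays of shape (C, H, W). Attention map has shape (HW, HW)."""
    c, h, w = q.shape
    # Each pixel becomes a C-dimensional token: (HW, C).
    qf, kf, vf = (x.reshape(c, h * w).T for x in (q, k, v))
    attn = softmax(qf @ kf.T / np.sqrt(c), axis=-1)  # scaling is an assumption
    return (attn @ vf).T.reshape(c, h, w)


def channel_attention(q, k, v):
    """q, k, v: arrays of shape (C, H, W). Attention map has shape (C, C)."""
    c, h, w = q.shape
    # Each channel becomes an HW-dimensional token: (C, HW).
    qf, kf, vf = (x.reshape(c, h * w) for x in (q, k, v))
    attn = softmax(qf @ kf.T / np.sqrt(h * w), axis=-1)  # scaling is an assumption
    return (attn @ vf).reshape(c, h, w)


def mean_abs_offdiag_cov(x):
    """Mean absolute covariance between distinct channels of x (C, H, W), C >= 2."""
    c = x.shape[0]
    cov = np.cov(x.reshape(c, -1))  # rows are treated as variables (channels)
    off_diag = ~np.eye(c, dtype=bool)
    return float(np.abs(cov[off_diag]).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((8, 6, 6)) for _ in range(3))
    for name, fn in [("spatial", spatial_attention), ("channel", channel_attention)]:
        out = fn(q, k, v)
        print(name, out.shape, mean_abs_offdiag_cov(out))
```

On random inputs the printed covariance values carry no meaning for the paper's claim; the sketch only shows where each attention map lives and how an output-channel covariance statistic could be computed.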

Paper Structure (14 sections, 3 equations, 11 figures, 3 tables)


Figures (11)

  • Figure 1: (a) Designing an RWVSR transformer is not trivial. A Swin-based transformer suited for standard VSR hallucinates more lines than RealBasicVSR, a convolutional state-of-the-art method. We propose RealViformer based on our investigation of attention under the RWVSR setting. RealViformer generates details with fewer artifacts than RealBasicVSR [chan2021basicvsr] and the Swin-based VSR model. (b) Schematic for spatial and channel attention. Spatial attention aggregates features based on pixel representations. Channel attention takes the $H\times W$ feature maps for matching across channels.
  • Figure 2: Schematic for sensitivity comparison. $\{I_{t-1}, I_{t}\}$ are downsampled but clean frames at times $t-1$ and $t$. $D_i(\cdot)$ applies a degradation to $I_t$, where $D_i \in \{\text{blur}, \text{noise}, \text{compression}\}$. $O$ and $O_{D_i}$ are the output features of the attention module. Queries are from the embedding at time $t$, and keys and values are from time $t-1$. A higher cosine similarity $S$ between the attention output features $O$ and $O_{D_i}$ reflects less sensitivity to artifacts in the queries.
  • Figure 3: (a) The recurrent baseline used in the investigation section has a shallow mapping module $\mathcal{F}$, a reconstruction module $\mathcal{R}$, an upsampling module $\mathcal{U}$ and a warping function $W$. $W$ aligns the hidden state $h_{t-1}$ to the feature at time $t$ based on the optical flow $s^f_{(t-1)\rightarrow t}$. All residual blocks are convolutional. The concatenation of $f_t$ and $\hat{h}_{t-1}$ is replaced with the spatial or channel attention module in (b) to compare the effect of attention. (b) The attention module first applies layer normalization to $f_{t}$ and $\hat{h}_{t-1}$ and then performs channel or spatial attention as given in the attention definitions. The output feature $O^{A}_{t}$, concatenated with $f_t$, is processed by the module $\mathcal{R}$ in (a).
  • Figure 4: Comparison of spatial and channel attention through their impact on the performance of a real-world VSR model. The Y-axis shows improvements over the convolutional baseline. A lower LPIPS score is better. The channel attention module performs best except in PSNR on highly blurred inputs.
  • Figure 5: Improved Channel Attention (ICA) module, shown as self-attention for simplicity. The 'squeeze' convolution reduces the number of channels of the input feature $X\in\mathbb{R}^{C\times H\times W}$ by a ratio $r$. The features are then rescaled by weights predicted from the $\frac{C}{r}\times\frac{C}{r}$ attention map before being expanded by the 'excite' convolution back to the original number of input channels.
  • ...and 6 more figures
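
The sensitivity comparison described in the Figure 2 caption can be sketched roughly as follows. This is a hedged illustration rather than the paper's protocol: `query_sensitivity`, `add_gaussian_noise`, and `toy_channel_attention` are hypothetical names, the embedding step is omitted (features are used directly as queries, keys, and values), and additive Gaussian noise stands in for the degradation $D_i$. The score is the cosine similarity $S$ between the attention output for a clean query ($O$) and for a degraded query ($O_{D_i}$), with keys and values held fixed.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    # Cosine similarity between two feature tensors, flattened to vectors.
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def query_sensitivity(attn_fn, q_clean, q_degraded, k, v):
    """Return S = cos(O, O_D). Higher S means less sensitivity to the query artifacts."""
    o = attn_fn(q_clean, k, v)       # output with the clean query
    o_d = attn_fn(q_degraded, k, v)  # output with the degraded query, same keys/values
    return cosine_similarity(o, o_d)


def add_gaussian_noise(x, sigma, rng):
    # Stand-in degradation (assumption); the caption lists blur, noise, compression.
    return x + sigma * rng.standard_normal(x.shape)


def toy_channel_attention(q, k, v):
    # Simplified single-head channel attention on (C, H, W) inputs (assumption).
    c, h, w = q.shape
    qf, kf, vf = (x.reshape(c, h * w) for x in (q, k, v))
    logits = qf @ kf.T / np.sqrt(h * w)
    logits = logits - logits.max(axis=-1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return (attn @ vf).reshape(c, h, w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat_prev = rng.standard_normal((8, 6, 6))  # features at time t-1 (keys, values)
    feat_curr = rng.standard_normal((8, 6, 6))  # features at time t (queries)
    noisy = add_gaussian_noise(feat_curr, sigma=0.5, rng=rng)
    s = query_sensitivity(toy_channel_attention, feat_curr, noisy, feat_prev, feat_prev)
    print("cosine similarity S:", s)
```

Passing a different `attn_fn` (for example, a spatial attention function with the same signature) allows the two attention types to be compared under the same degraded query.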