Time-series Initialization and Conditioning for Video-agnostic Stabilization of Video Super-Resolution using Recurrent Networks

Hiroshi Mori; Norimichi Ukita

TL;DR

This work tackles the instability of RNN-based video super-resolution (VSR) when applied to long and dynamic videos due to domain gaps from training on short clips. It introduces a video-agnostic training framework combining Truncated Partial-Initialization BPTT (PI-BPTT) with frame-number conditioning to stabilize long-video VSR while controlling memory usage through hidden-state reuse. PI-BPTT stores hidden states computed from all frames with small spatial crops and reuses them during training with a controllable recurrence length $l$ and repetition factor $R$, balancing memory efficiency and accuracy. Frame-number conditioning augments each LR input with a normalized frame index, enabling the model to adapt its processing to varying difficulty levels tied to sequence length and dynamics. Experimental results on Vimeo, REDStrain, Vid4, REDS4, and quasi-static long videos show consistent PSNR/SSIM gains over RI-BPTT baselines for FRVSR and BasicVSR, with the largest improvements occurring in long or complex sequences, while highlighting a manageable trade-off between training time and accuracy when tuning $R$.
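The frame-number conditioning described above can be illustrated with a minimal sketch. It assumes the conditioning is realized as one extra constant input channel and that the index is normalized by a fixed maximum frame count. Both choices, along with the names `add_frame_number_channel` and `max_frames`, are hypothetical and not taken from the paper.

```python
import numpy as np


def add_frame_number_channel(lr_frame, t, max_frames):
    """Append a constant channel holding a normalized frame index.

    lr_frame   : array of shape (C, H, W), one low-resolution frame
    t          : 0-based frame index within the video
    max_frames : normalization constant (an assumption of this sketch)
    """
    if max_frames < 1:
        raise ValueError("max_frames must be at least 1")
    frame = lr_frame.astype(np.float32)
    _, h, w = frame.shape
    # Clamp so that indices beyond max_frames still map into [0, 1].
    value = min(t / max_frames, 1.0)
    index_plane = np.full((1, h, w), value, dtype=np.float32)
    # Result has shape (C + 1, H, W): the original channels plus the index.
    return np.concatenate([frame, index_plane], axis=0)


if __name__ == "__main__":
    frame = np.zeros((3, 4, 5), dtype=np.float32)
    conditioned = add_frame_number_channel(frame, t=50, max_frames=100)
    print(conditioned.shape)
```

With this layout, the first layer of the network would need one additional input channel. How the paper actually injects the index may differ.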

Abstract

A Recurrent Neural Network (RNN) for Video Super-Resolution (VSR) is generally trained on short videos that are randomly clipped and cropped from the original training videos, owing to various challenges in training RNNs. However, because such an RNN is optimized to super-resolve short videos, its VSR of long videos is degraded by the domain gap. Our preliminary experiments reveal that this degradation varies with video properties such as length and dynamics. To avoid this degradation, this paper proposes a training strategy for RNN-based VSR that works efficiently and stably regardless of video length and dynamics. The proposed strategy stabilizes VSR by training the network with a variety of RNN hidden states that change depending on the video properties. Since computing such a variety of hidden states is time-consuming, the computational cost is reduced by reusing hidden states during training. In addition, training stability is further improved with frame-number conditioning. Our experimental results demonstrate that the proposed method outperforms the baseline methods on videos of various lengths and dynamics.

Paper Structure (15 sections, 11 figures, 4 tables)

Introduction
Related Work
VSR Networks [DBLP:conf/cvpr/NahTGBHMSL19, DBLP:conf/eccv/FuoliHGTREKXLXW20]
Training Stability in RNN-based VSR
Preliminary Experiments
Backpropagation-Through-Time and its Problems for Video Processing
Proposed Method
Truncated Partial-Initialization BPTT
Frame-number Conditioning for Difficulty-dependent VSR
Experimental Results
Details
Results
Quantitative and Qualitative Results
Trade-off between Efficiency and Accuracy
Concluding Remarks

Figures (11)

Figure 1: Difference between previous [DBLP:conf/cvpr/SajjadiVB18, DBLP:conf/cvpr/ChanWYDL21] and our VSR methods.
Figure 2: The effects of the video length and texture density. $t$ denotes the frame number in which images in the same column are reconstructed.
Figure 3: The effects of the motion magnitude and object disappearances. $s$ denotes the number of pixels by which the window slides. Each red rectangle indicates the overlap between the sliding windows (i.e., pixels observed in all the sliding windows).
Figure 4: Preliminary experimental results. The effects of the intensity change. All these images are reconstructed at frame $t = 300$.
Figure 5: Random-Initialization BPTT for videos.
...and 6 more figures
