Time-series Initialization and Conditioning for Video-agnostic Stabilization of Video Super-Resolution using Recurrent Networks
Hiroshi Mori, Norimichi Ukita
TL;DR
This work tackles the instability of RNN-based video super-resolution (VSR) when applied to long and dynamic videos due to domain gaps from training on short clips. It introduces a video-agnostic training framework combining Truncated Partial-Initialization BPTT (PI-BPTT) with frame-number conditioning to stabilize long-video VSR while controlling memory usage through hidden-state reuse. PI-BPTT stores hidden states computed from all frames with small spatial crops and reuses them during training with a controllable recurrence length $l$ and repetition factor $R$, balancing memory efficiency and accuracy. Frame-number conditioning augments each LR input with a normalized frame index, enabling the model to adapt its processing to varying difficulty levels tied to sequence length and dynamics. Experimental results on Vimeo, REDStrain, Vid4, REDS4, and quasi-static long videos show consistent PSNR/SSIM gains over RI-BPTT baselines for FRVSR and BasicVSR, with the largest improvements occurring in long or complex sequences, while highlighting a manageable trade-off between training time and accuracy when tuning $R$.
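As a rough illustration of frame-number conditioning, the sketch below appends one constant channel holding a normalized frame index to a low-resolution frame. The function name `add_frame_index_channel`, the nested-list frame layout, and the normalization by a fixed `max_index` are assumptions made for this example; the summary above does not specify how the index is normalized or injected into the network.

```python
def add_frame_index_channel(frame, t, max_index):
    """Append a constant channel holding the normalized frame index t / max_index.

    frame is a nested list shaped [channels][height][width].
    Dividing by max_index is an assumed normalization, chosen for illustration.
    """
    height, width = len(frame[0]), len(frame[0][0])
    value = t / max_index
    index_channel = [[value] * width for _ in range(height)]
    # Build a new list so the input frame is left unmodified.
    return frame + [index_channel]


# Hypothetical 1-channel, 2x2 low-resolution frame at frame index 5 of 10.
lr_frame = [[[0.1, 0.2], [0.3, 0.4]]]
conditioned = add_frame_index_channel(lr_frame, t=5, max_index=10)
```

The conditioned frame carries one extra channel, so a network consuming it would need its first layer to accept one more input channel than the unconditioned version.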
Abstract
A recurrent neural network (RNN) for video super-resolution (VSR) is generally trained on short clips that are randomly clipped and cropped from the original training videos, because of various challenges in training RNNs. However, because such an RNN is optimized to super-resolve short videos, its VSR quality degrades on long videos owing to the domain gap. Our preliminary experiments reveal that the extent of this degradation depends on video properties such as length and dynamics. To avoid this degradation, this paper proposes a training strategy for RNN-based VSR that works efficiently and stably regardless of video length and dynamics. The proposed strategy stabilizes VSR by training the network with a variety of RNN hidden states that change depending on the video properties. Since computing such a variety of hidden states is time-consuming, the computational cost is reduced by reusing hidden states during training. In addition, training stability is further improved by frame-number conditioning. Our experimental results demonstrate that the proposed method outperforms the base methods on videos of various lengths and dynamics.
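The general idea of reusing hidden states can be illustrated with a toy scalar recurrence. This is not the authors' PI-BPTT: the update rule `step`, the helper names, and the list-based cache are all hypothetical stand-ins. The sketch only shows that a short clip started from a cached state sees the same hidden states as a full-length pass, whereas a zero-initialized clip does not.

```python
def step(h, x, decay=0.5):
    """Toy recurrent update standing in for a VSR network's hidden-state update."""
    return decay * h + x


def precompute_states(frames):
    """Run the recurrence once over the whole video and store every hidden state."""
    states, h = [], 0.0
    for x in frames:
        h = step(h, x)
        states.append(h)
    return states


def run_clip(frames, start, length, cache=None):
    """Process frames[start:start+length], optionally starting from a cached state."""
    h = cache[start - 1] if (cache is not None and start > 0) else 0.0
    out = []
    for t in range(start, min(start + length, len(frames))):
        h = step(h, frames[t])
        out.append(h)
        if cache is not None:
            cache[t] = h  # refresh the stored state so later clips can reuse it
    return out


frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # toy "video" of scalar frames
cache = precompute_states(frames)
reused = run_clip(frames, start=3, length=2, cache=cache)   # starts from cache[2]
zero_init = run_clip(frames, start=3, length=2)             # starts from 0.0
```

Here `reused` matches the states of the full pass at frames 3 and 4, while `zero_init` differs, which is a small-scale picture of the gap between short-clip training and long-video inference.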
