SRVP: Strong Recollection Video Prediction Model Using Attention-Based Spatiotemporal Correlation Fusion
Yuseon Kim, Kyongseok Park
TL;DR
SRVP tackles the blurring and error accumulation in long-term video prediction by embedding two attention modules, standard attention (SA) and reinforced feature attention (RFA), into a ConvGRU-based encoder–forecaster. This attention-driven fusion improves both temporal context recall and spatial detail preservation, yielding sharper predictions while remaining competitive with Transformer-based, non-recurrent models. Across Moving MNIST, KTH, and Human3.6M, SRVP reduces blur relative to strong RNN baselines and approaches or matches state-of-the-art RNN-free methods, with ablations identifying RFA as the key contributor. The approach offers a practical balance between temporal modeling and spatial fidelity, with potential for further gains via segmentation-based refinements.
Abstract
Video prediction (VP) generates future frames by leveraging spatial representations and temporal context from past frames. Traditional recurrent neural network (RNN)-based models enhance memory cell structures to capture spatiotemporal states over extended durations but suffer from gradual loss of object appearance details. To address this issue, we propose the strong recollection VP (SRVP) model, which integrates standard attention (SA) and reinforced feature attention (RFA) modules. Both modules employ scaled dot-product attention to extract temporal context and spatial correlations, which are then fused to enhance spatiotemporal representations. Experiments on three benchmark datasets demonstrate that SRVP mitigates image quality degradation in RNN-based models while achieving predictive performance comparable to RNN-free architectures.
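As a point of reference for the scaled dot-product attention that both modules are said to employ, the following is a minimal NumPy sketch of the generic operation, softmax(QK^T / sqrt(d_k)) V. It is not the SRVP implementation: the function name, the toy shapes, and the choice of past-frame feature vectors as keys and values are assumptions made for illustration, and the abstract does not specify how SA and RFA construct their queries, keys, and values or how their outputs are fused.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Generic scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v.

    q: (n_queries, d_k), k: (n_keys, d_k), v: (n_keys, d_v).
    Returns the attended output (n_queries, d_v) and the weights (n_queries, n_keys).
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical toy setup (not from the paper): 4 past-frame feature vectors
# of dimension 8 serve as keys and values; one query stands in for the
# current hidden state.
rng = np.random.default_rng(0)
past_features = rng.standard_normal((4, 8))
query = rng.standard_normal((1, 8))

context, weights = scaled_dot_product_attention(query, past_features, past_features)
```

Here `context` is a weighted mixture of the past-frame features, with weights that sum to one across the four frames. The same operation could be applied over flattened spatial positions instead of time steps to capture spatial correlations, though the exact arrangement used in SRVP is not stated in this section.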
