Transformer-based Video Saliency Prediction with High Temporal Dimension Decoding
Morteza Moradi, Simone Palazzo, Concetto Spampinato
TL;DR
The paper addresses video saliency prediction with spatio-temporal transformers, focusing on how best to use temporal features during decoding. It introduces THTD-Net, which pairs a Video Swin Transformer encoder with a single deep decoder that keeps a high temporal dimension throughout decoding, avoiding multi-branch architectures. The training objective sums a linear correlation coefficient loss and a KL-divergence loss between the predicted saliency map $S$ and the ground truth $G$: $L(S,G)=L_{CC}(S,G)+L_{KL}(S,G)$, with $L_{CC}(S,G)=-\frac{cov(S,G)}{\rho(S)\rho(G)}$, where $\rho$ denotes the standard deviation, and $L_{KL}(S,G)=\sum_x G(x)\log\frac{G(x)}{S(x)}$. The model is optimized with Adam at a learning rate of $10^{-5}$ and a batch size of 1. Empirically, THTD-Net achieves competitive performance on DHF1K and comparable results on Hollywood-2 and UCF-Sports with a compact 220 MB model. Ablations show that longer decoders and preserving temporal richness during decoding are beneficial, while excessive depth or early temporal downsampling can hurt performance.
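To make the objective concrete, the following is a minimal NumPy sketch of the combined loss for a single pair of maps. It is not the authors' implementation: the function names, the `EPS` stabilizer, and the choice to normalize both maps to sum to 1 before computing the KL term are assumptions made for illustration.

```python
import numpy as np

EPS = 1e-8  # assumed stabilizer to avoid division by zero and log(0)


def cc_loss(s, g):
    """Negative linear correlation coefficient: -cov(S, G) / (std(S) * std(G))."""
    s = np.asarray(s, dtype=np.float64).ravel()
    g = np.asarray(g, dtype=np.float64).ravel()
    cov = np.mean((s - s.mean()) * (g - g.mean()))
    return float(-cov / (s.std() * g.std() + EPS))


def kl_loss(s, g):
    """KL divergence sum_x G(x) * log(G(x) / S(x)).

    Assumption: both maps are rescaled to sum to 1 so that they can be
    treated as probability distributions over pixels.
    """
    s = np.asarray(s, dtype=np.float64).ravel()
    g = np.asarray(g, dtype=np.float64).ravel()
    s = s / (s.sum() + EPS)
    g = g / (g.sum() + EPS)
    return float(np.sum(g * np.log(g / (s + EPS) + EPS)))


def saliency_loss(s, g):
    """Combined objective L(S, G) = L_CC(S, G) + L_KL(S, G)."""
    return cc_loss(s, g) + kl_loss(s, g)
```

Under these assumptions, a prediction identical to the ground truth gives a CC term close to -1 and a KL term close to 0, so the combined loss is bounded below by roughly -1; any mismatch between the maps raises it.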
Abstract
In recent years, finding an effective and efficient strategy for exploiting spatial and temporal information has been a hot research topic in video saliency prediction (VSP). With the emergence of spatio-temporal transformers, the weakness of prior strategies, e.g., 3D convolutional networks and LSTM-based networks, in capturing long-range dependencies has been effectively compensated. While VSP has benefited from spatio-temporal transformers, finding the most effective way to aggregate temporal features is still challenging. To address this concern, we propose a transformer-based video saliency prediction approach with a high temporal dimension decoding network (THTD-Net). This strategy accounts for the lack of complex hierarchical interactions between features that are extracted from the transformer-based spatio-temporal encoder: in particular, it does not require multiple decoders and aims at gradually reducing the temporal dimension of the features in the decoder. This decoder-based architecture yields performance comparable to multi-branch and over-complicated models on common benchmarks such as DHF1K, UCF-Sports and Hollywood-2.
