Transformer-based Video Saliency Prediction with High Temporal Dimension Decoding

Morteza Moradi; Simone Palazzo; Concetto Spampinato

TL;DR

The paper addresses video saliency prediction with spatio-temporal transformers, focusing on how best to use temporal features during decoding. It introduces THTD-Net, which pairs a Video Swin Transformer encoder with a single deep decoder that keeps a high temporal dimension throughout decoding, avoiding multi-branch architectures. The training objective sums a linear correlation coefficient loss and a KL-divergence loss between the predicted saliency map $S$ and the ground truth $G$: $L(S,G)=L_{CC}(S,G)+L_{KL}(S,G)$, with $L_{CC}(S,G)=-\frac{cov(S,G)}{\rho(S)\rho(G)}$, where $\rho$ denotes the standard deviation, and $L_{KL}(S,G)=\sum_x G(x)\log\frac{G(x)}{S(x)}$. The model is optimized with Adam at a learning rate of $10^{-5}$ and batch size 1. Empirically, THTD-Net achieves competitive performance on DHF1K and comparable results on Hollywood-2 and UCF-Sports with a compact 220 MB model. Ablations show that longer decoders and preserving temporal richness during decoding help, while excessive depth or early temporal downsampling can hurt performance.
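The combined objective above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, treats each saliency map as a flat list of non-negative values, and makes two assumptions that the summary does not state, namely that both maps are normalised to sum to 1 before the KL term and that a small constant `EPS` guards against division by zero and `log(0)`. The function names are hypothetical.

```python
import math

EPS = 1e-8  # assumed stabiliser; the value is not taken from the paper


def cc_loss(s, g):
    """Negative linear correlation coefficient of two flattened maps."""
    n = len(s)
    mean_s = sum(s) / n
    mean_g = sum(g) / n
    # Covariance and standard deviations (population form, divided by n).
    cov = sum((a - mean_s) * (b - mean_g) for a, b in zip(s, g)) / n
    std_s = math.sqrt(sum((a - mean_s) ** 2 for a in s) / n)
    std_g = math.sqrt(sum((b - mean_g) ** 2 for b in g) / n)
    return -cov / (std_s * std_g + EPS)


def kl_loss(s, g):
    """Sum over positions of G(x) * log(G(x) / S(x)).

    Assumption: both maps are first normalised to sum to 1.
    """
    sum_s = sum(s)
    sum_g = sum(g)
    total = 0.0
    for a, b in zip(s, g):
        p = a / (sum_s + EPS)  # normalised prediction S(x)
        q = b / (sum_g + EPS)  # normalised ground truth G(x)
        total += q * math.log(q / (p + EPS) + EPS)
    return total


def saliency_loss(s, g):
    """L(S, G) = L_CC(S, G) + L_KL(S, G)."""
    return cc_loss(s, g) + kl_loss(s, g)


ground_truth = [0.1, 0.2, 0.3, 0.4]
reversed_prediction = [0.4, 0.3, 0.2, 0.1]
perfect = saliency_loss(ground_truth, ground_truth)
mismatched = saliency_loss(reversed_prediction, ground_truth)
```

With a perfect prediction the correlation term approaches -1 and the KL term approaches 0, so the loss is close to -1; a mismatched prediction raises both terms. In practice the same computation would be written with tensor operations so that it is differentiable and runs over whole frames.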

Abstract

In recent years, finding an effective and efficient strategy for exploiting spatial and temporal information has been a hot research topic in video saliency prediction (VSP). With the emergence of spatio-temporal transformers, the weakness of prior strategies, e.g., 3D convolutional networks and LSTM-based networks, in capturing long-range dependencies has been effectively compensated. While VSP has benefited from spatio-temporal transformers, finding the most effective way of aggregating temporal features is still challenging. To address this concern, we propose a transformer-based video saliency prediction approach with a high temporal dimension decoding network (THTD-Net). This strategy accounts for the lack of complex hierarchical interactions between features that are extracted from the transformer-based spatio-temporal encoder: in particular, it does not require multiple decoders and aims at gradually reducing temporal features' dimensions in the decoder. This decoder-based architecture yields performance comparable to multi-branch and over-complicated models on common benchmarks such as DHF1K, UCF-Sports, and Hollywood-2.

Paper Structure (11 sections, 4 equations, 2 figures, 3 tables)


INTRODUCTION
RELATED WORK
METHOD
Model Architecture
Training Objective
EXPERIMENTS
Datasets
Experimental Setup
Result Analysis
Ablation Study
CONCLUSION

Figures (2)

Figure 1: Overview of the proposed video saliency prediction model. The output channel dimensions of the decoder layers are reported in the figure.
Figure 2: Qualitative comparison of the performance of different video saliency prediction models.
