Minimalistic Video Saliency Prediction via Efficient Decoder & Spatio Temporal Action Cues
Rohit Girmaji, Siddharth Jain, Bhav Beri, Sarthak Bansal, Vineet Gandhi
TL;DR
The paper tackles efficient video saliency prediction with two models: ViNet-S, a compact 9M-parameter model with a lightweight decoder, and ViNet-A, which uses a spatio-temporal action localization (STAL) backbone (SlowFast) for improved performance on human-centric content. An ensemble, ViNet-E, averages the predictions of ViNet-S and ViNet-A to achieve state-of-the-art results across nine datasets (three visual-only and six audio-visual), outperforming transformer-based methods while maintaining real-time inference speeds (ViNet-S exceeds 1000 fps in batch mode). All three models perform robustly on both visual-only and audio-visual saliency benchmarks, and the findings indicate that audio cues offer limited practical gains for saliency. The work highlights the value of combining global action localization with efficient decoders, offering a scalable path toward real-time, resource-efficient saliency systems for diverse applications.
Abstract
This paper introduces ViNet-S, a 36 MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameter count without compromising performance. Additionally, ViNet-A (148 MB) incorporates spatio-temporal action localization (STAL) features, unlike traditional video saliency models, which use action classification backbones. Our studies show that an ensemble that averages the saliency maps predicted by ViNet-S and ViNet-A achieves state-of-the-art performance on three visual-only and six audio-visual saliency datasets, outperforming transformer-based models in both parameter efficiency and real-time performance, with ViNet-S reaching over 1000 fps.
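The ensembling step described in the abstract, averaging the saliency maps predicted by the two models, can be illustrated with a minimal sketch. The function name `ensemble_saliency` and the toy 2x2 arrays are hypothetical stand-ins, not taken from the paper's code. The sketch assumes both models emit maps of identical shape and covers only the pixel-wise mean; any resizing or normalization the authors may apply is not shown.

```python
import numpy as np


def ensemble_saliency(map_s: np.ndarray, map_a: np.ndarray) -> np.ndarray:
    """Return the pixel-wise mean of two saliency maps of identical shape.

    Hypothetical helper: map_s and map_a stand in for the outputs of
    ViNet-S and ViNet-A on the same frame.
    """
    if map_s.shape != map_a.shape:
        raise ValueError(
            f"shape mismatch: {map_s.shape} vs {map_a.shape}"
        )
    # Cast to float so integer-typed inputs are not truncated by the mean.
    return (map_s.astype(np.float64) + map_a.astype(np.float64)) / 2.0


# Toy 2x2 maps with made-up values, standing in for real model outputs.
pred_s = np.array([[0.0, 0.2], [0.4, 1.0]])
pred_a = np.array([[0.2, 0.2], [0.8, 0.0]])
pred_e = ensemble_saliency(pred_s, pred_a)
```

Because the combination is a plain mean, the ensemble adds no learned parameters of its own; its cost is that of running both models on each input.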
