Minimalistic Video Saliency Prediction via Efficient Decoder & Spatio Temporal Action Cues

Rohit Girmaji, Siddharth Jain, Bhav Beri, Sarthak Bansal, Vineet Gandhi

TL;DR

The paper tackles efficient video saliency prediction by introducing ViNet-S, a compact 9M-parameter model with a lightweight decoder, and ViNet-A, which uses a spatio-temporal action localization (STAL) backbone (SlowFast) for improved performance on human-centric content. An ensemble, ViNet-E, averages the predictions of ViNet-S and ViNet-A and achieves state-of-the-art results across nine datasets, outperforming transformer-based methods while maintaining real-time inference speeds (ViNet-S exceeds 1000 fps in batch mode). All three models perform robustly on both visual-only and audio-visual saliency benchmarks, and the results indicate that audio cues offer limited practical gains for saliency prediction. The work highlights the value of combining global action localization with efficient decoders, offering a scalable path toward real-time, resource-efficient saliency systems for diverse applications.
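
As a quick sanity check on the reported model size, the 9M-parameter figure can be converted to an approximate weight-file size. The sketch below assumes 32-bit floating-point weights (4 bytes per parameter) and ignores file-format overhead; neither assumption is stated in the paper, and `model_size_mb` is a hypothetical helper, not part of the authors' code.

```python
def model_size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate weight size in MB (10^6 bytes).

    Assumes every parameter is stored at the same precision
    (float32 by default) and ignores any file-format overhead.
    """
    return num_params * bytes_per_param / 1_000_000


# ViNet-S: 9M parameters, assumed to be stored as float32.
vinet_s_size = model_size_mb(9_000_000)
print(f"ViNet-S estimated size: {vinet_s_size:.0f} MB")
```

Under these assumptions the estimate agrees with the 36MB size that the abstract reports for ViNet-S.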

Abstract

This paper introduces ViNet-S, a 36MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameters without compromising performance. Additionally, ViNet-A (148MB) incorporates spatio-temporal action localization (STAL) features, differing from traditional video saliency models that use action classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, by averaging predicted saliency maps, achieves state-of-the-art performance on three visual-only and six audio-visual saliency datasets, outperforming transformer-based models in both parameter efficiency and real-time performance, with ViNet-S reaching over 1000fps.
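
The ensembling step described above, averaging the saliency maps predicted by two models, can be illustrated with a minimal sketch. It assumes both maps have the same resolution and are combined with equal weights; the function name, the nested-list representation, and the toy values are illustrative assumptions, not the authors' implementation.

```python
from typing import List

SaliencyMap = List[List[float]]


def average_saliency_maps(map_s: SaliencyMap, map_a: SaliencyMap) -> SaliencyMap:
    """Return the pixel-wise mean of two equally sized saliency maps."""
    same_rows = len(map_s) == len(map_a)
    same_cols = all(len(r) == len(q) for r, q in zip(map_s, map_a))
    if not (same_rows and same_cols):
        raise ValueError("saliency maps must have the same shape")
    return [
        [(s + a) / 2.0 for s, a in zip(row_s, row_a)]
        for row_s, row_a in zip(map_s, map_a)
    ]


# Toy 2x2 maps standing in for ViNet-S and ViNet-A outputs (made-up values).
pred_s = [[0.0, 0.5], [1.0, 0.25]]
pred_a = [[1.0, 0.5], [0.0, 0.75]]

# Equal-weight ensemble, analogous to how ViNet-E is described.
pred_e = average_saliency_maps(pred_s, pred_a)
print(pred_e)
```

In practice the maps would be framework tensors rather than nested lists, in which case the same operation reduces to an element-wise mean of the two predictions.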

Paper Structure

This paper contains 18 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Our Model (ViNet-A) Architecture for SP (Best viewed in colour)
  • Figure 2: Qualitative results: Comparing Ground Truth with the predicted saliency maps of our models and STSANet on three different datasets - DHF1K, UCF-Sports and DIEM.