TempSAL -- Uncovering Temporal Information for Deep Saliency Prediction
Bahar Aydemir, Ludo Hoffstetter, Tong Zhang, Mathieu Salzmann, Sabine Süsstrunk
TL;DR
TempSAL addresses the lack of temporal dynamics in saliency prediction by modeling attention as time-resolved trajectories. It introduces a temporal-slice decoder and a spatiotemporal mixing module that use features from a multi-level encoder (PNASNet-5) to jointly predict time-specific saliency maps and refine an image saliency map. On SALICON and CodeCharts1k, TempSAL outperforms state-of-the-art models and a multi-duration baseline, indicating that explicit temporal information improves accuracy. The approach also provides insight into how gaze evolves over time, which suggests practical benefits for design, advertising, and user experience applications where temporal attention matters.
Abstract
Deep saliency prediction algorithms complement object recognition features and typically rely on additional information, such as scene context, semantic relationships, gaze direction, and object dissimilarity. However, none of these models consider the temporal nature of gaze shifts during image observation. We introduce a novel saliency prediction model that learns to output saliency maps in sequential time intervals by exploiting human temporal attention patterns. Our approach locally modulates the saliency predictions by combining the learned temporal maps. Our experiments show that our method outperforms the state-of-the-art models, including a multi-duration saliency model, on the SALICON benchmark. Our code will be publicly available on GitHub.
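
To make the idea of saliency maps in sequential time intervals concrete, the following is a minimal NumPy sketch, not the authors' implementation. It bins timestamped fixations into equal time slices and then combines the slices with per-pixel weights, as a stand-in for the local modulation described above. The function names, the `(x, y, t)` fixation format, the 5-second viewing duration, and the choice of five slices are all assumptions made for illustration; in the actual model the temporal maps and the combination are learned by a network.

```python
import numpy as np


def temporal_fixation_slices(fixations, height, width, duration=5.0, num_slices=5):
    """Bin fixations into per-interval fixation-count maps.

    fixations: iterable of (x, y, t), with x, y in pixels and t in seconds
    (hypothetical format). duration and num_slices are assumed values.
    Returns an array of shape (num_slices, height, width).
    """
    slices = np.zeros((num_slices, height, width), dtype=np.float32)
    slice_len = duration / num_slices
    for x, y, t in fixations:
        if not (0 <= t < duration):
            continue  # ignore fixations outside the viewing window
        k = min(int(t // slice_len), num_slices - 1)
        row, col = int(round(y)), int(round(x))
        if 0 <= row < height and 0 <= col < width:
            slices[k, row, col] += 1
    return slices


def combine_slices(slices, weights):
    """Combine temporal maps with per-pixel weights.

    weights has the same shape as slices and is normalized across the
    time axis at every pixel, so each location can favor different slices.
    Here the weights are given; in a learned model they would be predicted.
    """
    total = np.clip(weights.sum(axis=0, keepdims=True), 1e-8, None)
    return ((weights / total) * slices).sum(axis=0)


# Toy example: three fixations, one of which falls outside the 5 s window.
fix = [(2, 1, 0.5), (3, 3, 4.9), (0, 0, 7.0)]
slices = temporal_fixation_slices(fix, height=4, width=5)
combined = combine_slices(slices, np.ones_like(slices))
```

With uniform weights the combination reduces to a plain average over time; spatially varying weights would let early and late fixation patterns dominate in different image regions.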
