TempSAL -- Uncovering Temporal Information for Deep Saliency Prediction

Bahar Aydemir, Ludo Hoffstetter, Tong Zhang, Mathieu Salzmann, Sabine Süsstrunk

TL;DR

TempSAL addresses the lack of temporal dynamics in saliency prediction by modeling attention as time-resolved trajectories. It introduces a temporal-slice decoder and a spatiotemporal mixing module that jointly predict time-specific saliency maps and refine an image saliency map, leveraging a multi-level encoder (PNASNet-5). Across SALICON and CodeCharts1k, TempSAL outperforms state-of-the-art models and a multi-duration baseline, demonstrating that explicit temporal information improves accuracy. The approach also provides insights into gaze evolution, suggesting practical benefits for design, advertising, and user experience applications where temporal attention matters.

Abstract

Deep saliency prediction algorithms complement object recognition features; they typically rely on additional information, such as scene context, semantic relationships, gaze direction, and object dissimilarity. However, none of these models consider the temporal nature of gaze shifts during image observation. We introduce a novel saliency prediction model that learns to output saliency maps in sequential time intervals by exploiting human temporal attention patterns. Our approach locally modulates the saliency predictions by combining the learned temporal maps. Our experiments show that our method outperforms the state-of-the-art models, including a multi-duration saliency model, on the SALICON benchmark. Our code will be publicly available on GitHub.
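
To make "locally modulates the saliency predictions by combining the learned temporal maps" more concrete, the following is a minimal NumPy sketch of one possible reading: a per-pixel convex combination of the temporal maps. It is not the paper's spatiotemporal mixing module, which is learned end to end. The function name `combine_temporal_maps`, the `weight_logits` input, and the softmax weighting are assumptions introduced only for illustration.

```python
import numpy as np


def combine_temporal_maps(temporal_maps, weight_logits):
    """Per-pixel convex combination of temporal saliency maps.

    temporal_maps: array of shape (T, H, W), one map per time interval.
    weight_logits: array of shape (T, H, W), hypothetical per-pixel scores
        saying how much each interval should contribute at each location.
    Returns a single (H, W) map.
    """
    # Softmax over the time axis, shifted by the max for numerical stability.
    shifted = weight_logits - weight_logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    # Weighted sum over the time axis.
    return (weights * temporal_maps).sum(axis=0)


rng = np.random.default_rng(0)
temporal_maps = rng.random((5, 4, 6))  # five intervals, toy 4x6 resolution

# Equal logits everywhere reduce to a plain average over intervals.
uniform = combine_temporal_maps(temporal_maps, np.zeros((5, 4, 6)))

# Strongly favoring the first interval reproduces that interval's map.
peaked_logits = np.zeros((5, 4, 6))
peaked_logits[0] = 50.0
peaked = combine_temporal_maps(temporal_maps, peaked_logits)
```

Because the weights vary per pixel, one region of the output can follow an early interval while another follows a later one, which is the sense of "locally" assumed in this sketch.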

Paper Structure (22 sections, 8 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: An example of how human attention evolves over time. Top row: Temporal and image saliency ground truth from the SALICON dataset. Bottom row: Our temporal and image saliency predictions. Each temporal saliency map $\mathcal{T}_i$, $i \in \{1,\ldots,5\}$ represents one second of observation time. Note that in $\mathcal{T}_1$, the chef is salient, while in $\mathcal{T}_2$ and $\mathcal{T}_3$, the food on the barbecue becomes the most salient region in this scene. We can predict the temporal saliency maps for each interval separately, or combine them to create a single, refined image saliency map for the entire observation period.
  • Figure 2: Average heat maps for each one-second interval. Note that a center bias occurs, similar to the average ground-truth maps in image saliency prediction.
  • Figure 3: Differences of the consecutive average temporal slices shown in Fig. 2. Red indicates regions of increased attention whereas blue indicates decreased attention.
  • Figure 4: Number of fixations with their respective saliency values and timestamps. Lighter colors indicate a higher number of occurrences, while darker areas denote fewer occurrences. We see that late fixations tend to be less salient, which can be seen as the decrease in the number of salient fixations along the arrow. The most salient fixations appear at approximately 1s.
  • Figure 5: Overview of the proposed architecture. We encode image features into encoder blocks consisting of multi-level image features. We then pass these blocks to the temporal saliency decoder (shown in orange) to decode them into temporal saliency predictions, which are saliency maps in sequential time intervals. In parallel, the image saliency decoder (shown in green) decodes the encoder blocks into an image saliency prediction. We then combine (1) the temporal saliency maps, (2) the image saliency map, and (3) the encoder blocks in the spatiotemporal mixing module (shown in pink). (Best viewed in color.)
  • ...and 2 more figures
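
The one-second temporal slices in Figures 1 and 2 and the consecutive differences in Figure 3 can be illustrated with a small sketch. It assumes fixations arrive as hypothetical `(row, col, t_ms)` tuples and accumulates raw counts only; the smoothing normally applied to turn fixation counts into heat maps is left out, and all function names are illustrative rather than taken from the authors' code.

```python
import numpy as np


def temporal_slices(fixations, height, width, n_slices=5, slice_ms=1000):
    """Accumulate fixations into one count map per time interval.

    fixations: iterable of (row, col, t_ms) tuples (assumed format).
    Returns an array of shape (n_slices, height, width).
    """
    maps = np.zeros((n_slices, height, width), dtype=np.float64)
    for row, col, t_ms in fixations:
        idx = int(t_ms // slice_ms)  # which interval the fixation falls in
        inside = 0 <= row < height and 0 <= col < width
        if 0 <= idx < n_slices and inside:
            maps[idx, row, col] += 1.0
    return maps


def consecutive_differences(maps):
    """Slice i+1 minus slice i; positive values mean attention increased."""
    return np.diff(maps, axis=0)


# Toy data: two early fixations on one pixel, two later ones on another,
# and one fixation past the 5 s observation window, which is dropped.
fixations = [(1, 1, 200), (1, 1, 800), (2, 3, 1500), (2, 3, 2400), (0, 0, 5200)]
slices = temporal_slices(fixations, height=4, width=5)
diffs = consecutive_differences(slices)
```

In `diffs[0]`, the pixel at `(1, 1)` is negative (attention left it after the first second) and the pixel at `(2, 3)` is positive, which corresponds to the blue and red regions described for Figure 3.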