Balancing long- and short-term dynamics for the modeling of saliency in videos

Theodor Wulff, Fares Abawi, Philipp Allgeuer, Stefan Wermter

TL;DR

This work addresses saliency prediction in videos by modeling both long- and short-term temporal dynamics with a dual-stream Transformer that processes frames and past saliency maps as tubelet-based spatiotemporal tokens. A saliency-prior masking scheme and frame dropout let the model exploit its previous focus while remaining robust to shifts in saliency, and a fusion encoder with interchangeable decoder heads allows adaptation to different tasks. The key findings are that enlarging the short-term context helps only up to a saturation point, whereas enlarging the long-term context yields further gains, and that the best short-term depth lies at roughly half the long-term span. These results show how to balance temporal contexts in Transformer-based architectures for video salient object detection and provide a practical framework that can be adapted to varying downstream saliency tasks.

Abstract

The role of long- and short-term dynamics in salient object detection in videos is under-researched. We present a Transformer-based approach to learn a joint representation of video frames and past saliency information. Our model embeds long- and short-term information to detect dynamically shifting saliency in video. We provide our model with a stream of video frames and past saliency maps, which act as a prior for the next prediction, and extract spatiotemporal tokens from both modalities. The decomposition of the frame sequence into tokens lets the model incorporate short-term information from within each token, while being able to make long-term connections between tokens throughout the sequence. The core of the system consists of a dual-stream Transformer architecture that processes the extracted sequences independently before fusing the two modalities. Additionally, we apply a saliency-based masking scheme to the input frames to learn an embedding that facilitates the recognition of deviations from previous outputs. We observe that the additional prior information aids in the first detection of the salient location. Our findings indicate that the ratio of spatiotemporal long- and short-term features directly impacts the model's performance. While increasing the short-term context is beneficial up to a certain threshold, the model's performance greatly benefits from an expansion of the long-term context.
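
The decomposition of a frame sequence into spatiotemporal tokens can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `extract_tubelets`, the non-overlapping layout, the 32x32 frame size, and the patch size of 16 are all assumptions chosen for illustration. The sketch only shows how the tubelet depth trades short-term context inside each token against the number of tokens available for long-term connections.

```python
import numpy as np


def extract_tubelets(frames, depth, patch):
    """Split a (T, H, W, C) clip into non-overlapping tubelet tokens.

    Hypothetical helper: each token covers `depth` consecutive frames and a
    `patch` x `patch` spatial window, flattened into one vector.
    Returns an array of shape (num_tokens, depth * patch * patch * C).
    """
    t, h, w, c = frames.shape
    if t % depth or h % patch or w % patch:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    nt, nh, nw = t // depth, h // patch, w // patch
    # Separate each axis into (block index, offset within block).
    x = frames.reshape(nt, depth, nh, patch, nw, patch, c)
    # Bring the block indices to the front: (nt, nh, nw, depth, patch, patch, c).
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(nt * nh * nw, depth * patch * patch * c)


# Assumed toy input: 12 frames of 32x32 RGB.
clip = np.random.default_rng(0).random((12, 32, 32, 3))
for depth in (2, 4, 6):
    tokens = extract_tubelets(clip, depth, patch=16)
    # Deeper tubelets: more short-term context per token, fewer tokens overall.
    print(depth, tokens.shape)
```

With a fixed clip length, a larger tubelet depth packs more frames into every token but leaves fewer temporal positions for attention to relate across the sequence, which is the balance the abstract refers to.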

Paper Structure

This paper contains 23 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Model components. The past saliency maps are utilized for the masking of the input frames. From both inputs a sequence of tubelets is extracted which serves as input for the domain-specific encoder blocks that attend to the sequence along different dimensions before the interchangeable decoder head generates the final output.
  • Figure 2: Precision, F-score, and S-measure across different values of $d_t$ for $d_f=8$ and $d_f=12$.
  • Figure 3: Output of different tubelet depths with an input sequence of 12 frames.