Motion-aware Memory Network for Fast Video Salient Object Detection

Xing Zhao, Haoran Liang, Peipei Li, Guodao Sun, Dongdong Zhao, Ronghua Liang, Xiaofei He

TL;DR

This work tackles the efficiency–quality gap in video salient object detection (VSOD) by integrating an Adjacent Space-Time Memory Module (ASTM) into a standard encoder–decoder framework to capture temporal cues without optical flow. A Feature Fusion Strategy (FFS) combines high-level temporal semantic information with low-level details, while a motion-aware multitask loss adds boundary motion supervision so that saliency and motion are predicted jointly. The approach reports state-of-the-art or competitive performance on large datasets such as DAVSOD, and runs at roughly 100 FPS because it needs no optical flow and reads from memory efficiently. The results underscore the value of high-level feature-guided temporal memory and multitask learning for robust, real-time VSOD, and highlight limitations related to long-term dependencies and complex lighting. Overall, the method delivers accurate, temporally coherent saliency maps at practical speeds, with clear avenues for extending memory depth and multiclass capabilities.

Abstract

Previous methods based on 3DCNN, convLSTM, or optical flow have achieved great success in video salient object detection (VSOD). However, they still suffer from high computational costs or poor-quality saliency maps. To solve these problems, we design a space-time memory (STM)-based network, which extracts useful temporal information for the current frame from adjacent frames and serves as the temporal branch of VSOD. Furthermore, previous methods considered only single-frame prediction without temporal association, so the model may not focus sufficiently on temporal information. We therefore introduce inter-frame object motion prediction into VSOD for the first time. Our model follows a standard encoder–decoder architecture. In the encoding stage, we generate high-level temporal features from the high-level features of the current frame and its adjacent frames. This approach is more efficient than optical flow-based methods. In the decoding stage, we propose an effective fusion strategy for the spatial and temporal branches. The semantic information of the high-level features guides the fusion of object details from the low-level features, and the spatiotemporal features are then obtained step by step to reconstruct the saliency maps. Moreover, inspired by the boundary supervision commonly used in image salient object detection (ISOD), we design a motion-aware loss for predicting object boundary motion and perform multitask learning for VSOD and object motion prediction simultaneously, which further helps the model extract spatiotemporal features accurately and maintain object integrity. Extensive experiments on several datasets demonstrate the effectiveness of our method, which achieves state-of-the-art metrics on some datasets. The proposed model does not require optical flow or other preprocessing, and can reach a speed of nearly 100 FPS during inference.
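
To make the idea of reading temporal information from adjacent frames concrete, the following is a minimal sketch of a generic STM-style memory read, not the paper's ASTM module. It assumes features have already been flattened into per-position key and value vectors; the function name `memory_read` and its argument names are hypothetical. Each position of the current frame attends over all memory positions with a softmax on dot-product similarity and returns a weighted sum of the memory values.

```python
import math


def memory_read(query_keys, memory_keys, memory_values):
    """Generic STM-style read (illustrative assumption, not the paper's exact ASTM).

    query_keys:    list of Nq key vectors, one per position of the current frame
    memory_keys:   list of Nm key vectors, positions taken from adjacent frames
    memory_values: list of Nm value vectors aligned with memory_keys
    Returns Nq read-out vectors (softmax-weighted sums of memory values).
    """
    dim = len(memory_values[0])
    readout = []
    for q in query_keys:
        # Dot-product similarity between this query position and every memory position.
        scores = [sum(a * b for a, b in zip(q, k)) for k in memory_keys]
        # Numerically stable softmax over the memory positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the memory values gives the temporal feature for this position.
        readout.append(
            [sum(w * v[d] for w, v in zip(weights, memory_values)) for d in range(dim)]
        )
    return readout


# Toy usage: one query position, two memory positions.
temporal = memory_read(
    query_keys=[[10.0, 0.0]],
    memory_keys=[[1.0, 0.0], [0.0, 1.0]],
    memory_values=[[1.0, 2.0], [3.0, 4.0]],
)
```

In the toy call the query aligns strongly with the first memory key, so the read-out lies close to the first memory value. A practical implementation would express the same computation as batched matrix products on feature maps.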

Paper Structure (28 sections, 11 equations, 11 figures, 6 tables)

This paper contains 28 sections, 11 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Comparison of model size and mean absolute error (MAE) on DAVSOD. Models situated closer to the bottom-left corner are more efficient and effective. DCF [zhang2021dynamic] and STVS [chen2021exploring] represent the latest methodologies: the former achieves a better MAE but contains numerous parameters, while the latter has fewer parameters but a worse MAE. Our method achieves the best MAE on the largest VSOD dataset while using fewer parameters. Moreover, its inference speed surpasses that of most prior methods.
  • Figure 2: Comparative overview of the architecture of different methods.
  • Figure 3: Overall structure of the model. Two adjacent frames [$x_{t-1},x_{t+1}$] are placed into the memory when the first frame $x_t$ is sent to the encoder. High-level temporal features ($E_t$) are obtained by the memory read operation. Finally, the two high-level features, $E_t$ (temporal branch) and $E_Q^{res5}$ (spatial branch), together with all the low-level features, are used to reconstruct the final saliency map.
  • Figure 4: Detailed structure of the decoder. The high-level temporal feature is fused with the low-level features of adjacent frames by FFS, and the spatial branch follows the same process. The final saliency map is obtained by combining the spatiotemporal features generated in three stages. To reduce the computational cost, the channels of all features are reduced to $64$ by a $1\times1$ convolutional layer prior to fusion.
  • Figure 5: Visualization of three feature fusion cases. We compare the results of fusing high- and low-level features with the proposed FFS and with skip concatenation. The features fused by FFS (column 3) show more accurate responses and finer details in salient regions.
  • ...and 6 more figures
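
For readers unfamiliar with the MAE metric referenced in Figure 1, here is a minimal sketch of how it is commonly computed for one frame: the average absolute difference between a predicted saliency map and its ground-truth mask, both scaled to [0, 1]. The function name and the nested-list representation are assumptions for illustration. The paper's evaluation code may differ, for example by averaging over all frames of a dataset.

```python
def mean_absolute_error(pred, gt):
    """MAE between a predicted saliency map and a ground-truth mask.

    pred, gt: 2-D lists (H x W) of floats in [0, 1] with identical shapes.
    Lower is better; 0.0 means the prediction matches the mask exactly.
    """
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("prediction and ground truth must have the same shape")
    count = sum(len(row) for row in pred)
    if count == 0:
        raise ValueError("maps must contain at least one pixel")
    # Sum the per-pixel absolute differences, then average over all pixels.
    total = sum(
        abs(p - g) for p_row, g_row in zip(pred, gt) for p, g in zip(p_row, g_row)
    )
    return total / count


# Toy 2x2 example: two pixels are exact, two are off by 0.5.
score = mean_absolute_error([[0.0, 1.0], [0.5, 0.5]], [[0.0, 1.0], [1.0, 0.0]])
```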