Saliency-Motion Guided Trunk-Collateral Network for Unsupervised Video Object Segmentation

Xiangyu Zheng, Wanyun Li, Songcheng He, Jianping Fan, Xiaoqiang Li, Wei Zhang

TL;DR

This work tackles unsupervised video object segmentation by balancing motion and appearance cues and leveraging the model’s own saliency. It introduces SMTC-Net, which combines a trunk-collateral encoder with an intrinsic saliency guided refinement module (ISRM) in a two-round decoding process, and employs a LoRA-based collateral path to capture motion-specific features with limited parameters. The method achieves state-of-the-art results on UVOS benchmarks (DAVIS-16, FBMS, YouTube-Objects) and four VSOD benchmarks, while reducing reliance on high-quality optical flow and avoiding heavy multi-stream fusion. These results demonstrate robust performance across diverse scenes and support practical deployment in UVOS and VSOD tasks.
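As a rough illustration of the LoRA-based collateral idea, the sketch below applies one shared trunk weight to both inputs and adds a low-rank update only when the input is optical flow. It is a minimal sketch in plain Python, not the authors' implementation: the class name `TrunkCollateralLinear`, the `rank` and `alpha` parameters, the zero initialization of the second low-rank factor, and the choice to apply the update only on the flow path are all assumptions borrowed from the standard LoRA recipe.

```python
import random


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]


class TrunkCollateralLinear:
    """Hypothetical layer: shared trunk weight plus a low-rank update for flow inputs."""

    def __init__(self, d_in, d_out, rank=2, alpha=2.0, seed=0):
        rng = random.Random(seed)
        # Trunk weight (d_out x d_in), shared by RGB and optical-flow inputs.
        self.trunk = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Collateral factors: A is (rank x d_in), B is (d_out x rank).
        self.lora_a = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # B starts at zero (assumed, as in standard LoRA), so both paths agree at first.
        self.lora_b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x, is_flow):
        out = matvec(self.trunk, x)
        if is_flow:
            # Motion-specific correction: scale * B (A x).
            delta = matvec(self.lora_b, matvec(self.lora_a, x))
            out = [o + self.scale * d for o, d in zip(out, delta)]
        return out

    def extra_params(self):
        """Number of parameters added by the collateral path."""
        rank = len(self.lora_a)
        return rank * len(self.lora_a[0]) + len(self.lora_b) * rank


layer = TrunkCollateralLinear(d_in=8, d_out=8, rank=2)
x = [0.1 * i for i in range(8)]
rgb_out = layer.forward(x, is_flow=False)
flow_out = layer.forward(x, is_flow=True)
```

With these toy sizes the collateral path adds 32 parameters on top of the 64 in the trunk weight, and the saving grows as the layer width increases relative to the rank.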

Abstract

Recent mainstream motion-appearance approaches to unsupervised video object segmentation (UVOS) use either a bi-encoder structure that encodes motion and appearance features separately or a uni-encoder structure that encodes them jointly. However, these methods fail to properly balance the motion-appearance relationship. Consequently, even with complex fusion modules for motion-appearance integration, the suboptimal extracted features degrade overall model performance. Moreover, the quality of optical flow varies across scenarios, so relying solely on optical flow is insufficient for high-quality segmentation. To address these challenges, we propose the Saliency-Motion guided Trunk-Collateral Network (SMTC-Net), which better balances the motion-appearance relationship and incorporates the model's intrinsic saliency information to enhance segmentation performance. Specifically, because optical flow maps are derived from RGB images, the two modalities share commonalities while also differing. Accordingly, we propose a novel Trunk-Collateral structure for motion-appearance UVOS. The shared trunk backbone captures the motion-appearance commonality, while the collateral branch learns the uniqueness of motion features. Furthermore, an Intrinsic Saliency guided Refinement Module (ISRM) is devised to efficiently leverage the model's intrinsic saliency information to refine high-level features and to provide pixel-level guidance for motion-appearance fusion, thereby enhancing performance without additional input. Experimental results show that SMTC-Net achieves state-of-the-art performance on three UVOS datasets (89.2% J&F on DAVIS-16, 76% J on YouTube-Objects, 86.4% J on FBMS) and on four standard video salient object detection (VSOD) benchmarks with notable gains, demonstrating its effectiveness and superiority over previous methods.
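For readers unfamiliar with the reported metrics, J is the region similarity (intersection over union between predicted and ground-truth masks), and J&F averages the mean J with the mean boundary accuracy F. The sketch below computes J for a single pair of binary masks and the J&F average. The function names are illustrative, the boundary measure F is taken as a given number rather than computed, and treating two empty masks as a perfect match is an assumed convention.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks (lists of rows)."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_mean, f_mean):
    """J&F: arithmetic mean of mean region similarity and mean boundary accuracy."""
    return (j_mean + f_mean) / 2


pred = [[0, 1, 1],
        [0, 1, 0]]
gt = [[0, 1, 0],
      [0, 1, 1]]
score = jaccard(pred, gt)  # 2 overlapping pixels out of 4 in the union -> 0.5
```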

Paper Structure

This paper contains 27 sections, 13 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Illustration of typical motion-appearance UVOS paradigms and the proposed trunk-collateral architecture. (a) Bi-Encoder architecture with separate encoders for images and optical flows. (b) Uni-Encoder architecture with a shared encoder for images and optical flows. (c) Our Trunk-Collateral architecture which consists of both shared and specific parameters for images and optical flows.
  • Figure 2: Examples of optical flow maps in scenarios where flow estimation partially fails or performs suboptimally. Cases 1–4 show, respectively, stationary objects, motion blur, co-moving background, and misidentification of background objects.
  • Figure 3: Overall pipeline of SMTC-Net, which consists of a trunk-collateral encoder, an intrinsic saliency guided refinement module (ISRM), and a multi-level decoder. Given an input image and its corresponding optical flow, the trunk-collateral encoder captures the shared attributes of motion and appearance features through the common trunk section, while encoding distinctive motion characteristics via the collateral branches alongside the trunk section. ISRM optimizes feature representations and enhances motion-appearance integration using the intrinsic saliency feature generated in the first decoding round. The final segmentation result is output in the second decoding round.
  • Figure 4: Illustration of the trunk-collateral structure in the Transformer block. (a) Trunk-collateral structure in the feed-forward network. (b) Trunk-collateral structure in the multi-head self-attention.
  • Figure 5: Illustration of the intrinsic saliency guided refinement module (ISRM). ISRM utilizes the intrinsic saliency feature produced in the first decoding round to optimize high-level representations, and guide motion-appearance integration to advance the ultimate performance.
  • ...and 3 more figures
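
The two-round decoding flow described in the Figure 3 and Figure 5 captions can be sketched as control flow. This is only a toy sketch under loud assumptions: `encode`, `decode`, and `fuse` are hypothetical stand-ins operating on flat lists of per-pixel numbers, and the saliency-weighted fusion rule is invented for illustration. None of it reflects the actual layers of SMTC-Net; only the order of the steps follows the captions.

```python
import math


def encode(rgb, flow):
    """Stand-in encoder: returns the inputs as 'appearance' and 'motion' features."""
    return list(rgb), list(flow)


def decode(features):
    """Stand-in decoder: maps each feature to a score in (0, 1) with a logistic function."""
    return [1 / (1 + math.exp(-f)) for f in features]


def fuse(appearance, motion, saliency=None):
    """Stand-in fusion: plain average, or saliency-weighted motion when guidance is given."""
    if saliency is None:
        return [(a + m) / 2 for a, m in zip(appearance, motion)]
    # Hypothetical rule: motion contributes more where first-round saliency is high.
    return [a + s * m for a, m, s in zip(appearance, motion, saliency)]


def segment(rgb, flow):
    appearance, motion = encode(rgb, flow)
    # Round 1: decode once to obtain the model's own (intrinsic) saliency.
    saliency = decode(fuse(appearance, motion))
    # ISRM stand-in: reuse that saliency as per-pixel guidance for fusion.
    refined = fuse(appearance, motion, saliency)
    # Round 2: decode the refined features into the final prediction.
    return decode(refined)


mask = segment(rgb=[1.0, -1.0], flow=[2.0, -2.0])
```

The point of the sketch is that the second round needs no extra input: its guidance comes entirely from the first-round output.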