Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation

Suhwan Cho, Minhyeok Lee, Jungho Lee, MyeongAh Cho, Seungwook Park, Jaeyeob Kim, Hyunsung Jang, Sangyoun Lee

TL;DR

This work tackles unsupervised video object segmentation (VOS) by reducing reliance on motion cues through a motion-as-option network that can operate with or without motion input. It couples independent appearance and motion encoders, a collaborative training strategy that mixes VOS and salient object detection (SOD) data, and an adaptive test-time output selection step that picks the more reliable of the two predictions. The approach achieves state-of-the-art performance on major benchmarks while maintaining real-time inference, and includes an acceleration workflow to mitigate inference overhead. Together, these contributions offer a robust, practical baseline for future VOS research and applications.

Abstract

Unsupervised video object segmentation aims to detect the most salient object in a video without any external guidance regarding the object. Salient objects often exhibit distinctive movements compared to the background, and recent methods leverage this by combining motion cues from optical flow maps with appearance cues from RGB images. However, because optical flow maps are often closely correlated with segmentation masks, networks can become overly dependent on motion cues during training, leading to vulnerability when faced with confusing motion cues and resulting in unstable predictions. To address this challenge, we propose a novel motion-as-option network that treats motion cues as an optional component rather than a necessity. During training, we randomly input RGB images into the motion encoder instead of optical flow maps, which implicitly reduces the network's reliance on motion cues. This design ensures that the motion encoder is capable of processing both RGB images and optical flow maps, leading to two distinct predictions depending on the type of input provided. To make the most of this flexibility, we introduce an adaptive output selection algorithm that determines the optimal prediction during testing.
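
The training-time idea in the abstract, randomly feeding an RGB image to the motion encoder in place of the optical flow map, can be illustrated with a minimal sketch. The function name `choose_motion_input` and the parameter `rgb_prob` are hypothetical and are not taken from the paper; the actual substitution probability and data pipeline are not stated in this summary. The sketch also assumes that the flow map and the RGB image share a shape, so that one encoder can accept either.

```python
import random


def choose_motion_input(rgb, flow, rgb_prob=0.5, rng=random):
    """Pick what the motion encoder receives for one training sample.

    With probability `rgb_prob` the RGB image replaces the optical flow map.
    `rgb_prob` is an illustrative hyperparameter, not a value from the paper.
    Returns the chosen input together with a tag naming its type.
    """
    if not 0.0 <= rgb_prob <= 1.0:
        raise ValueError("rgb_prob must be in [0, 1]")
    # rng.random() is uniform on [0, 1), so rgb_prob=0 never selects RGB
    # and rgb_prob=1 always does.
    use_rgb = rng.random() < rgb_prob
    return (rgb, "rgb") if use_rgb else (flow, "flow")
```

In a training loop, the returned input would be passed to the motion encoder while the appearance encoder always receives the RGB image, so the network cannot rely on flow being present for every sample.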

Paper Structure (18 sections, 9 equations, 10 figures, 11 tables, 3 algorithms)

This paper contains 18 sections, 9 equations, 10 figures, 11 tables, 3 algorithms.

Figures (10)

  • Figure 1: Visualized comparison of (a) a conventional two-stream VOS network and (b) our proposed motion-as-option network. Unlike existing methods, our motion-as-option network is designed to handle both RGB images and optical flow maps as motion inputs.
  • Figure 2: Architecture of our proposed network. When motion cues are leveraged, an optical flow map serves as the motion input. Alternatively, when motion cues are not utilized, an RGB image is used as the motion input. The network extracts appearance and motion features through separate encoders, which are then fused together and gradually decoded to produce the final segmentation mask.
  • Figure 3: Visualized pipeline of the adaptive output selection algorithm. The motion-as-option network produces two different outputs: one using an RGB image as the motion input and the other using an optical flow map as the motion input. The final segmentation mask is then obtained by evaluating each output based on the overall confidence scores.
  • Figure 4: Qualitative comparison of flow-based approaches in challenging flow scenarios.
  • Figure 5: Automatic object segmentation results on videos.
  • ...and 5 more figures
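
The selection step described in the Figure 3 caption, choosing between the RGB-driven and flow-driven outputs by an overall confidence score, might look roughly like the sketch below. The caption does not define the score, so the measure used here (mean distance of per-pixel foreground probabilities from 0.5) is an assumption chosen for illustration, and the names `confidence` and `select_output` are hypothetical.

```python
def confidence(prob_map):
    """Overall confidence of a soft mask, in [0, 1].

    Assumed measure (not from the paper): mean distance of each pixel's
    foreground probability from 0.5, rescaled so that a fully binary
    mask scores 1 and a fully uncertain one scores 0.
    """
    pixels = [p for row in prob_map for p in row]
    if not pixels:
        raise ValueError("empty probability map")
    return 2.0 * sum(abs(p - 0.5) for p in pixels) / len(pixels)


def select_output(pred_rgb, pred_flow):
    """Return the more confident of the two predictions and its tag.

    `pred_rgb` is the output obtained with an RGB image as motion input,
    `pred_flow` the output obtained with an optical flow map.
    Ties go to the flow-based output (an arbitrary choice for this sketch).
    """
    if confidence(pred_flow) >= confidence(pred_rgb):
        return pred_flow, "flow"
    return pred_rgb, "rgb"
```

Under this assumed score, a flow-based prediction whose probabilities hover near 0.5 (as can happen when the flow map is confusing) loses to a crisp RGB-based prediction, which matches the intent the caption describes.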