ClickVOS: Click Video Object Segmentation

Pinxue Guo, Lingyi Hong, Xinyu Zhou, Shuyong Gao, Wanyun Li, Jinglun Li, Zhaoyu Chen, Xiaoqiang Li, Wei Zhang, Wenqiang Zhang

TL;DR

An end-to-end baseline approach called Attention Before Segmentation (ABS), motivated by the human attention process, is proposed; it uses the given point in the first frame to perceive the target object through a concise yet effective segmentation attention.

Abstract

The Video Object Segmentation (VOS) task aims to segment objects in videos. However, previous settings either require time-consuming manual masks of the target objects in the first frame during inference or lack the flexibility to specify arbitrary objects of interest. To address these limitations, we propose a setting named Click Video Object Segmentation (ClickVOS), which segments objects of interest across the whole video given a single click per object in the first frame. We also provide the extended datasets DAVIS-P and YouTubeVOSP, which add point annotations to support this task. ClickVOS has significant practical applications and research implications because indicating an object takes only 1-2 seconds of interaction, whereas annotating an object mask takes several minutes. However, ClickVOS also presents increased challenges. To address this task, we propose an end-to-end baseline approach called Attention Before Segmentation (ABS), motivated by the human attention process. ABS uses the given point in the first frame to perceive the target object through a concise yet effective segmentation attention. Although the initial object mask may be inaccurate, in ABS the initially imprecise mask can self-heal as the video goes on instead of deteriorating through error accumulation. We attribute this to the improvement memory we design, which continuously records a stable global object memory and updates a detailed dense memory. In addition, we explore various baselines built from off-the-shelf algorithms in related fields, which could provide insights for further exploration of ClickVOS. The experimental results demonstrate the superiority of the proposed ABS approach. Extended datasets and code will be available at https://github.com/PinxueGuo/ClickVOS.
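To make the idea of turning one click into a mask through attention more concrete, here is a minimal sketch, not the authors' implementation. It assumes a small grid of per-pixel feature vectors, uses the raw feature at the clicked pixel as a stand-in for a learned object token, and marks a pixel as foreground when its softmax attention weight exceeds the uniform weight. The function name `click_attention_mask` and the thresholding rule are illustrative assumptions; ABS learns its tokens and attention layers end to end.

```python
import math


def click_attention_mask(features, click):
    """Toy click-to-mask attention (illustrative assumption, not the ABS model).

    features: H x W grid, each entry a feature vector (list of floats).
    click:    (row, col) of the single click in the first frame.
    Returns an H x W grid of booleans (True = predicted object pixel).
    """
    row, col = click
    # Hypothetical object token: the feature vector at the clicked pixel.
    query = features[row][col]
    scale = math.sqrt(len(query))

    # Scaled dot-product score between the object token and every pixel.
    scores = [
        sum(q * k for q, k in zip(query, pixel)) / scale
        for line in features
        for pixel in line
    ]

    # Numerically stable softmax over all pixels.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Assumed decision rule: keep pixels that receive more than the
    # uniform share of attention.
    uniform = 1.0 / len(weights)
    flat = [w > uniform for w in weights]

    width = len(features[0])
    return [flat[i:i + width] for i in range(0, len(flat), width)]


# Usage: a 3 x 3 frame whose top-left 2 x 2 block shares the clicked feature.
obj, bg = [1.0, 0.0], [0.0, 1.0]
frame = [
    [obj, obj, bg],
    [obj, obj, bg],
    [bg, bg, bg],
]
mask = click_attention_mask(frame, (0, 0))
```

With real features the mask from a single click can be imprecise, which is the situation the abstract's improvement memory is meant to correct in later frames.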

Paper Structure (29 sections, 10 equations, 7 figures, 7 tables)

This paper contains 29 sections, 10 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Different VOS tasks and the interaction time comparison between SemiVOS and ClickVOS. SemiVOS requires time-consuming manual masks of the target objects in the first frame during inference, while UnVOS lacks the flexibility to specify arbitrary objects of interest.
  • Figure 2: The pipeline of the proposed ABS approach for the ClickVOS problem. We extract bimodal features that contain appearance and motion information by the Bimodal Enhance Encoder. At the first frame, Point Tokenizer encodes the object tokens with identity embedding according to the given points, and Segment Attention estimates the initial object masks, which may be imprecise. But as the video progresses, a growing memory is maintained, leading to the self-healing of object masks.
  • Figure 3: Details of (a) the Segment Attention, which achieves segmentation with simple attention layers, and (b) the Improvement Memory with Point Tokenizer.
  • Figure 4: Illustration of the baseline explorations that use off-the-shelf algorithms from related fields to address the ClickVOS task.
  • Figure 5: The self-healing process in our proposed ABS approach. White dashed boxes show segmentation self-healing as the video goes on.
  • ...and 2 more figures
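
The captions above describe a memory that grows as the video progresses, combining a stable global object memory with a detailed dense memory that keeps being updated. The sketch below illustrates one way such a two-part memory could be organised. It is an assumption-laden toy, not the paper's Improvement Memory: the class name `ImprovementMemorySketch`, the exponential moving average for the global token, and the replace-with-latest-frame rule for the dense memory are all hypothetical choices made for illustration.

```python
class ImprovementMemorySketch:
    """Toy two-part object memory (hypothetical update rules)."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.global_token = None    # stable, slowly changing object summary
        self.dense_features = None  # detailed per-pixel features, latest frame

    def update(self, features, mask):
        """features: H x W grid of feature vectors; mask: H x W booleans."""
        # Collect the features of pixels currently predicted as object.
        selected = [
            pixel
            for line, mask_line in zip(features, mask)
            for pixel, keep in zip(line, mask_line)
            if keep
        ]
        if not selected:
            return  # empty prediction: leave the memory untouched

        dim = len(selected[0])
        mean = [sum(p[d] for p in selected) / len(selected) for d in range(dim)]

        if self.global_token is None:
            self.global_token = mean
        else:
            # Assumed rule: exponential moving average keeps the global
            # memory stable against a single noisy frame.
            m = self.momentum
            self.global_token = [
                m * g + (1.0 - m) * x for g, x in zip(self.global_token, mean)
            ]

        # Assumed rule: the dense memory follows the most recent frame.
        self.dense_features = [[list(p) for p in line] for line in features]


# Usage: two one-row frames with a two-pixel width.
memory = ImprovementMemorySketch(momentum=0.5)
memory.update([[[1.0, 0.0], [0.0, 1.0]]], [[True, False]])
memory.update([[[0.0, 0.0], [0.0, 1.0]]], [[True, False]])
```

The design point this illustrates is the split in update speed: a slowly changing summary limits the damage of one bad mask, while the frame-level memory keeps fine detail current.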