Multi-Granularity Video Object Segmentation

Sangbeom Lim, Seongchan Kim, Seungjun An, Seokju Cho, Paul Hongsuck Seo, Seungryong Kim

TL;DR

This work tackles open-world video segmentation by introducing MUG-VOS, a large-scale dataset annotated with multi-granularity masks, enabling training and evaluation beyond salient objects. It proposes a SAM-based data collection pipeline to automatically generate dense, varied masks and a Memory-based Mask Propagation Model (MMPM) that retains target information over time via temporal and sequential memory and a memory-augmented attention mechanism. Empirical results show MMPM achieves state-of-the-art performance on MUG-VOS and transfers well to DAVIS-2017, outperforming SAM-based baselines and existing VOS methods. The dataset and model advance open-world, multi-granularity video understanding for interactive editing and open-world perception.

Abstract

Current benchmarks for video segmentation are limited to annotating only salient objects (i.e., foreground instances). Despite their impressive architectural designs, previous works trained on these benchmarks have struggled to adapt to real-world scenarios. Thus, a new video segmentation dataset aimed at tracking multi-granularity segmentation targets in the video scene is necessary. In this work, we aim to generate a multi-granularity video segmentation dataset that is annotated with both salient and non-salient masks. To achieve this, we propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset that includes various types and granularities of mask annotations. We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation. In addition, we present a memory-based mask propagation model (MMPM), trained and evaluated on the MUG-VOS dataset, which achieves the best performance among existing video object segmentation methods and SAM-based video segmentation methods. The project page is available at https://cvlab-kaist.github.io/MUG-VOS.

Paper Structure

This paper contains 27 sections, 7 equations, 14 figures, 7 tables, 1 algorithm.

Figures (14)

  • Figure 1: MUG-VOS data collection pipeline. We propose a fully automatic data collection pipeline that curates multi-granularity mask tracks. Using SAM, we generate a large number of masks per frame and establish temporal connections through the IoU between the mask warped from the previous frame and the mask from the current frame.
  • Figure 2: MUG-VOS test dataset. To mitigate the accumulation of errors within the automated process, annotators were directed to either approve or reject the mask tracks generated by the data collection pipeline. In instances where errors were detected, the annotators performed frame-level refinements of the masks utilizing the Segment Anything Model [kirillov2023segment].
  • Figure 3: Qualitative comparison between MMPM, DEVA [cheng2023tracking], PerSAM-F [zhang2023personalize], and SAM-PT [rajivc2023segment] from MUG-VOS test set.
  • Figure 4: MMPM overview. We introduce the MMPM model, which generates masks based on previous results. Starting from an initial mask that indicates the target object, the MMPM model consistently tracks and segments the target throughout the entire video. Sequential memory stores low-resolution features, updated at every selected frame, while temporal memory retains high-resolution features from previous frames, capturing a variety of information gathered from multiple frames.
  • Figure A.1: MUG-VOS annotation tool. We developed a mask annotation tool for curating the video segmentation dataset.
  • ...and 9 more figures
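
The temporal linking step described in the Figure 1 caption (matching a mask warped from the previous frame to a mask in the current frame by IoU) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mask_iou` and `link_masks`, the representation of masks as sets of pixel coordinates, the greedy best-match rule, and the 0.5 threshold are all assumptions made for illustration, and the warping itself (for example, by optical flow) is assumed to have happened already.

```python
from typing import Dict, Optional, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]  # a mask as the set of (row, col) pixels it covers


def rect(r0: int, c0: int, r1: int, c1: int) -> Mask:
    """Toy helper: a rectangular mask covering rows r0..r1-1, cols c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets; 0.0 if both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def link_masks(
    warped_prev: Dict[int, Mask],
    current: Dict[int, Mask],
    iou_threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> Dict[int, Optional[int]]:
    """Map each track id to the best-overlapping current mask id.

    A track maps to None when no current mask reaches the threshold.
    """
    links: Dict[int, Optional[int]] = {}
    for track_id, warped in warped_prev.items():
        best_id, best_iou = None, 0.0
        for mask_id, mask in current.items():
            iou = mask_iou(warped, mask)
            if iou > best_iou:
                best_id, best_iou = mask_id, iou
        links[track_id] = best_id if best_iou >= iou_threshold else None
    return links


# Masks warped from the previous frame, keyed by track id.
warped_prev = {0: rect(0, 0, 4, 4), 1: rect(10, 10, 12, 12)}
# Masks generated in the current frame, keyed by mask id.
current = {7: rect(0, 1, 4, 5), 8: rect(20, 20, 22, 22)}

links = link_masks(warped_prev, current)
print(links)
```

In this toy input, track 0 overlaps mask 7 with an IoU of 12/20 = 0.6 and is linked to it, while track 1 overlaps nothing and is left unlinked. The greedy rule allows two tracks to claim the same current mask; a one-to-one assignment would need an extra step that this sketch omits.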