MOSE: A New Dataset for Video Object Segmentation in Complex Scenes

Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip H. S. Torr, Song Bai

TL;DR

MOSE presents a large-scale, complex-scene video object segmentation benchmark with 2,149 videos, 5,200 objects, and 431,725 masks across 36 categories, designed to stress occlusions, disappearance-reappearance, and crowding. The authors benchmark 18 VOS methods across four settings (mask-based semi-supervised, box-based semi-supervised, unsupervised, and interactive), revealing substantial performance gaps compared with DAVIS/YouTube-VOS (e.g., top semi-supervised MOSE J&F around 59.4%, vs ~90% on prior datasets). MOSE's statistics emphasize long-duration videos and frequent occlusions, making re-identification and temporal association core challenges. The paper analyzes results, discusses limitations, and suggests directions such as stronger re-identification, occlusion-aware segmentation, handling small objects and crowds, and long-term efficient VOS, while providing a public dataset release for the community.

Abstract

Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence. State-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets. However, since the target objects in these existing datasets are usually relatively salient, dominant, and isolated, VOS under complex scenes has rarely been studied. To revisit VOS and make it more applicable in the real world, we collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study tracking and segmenting objects in complex environments. MOSE contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725 high-quality object segmentation masks. The most notable feature of the MOSE dataset is complex scenes with crowded and occluded objects. The target objects in the videos are commonly occluded by others and disappear in some frames. To analyze MOSE, we benchmark 18 existing VOS methods under 4 different settings and conduct comprehensive comparisons. The experiments show that current VOS algorithms cannot perceive objects well in complex scenes. For example, under the semi-supervised VOS setting, the highest J&F achieved by existing state-of-the-art VOS methods is only 59.4% on MOSE, much lower than their ~90% J&F performance on DAVIS. The results reveal that although excellent performance has been achieved on existing benchmarks, there are unresolved challenges under complex scenes, and more effort is needed to explore these challenges in the future. The proposed MOSE dataset has been released at https://henghuiding.github.io/MOSE.
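
The J&F score quoted above is conventionally the mean of a region similarity term J (the Jaccard index, or intersection over union, between predicted and ground-truth masks) and a boundary accuracy term F. As a rough illustration of the J half only, the following minimal sketch computes the Jaccard index for small binary masks. The function names (region_similarity, mean_j) and the list-of-lists mask format are assumptions made for this example, and the code is not the MOSE or DAVIS evaluation toolkit; the boundary term F is not implemented here.

```python
from typing import Sequence

# Hypothetical mask format for this sketch: a 2D grid where 1 = object, 0 = background.
Mask = Sequence[Sequence[int]]


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Jaccard index (intersection over union) of two same-sized binary masks."""
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("masks must have the same shape")
    inter = 0
    union = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: if both masks are empty (the object has disappeared and the
    # model correctly predicts nothing), the frame counts as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_j(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Average the per-frame Jaccard index over a sequence of frames."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("need two non-empty sequences of equal length")
    scores = [region_similarity(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred = [[1, 1, 0], [0, 1, 0]]
    gt = [[1, 0, 0], [0, 1, 1]]
    print(region_similarity(pred, gt))  # 2 shared pixels / 4 pixels in the union
```

The empty-mask convention matters for a dataset like MOSE, where targets frequently disappear: a frame in which the object is absent and nothing is predicted should not be penalized, while any spurious prediction on such a frame yields J = 0.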

Paper Structure (20 sections, 2 equations, 2 figures, 6 tables)

This paper contains 20 sections, 2 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Examples of video clips from the coMplex video Object SEgmentation (MOSE) dataset. The selected target objects are masked in orange. The most notable feature of MOSE is complex scenes, including the disappearance-reappearance of objects, small/inconspicuous objects, heavy occlusions, crowded environments, etc. For example, the target player in the 2nd row turns around when reappearing in the 4th and 5th columns after disappearing in the 3rd column, bringing challenges in re-identifying him. Most videos in MOSE contain crowded and occluded objects with the target object seldom being the salient one. The goal of MOSE dataset is to provide a platform that promotes the development of more comprehensive and robust video object segmentation algorithms.
  • Figure 2: Failure cases of the BOR indicator. Samples in the first row have high BOR values but show little or no occlusion, whereas samples in the second row have very small BOR values despite containing severe occlusions.