MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes

Henghui Ding, Kaining Ying, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang, Philip H. S. Torr, Song Bai

TL;DR

MOSEv2 introduces a significantly more challenging video object segmentation dataset to narrow the gap between benchmark performance and real-world robustness. By expanding scale (5,024 videos, 701,976 masks) and complexity (adverse weather, low light, camouflage, non-physical targets, knowledge-dependent scenarios), MOSEv2 reveals pronounced drops for leading VOS and VOT methods and motivates targeted improvements. The authors propose advanced evaluation metrics including an adaptive boundary measure and disappearance/reappearance-specific scores, and demonstrate practical tricks such as RCMS, MQF, MSS, and LVT that substantially improve SAM2-based methods. Overall, MOSEv2 serves as a comprehensive platform to drive robust, real-world video understanding across segmentation and tracking tasks.

Abstract

Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To bridge this gap, the coMplex video Object SEgmentation (MOSEv1) dataset was introduced to facilitate VOS research in complex scenes. Building on the foundations and insights of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces much greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), and scenarios requiring external knowledge. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops on MOSEv2. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and observe similar declines, demonstrating that MOSEv2 poses challenges across tasks. These results highlight that despite strong performance on existing datasets, current VOS methods still fall short under real-world complexities. Based on our analysis of the observed challenges, we further propose several practical tricks that enhance model performance. MOSEv2 is publicly available at https://MOSE.video.

Paper Structure

This paper contains 19 sections, 1 equation, 11 figures, 12 tables, 1 algorithm.

Figures (11)

  • Figure 1: Example videos from the proposed MOSEv2 dataset. Selected target objects are masked in orange. The target in case ① is enlarged for better visualization. The most notable features of MOSEv2 include both challenges inherited from MOSEv1, such as object disappearance-reappearance (①-⑩), small/inconspicuous objects (①,③,⑥), heavy occlusions (except ⑤), and crowded scenes (①,②), as well as newly introduced complexities such as adverse weather (⑥), low-light environments (⑤-⑦), multi-shots (⑧), camouflaged objects (⑤), non-physical objects (④), and knowledge dependency (⑨,⑩). The goal of the MOSEv2 dataset is to provide a platform that promotes the development of more comprehensive and robust video object segmentation algorithms.
  • Figure 2: Category distributions of MOSEv1 and the proposed MOSEv2.
  • Figure 3: Occlusion evaluation protocol. (a) BOR: Bounding-box Occlusion Rate (OVIS), (b) AOR: Amodal-mask Occlusion Rate, (c) MLLMOR: MLLM-assisted Occlusion Rate.
  • Figure 4: Mask size distribution, normalized by video resolution.
  • Figure 5: Video length distributions. Compared to MOSEv1, MOSEv2 includes more long videos, with the longest reaching 7,825 frames.
  • ...and 6 more figures
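
As a rough illustration of the normalization mentioned in the Figure 4 caption, the sketch below computes a mask's size as a fraction of the frame area. It assumes that "normalized by video resolution" means foreground pixel count divided by frame height times width; the caption does not spell out the formula, and the function name `normalized_mask_area` is hypothetical rather than taken from the MOSEv2 tooling.

```python
def normalized_mask_area(mask):
    """Return the fraction of frame pixels covered by a binary mask.

    `mask` is a list of rows, each a list of 0/1 values.
    Assumption: normalization divides by frame area (height * width).
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    if height == 0 or width == 0:
        raise ValueError("mask must have at least one row and one column")
    # Count every non-zero pixel as foreground.
    foreground = sum(1 for row in mask for value in row if value)
    return foreground / (height * width)


# Toy 4x6 frame containing a 2x3 object.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
ratio = normalized_mask_area(mask)
```

Dividing by frame area makes mask sizes comparable across videos of different resolutions, which is what a distribution over a mixed-resolution dataset needs.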