MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes

Henghui Ding; Kaining Ying; Chang Liu; Shuting He; Xudong Jiang; Yu-Gang Jiang; Philip H. S. Torr; Song Bai

TL;DR

MOSEv2 is a significantly more challenging video object segmentation (VOS) dataset, built to narrow the gap between benchmark performance and real-world robustness. It expands both scale (5,024 videos, 701,976 masks) and complexity (adverse weather, low light, camouflage, non-physical targets, knowledge-dependent scenarios), and it exposes pronounced performance drops for leading VOS and video object tracking (VOT) methods. The authors also propose new evaluation metrics, including an adaptive boundary measure and disappearance/reappearance-specific scores, and show that practical tricks such as RCMS, MQF, MSS, and LVT substantially improve SAM2-based methods. Overall, MOSEv2 serves as a comprehensive platform for driving robust, real-world video understanding across segmentation and tracking tasks.

Abstract

Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To bridge this gap, the coMplex video Object SEgmentation (MOSEv1) dataset was introduced to facilitate VOS research in complex scenes. Building on the foundations and insights of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces much greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), and scenarios requiring external knowledge. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops on MOSEv2. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and observe similar declines, demonstrating that MOSEv2 poses challenges across tasks. These results highlight that despite strong performance on existing datasets, current VOS methods still fall short under real-world complexities. Based on our analysis of the observed challenges, we further propose several practical tricks that enhance model performance. MOSEv2 is publicly available at https://MOSE.video.
