SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang

TL;DR

This work addresses the vulnerability of traditional VOS methods to drastic appearance changes and occlusions by shifting from appearance-based matching to concept-driven segmentation. It introduces Segment Concept (SeC), which progressively builds object-level concepts using LVLMs and applies a scene-adaptive activation strategy to balance semantic reasoning with efficient pixel-level matching. A new benchmark, SeCVOS, evaluates models on complex, multi-shot scenarios that require high-level semantic understanding; on this benchmark, SeC improves over SAM 2.1 by 11.8 points. The results indicate that integrating concept-level reasoning with memory-based tracking yields robust, semantically aware VOS, suggesting a promising direction for future video understanding systems and benchmark development.

Abstract

Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable advances with recent efforts, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, neglecting the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target based on processed frames, realizing robust segmentation of follow-up frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational efforts based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware video object segmentation.
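
The adaptive balance between LVLM-based reasoning and feature matching described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how scene complexity is measured, so the histogram-distance score, the `threshold` value, and every function name below are assumptions chosen for illustration. The sketch only decides, per frame, whether the costlier concept-guided path would be activated, and reports the resulting fraction of frames (comparable in spirit to the concept guidance ratio in Figure 2).

```python
from typing import List, Sequence

# Hypothetical frame type: flattened grayscale pixel values in [0, 255].
Frame = Sequence[int]


def histogram(frame: Frame, bins: int = 8) -> List[float]:
    """Normalized intensity histogram of one frame."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [c / total for c in counts]


def scene_change_score(prev: Frame, curr: Frame, bins: int = 8) -> float:
    """Total-variation distance between two histograms, in [0, 1].

    Assumption: a stand-in for whatever scene-complexity signal SeC uses.
    """
    hp, hc = histogram(prev, bins), histogram(curr, bins)
    return 0.5 * sum(abs(a - b) for a, b in zip(hp, hc))


def plan_concept_activation(frames: List[Frame], threshold: float = 0.3) -> List[bool]:
    """Mark frames where the concept-guided (LVLM) path would run.

    Frames below the threshold would use plain feature matching.
    The first frame is never marked because it carries the initial prompt.
    """
    plan = [False] * len(frames)
    for i in range(1, len(frames)):
        plan[i] = scene_change_score(frames[i - 1], frames[i]) > threshold
    return plan


def guidance_ratio(plan: List[bool]) -> float:
    """Fraction of frames on which concept guidance is activated."""
    return sum(plan) / len(plan) if plan else 0.0


if __name__ == "__main__":
    # Two dark frames, then an abrupt cut to two bright frames.
    video = [[10] * 16, [12] * 16, [200] * 16, [205] * 16]
    plan = plan_concept_activation(video)
    print(plan, guidance_ratio(plan))
```

In this toy video only the frame right after the cut exceeds the threshold, so concept guidance would run on one frame out of four while the rest rely on matching. Raising `threshold` makes activation sparser; lowering it trades compute for more frequent semantic reasoning.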

Paper Structure

This paper contains 27 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Overview of our Segment Concept (SeC) framework. Left: Compared to SAM 2, our model maintains better target tracking under severe appearance changes and scene transitions by leveraging concept-level guidance. Right: Quantitative results show that SeC consistently outperforms strong baselines, especially in scenarios involving multiple scene changes.
  • Figure 2: $\mathcal{J}\&\mathcal{F}$ curve as a function of the concept guidance ratio on SeCVOS. Sparse activation (e.g., under 10%) already achieves strong performance.
  • Figure 3: Qualitative comparison between SAM 2 and SeC (ours) on the SeCVOS benchmark.
  • Figure 4: Example video sequences from the SeCVOS benchmark with overlaid target masks. Each row corresponds to frames from a single video sequence, illustrating the annotated object masks.
  • Figure 5: Example video sequences and corresponding referring expressions from the SeCVOS benchmark.
  • ...and 2 more figures