MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
Ming Dai, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang
TL;DR
MomentSeg addresses RefVOS and Temporal Sentence Grounding (TSG) with a unified framework that learns key moments directly, removing reliance on external keyframe models. It introduces three components: a [FIND] token for frame-level temporal matching; a Moment-Centric Sampling (MCS) strategy that samples densely around text-relevant moments using a similarity distribution $\ell \in \mathbb{R}^{N_f \times L_t}$; and a Bidirectional Anchor-updated Propagation (BAP) mechanism for robust mask propagation from temporal anchors. Training uses both TSG and RefVOS signals. At inference, MCS selects informative frames and BAP maintains segmentation quality over long sequences. The method achieves state-of-the-art results across multiple benchmarks and data regimes, with gains of about $5\%$ on MeViS and $6\%$ on ReVOS. Code and detailed reproducibility material are to be released.
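To make the idea of frame-level temporal matching concrete, the following is a minimal sketch, not the authors' implementation. It assumes one embedding vector per frame and one embedding for the [FIND] token, which simplifies the $N_f \times L_t$ distribution mentioned above to a single score per frame. The function names `cosine` and `frame_relevance` and the `temperature` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def frame_relevance(find_embedding, frame_embeddings, temperature=0.1):
    """Softmax distribution over frames from cosine similarities.

    Assumption: one vector per frame and one query vector for [FIND].
    """
    sims = [cosine(find_embedding, f) / temperature for f in frame_embeddings]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    query = [1.0, 0.0]
    frames = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]
    print(frame_relevance(query, frames))
```

Under these assumptions, the frame whose embedding best aligns with the query receives the largest probability, and that frame can serve as the key moment for sampling.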
Abstract
Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlook essential temporal cues, while the latter increase system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated [FIND] token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which uses the most relevant moment as the starting point for high-quality mask initialization and dynamically updates at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg
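The sampling and propagation ideas described above can be illustrated with a short sketch. It is a simplified illustration under stated assumptions, not the method as implemented in the paper. It assumes per-frame relevance scores are already available, uses a fixed window around the single top-scoring frame as the "informative moment", and samples uniformly outside that window. The names `moment_centric_sample` and `bidirectional_order`, and the parameters `num_dense`, `num_sparse`, and `radius`, are hypothetical.

```python
def moment_centric_sample(scores, num_dense, num_sparse, radius):
    """Pick frame indices: dense near the top-scoring frame, sparse elsewhere.

    scores: per-frame relevance values (higher means more relevant).
    num_dense: maximum number of frames taken from the window around the peak.
    num_sparse: maximum number of frames taken uniformly outside the window.
    radius: half-width of the dense window, in frames (an assumed heuristic).
    """
    n = len(scores)
    anchor = max(range(n), key=lambda i: scores[i])
    lo, hi = max(0, anchor - radius), min(n - 1, anchor + radius)

    # Dense part: frames in the window, closest to the anchor first.
    window = sorted(range(lo, hi + 1), key=lambda i: abs(i - anchor))
    dense = window[:num_dense]

    # Sparse part: evenly spaced picks from the frames outside the window.
    outside = [i for i in range(n) if i < lo or i > hi]
    sparse = set()
    if outside and num_sparse > 0:
        step = len(outside) / num_sparse
        for k in range(num_sparse):
            sparse.add(outside[int(k * step + step / 2)])

    return anchor, sorted(set(dense) | sparse)


def bidirectional_order(anchor, num_frames):
    """Frame visiting orders for propagating outward from the anchor.

    Returns (forward, backward); both start at the anchor frame.
    """
    forward = list(range(anchor, num_frames))
    backward = list(range(anchor, -1, -1))
    return forward, backward


if __name__ == "__main__":
    scores = [0.1] * 20
    scores[12] = 0.9
    anchor, frames = moment_centric_sample(scores, num_dense=5, num_sparse=4, radius=2)
    print(anchor, frames)
    print(bidirectional_order(anchor, len(scores)))
```

In this sketch the anchor frame is always among the sampled frames when `num_dense` is at least 1, so a mask initialized there can be propagated in both temporal directions. How the anchor is refreshed at later sampled points is left out, since the abstract does not specify it.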
