MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding

Ming Dai, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang

TL;DR

MomentSeg addresses RefVOS and Temporal Sentence Grounding (TSG) with a unified framework that learns key moments directly, removing reliance on external keyframe models. It introduces a [FIND] token for frame-level temporal matching, a Moment-Centric Sampling (MCS) strategy that densely samples around text-relevant moments using a similarity distribution $\ell \in \mathbb{R}^{N_f \times L_t}$, and a Bidirectional Anchor-updated Propagation (BAP) mechanism for robust mask propagation from temporal anchors. Training leverages both TSG and RefVOS signals, while inference uses MCS to select informative frames and BAP to maintain segmentation quality over long sequences, achieving state-of-the-art results across multiple benchmarks and data regimes, with notable gains of about $5\%$ on MeViS and $6\%$ on ReVOS. The approach demonstrates strong temporal reasoning and efficient sampling in multimodal video understanding, with detailed reproducibility material and code to be released.

Abstract

Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlook essential temporal cues, while the latter increase system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated [FIND] token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which leverages the most relevant moment as the starting point for high-quality mask initialization and dynamically updates at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg
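
The dense-near-the-moment, sparse-elsewhere idea behind MCS can be illustrated with a small sketch. This is not the authors' implementation: the function name `moment_centric_sample`, its parameters (`num_dense`, `num_sparse`, `window`), and the rule of taking the top-scoring frames inside a fixed window around the peak are all assumptions made for illustration. The only thing taken from the abstract is the overall behavior: frames that match the text are sampled densely, and the rest of the video is sampled sparsely for global context.

```python
def moment_centric_sample(scores, num_dense, num_sparse, window):
    """Pick frame indices: dense near the best-matching frame, sparse elsewhere.

    scores     : per-frame text relevance, one float per frame (hypothetical input)
    num_dense  : maximum number of frames taken from the window around the peak
    num_sparse : frames spread evenly over the whole video for global context
    window     : half-width, in frames, of the dense region (an assumption)
    """
    n = len(scores)
    if n == 0:
        return []

    # The most relevant frame acts as the center of the key moment.
    anchor = max(range(n), key=lambda i: scores[i])
    lo, hi = max(0, anchor - window), min(n - 1, anchor + window)

    # Dense part: the highest-scoring frames inside the window.
    in_window = sorted(range(lo, hi + 1), key=lambda i: scores[i], reverse=True)
    dense = in_window[:num_dense]

    # Sparse part: evenly spaced frames across the full sequence.
    if num_sparse <= 0:
        sparse = []
    elif num_sparse == 1:
        sparse = [n // 2]
    else:
        sparse = [(j * (n - 1)) // (num_sparse - 1) for j in range(num_sparse)]

    # Merge, drop duplicates, and restore temporal order.
    return sorted(set(dense) | set(sparse))


if __name__ == "__main__":
    scores = [0.1] * 20
    scores[11], scores[12], scores[13] = 0.8, 0.9, 0.7
    print(moment_centric_sample(scores, num_dense=3, num_sparse=3, window=2))
```

With the toy scores above, the selected frames cluster around index 12 while the first, middle, and last regions of the video each still contribute one frame.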

Paper Structure

This paper contains 37 sections, 8 equations, 12 figures, 18 tables, 1 algorithm.

Figures (12)

  • Figure 1: Illustration of Moment-Centric Sampling. (a) Two example expressions with 10 selected video frames and GT masks. (b) Comparison of sampling strategies: handcrafted, external keyframe–based, and our proposed Moment-Centric Sampling (MCS). MCS performs dense sampling at critical moments and sparse sampling elsewhere, without relying on external keyframe models. (c) MomentSeg achieves superior performance across RefVOS, TSG, and QA tasks compared with other LMM-based methods.
  • Figure 2: Impact of Sampling Strategies. We evaluate various sampling strategies for the RefVOS task on four datasets. Five independent runs of random sampling show high variance, highlighting the need for a robust sampling mechanism. Our proposed MCS consistently outperforms alternative methods across all datasets.
  • Figure 3: Training framework of the proposed MomentSeg model. In the TSG paradigm, we employ the Qwen2.5-VL video encoder with low-resolution image inputs. We further introduce a [FIND] token trained under a contrastive learning scheme. In the RefVOS paradigm, both low-resolution uniformly sampled frames and high-resolution continuously sampled frames are used as inputs. Supervision is provided via the [SEG] token, with segmentation masks generated by the SAM2 decoder.
  • Figure 4: Inference framework of the proposed MomentSeg model. First, we use dense-sampled video frames to find the frame-query similarity distribution related to the description, then apply a Moment-Centric Sampling (MCS) to select key frames from the sequence. These key frames are input to the model, along with the RefVOS instructions, to perform the RefVOS task. Finally, we enhance segmentation robustness through Bidirectional Anchor-updated Propagation (BAP).
  • Figure 5: Qualitative results of MomentSeg on TSG and RefVOS tasks. The figure displays the input expression, frame-query similarity distribution, sampled frames, and the predicted segmentation masks for TSG and RefVOS. (a) illustrates an example for the referring task, while (b) presents an example for the reasoning task.
  • ...and 7 more figures
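
The propagation order described for BAP in the Figure 4 caption and the abstract can be sketched as a visiting schedule. This is a hedged illustration rather than the paper's algorithm: the function `bap_schedule`, the mode labels `"anchor"` and `"propagate"`, and the assumption that every sampled frame refreshes the mask are hypothetical. Actual mask prediction and tracking are not modeled here; the sketch only shows that propagation starts at the most relevant frame, runs in both temporal directions, and is re-anchored at sampled frames so that errors do not accumulate over long spans.

```python
def bap_schedule(num_frames, anchor, keyframes):
    """Return the order in which frames are visited and how each mask is obtained.

    Each entry is (frame_index, mode). Mode "anchor" marks a sampled frame whose
    mask would come directly from the segmentation model (assumed to refresh the
    tracking state); mode "propagate" marks a frame filled in from its neighbor.
    """
    if not 0 <= anchor < num_frames:
        raise ValueError("anchor must be a valid frame index")

    # The starting frame is always treated as an anchor.
    keys = set(keyframes) | {anchor}

    def mode(i):
        return "anchor" if i in keys else "propagate"

    # Forward pass: from the starting anchor to the end of the video.
    forward = [(i, mode(i)) for i in range(anchor, num_frames)]
    # Backward pass: from just before the starting anchor to the first frame.
    backward = [(i, mode(i)) for i in range(anchor - 1, -1, -1)]
    return forward + backward


if __name__ == "__main__":
    for frame, how in bap_schedule(num_frames=6, anchor=3, keyframes=[1, 5]):
        print(frame, how)
```

In this toy run the schedule starts at frame 3, moves forward to frame 5 (re-anchored there), then moves backward through frame 2 to frame 0, re-anchoring at frame 1.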