Re-Prompting SAM 3 via Object Retrieval: 3rd of the 5th PVUW MOSE Track

Mingqi Gao, Sijie Li, Jungong Han

Abstract

This technical report explores the MOSEv2 track of the PVUW 2026 Challenge, which targets complex semi-supervised video object segmentation. Built on SAM 3, we develop an automatic re-prompting framework to improve robustness under target disappearance and reappearance, severe transformation, and strong same-category distractors. Our method first applies the SAM 3 detector to later frames to identify same-category object candidates, and then performs DINOv3-based object-level matching with a transformation-aware target feature pool to retrieve reliable target anchors. These anchors are injected back into the SAM 3 tracker together with the first-frame mask, enabling multi-anchor propagation rather than relying solely on the initial prompt. This simple design directly addresses several core challenges of MOSEv2. Our solution achieves a J&F of 51.17% on the test set, ranking 3rd in the MOSEv2 track.

Paper Structure (12 sections, 2 figures, 1 table)

Figures (2)

  • Figure 1: Overview of our two-stage framework. In the first stage, given the first-frame ground-truth mask as the visual prompt, we apply the SAM 3 detector to all remaining video frames to identify same-category object candidates, visualized by colored masks. Zoomed-in views are included for clearer visualization, as the target objects are often small. The detected regions are then encoded by DINOv3 and matched against a transformation-aware target feature pool constructed from the first-frame target, and a few high-confidence anchors are selected from later frames. In the second stage, the first-frame mask and the selected anchor masks are used to re-prompt the SAM 3 tracker for mask propagation.
  • Figure 2: Qualitative comparison between first-frame-only propagation and our re-prompting strategy. The percentages indicate the temporal progress of each frame within the video. For each example, row (a) shows the original video frames, where the red box in the first frame marks the target object. Row (b) shows the result without re-prompting, where only the first-frame mask is used for propagation. Row (c) shows the result of our method.
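
The anchor retrieval step described in Figure 1 can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vectors stand in for DINOv3 embeddings of the detected regions, and the function name `select_anchors`, the cosine-similarity scoring, the max-over-pool rule, the `threshold` and `max_anchors` values, and the one-anchor-per-frame constraint are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def select_anchors(pool, candidates, frame_ids, threshold=0.8, max_anchors=3):
    """Pick a few high-confidence anchor candidates (hypothetical helper).

    pool:       (P, D) features of the first-frame target, e.g. one row per
                transformed view (assumed stand-in for the feature pool).
    candidates: (N, D) features of same-category detections in later frames.
    frame_ids:  length-N sequence giving the frame index of each candidate.

    Returns a list of (frame_id, candidate_index, score), best first,
    with at most one anchor per frame (an assumption of this sketch).
    """
    pool = l2_normalize(np.asarray(pool, dtype=np.float64))
    cands = l2_normalize(np.asarray(candidates, dtype=np.float64))

    # Cosine similarity of every candidate to every pool entry: shape (N, P).
    sim = cands @ pool.T
    # A candidate's score is its best match over the pool.
    scores = sim.max(axis=1)

    anchors, used_frames = [], set()
    for i in np.argsort(-scores):  # highest score first
        if scores[i] < threshold:
            break  # all remaining candidates score lower
        frame = int(frame_ids[i])
        if frame in used_frames:
            continue
        used_frames.add(frame)
        anchors.append((frame, int(i), float(scores[i])))
        if len(anchors) == max_anchors:
            break
    return anchors


# Toy 2-D features: two pool views and four candidates in frames 5, 5, 9, 12.
pool = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[1.0, 0.05], [1.0, 1.0], [0.0, 2.0], [-1.0, 0.0]]
frame_ids = [5, 5, 9, 12]
anchors = select_anchors(pool, candidates, frame_ids)
```

In this toy input, the candidates in frames 9 and 5 that align closely with a pool entry pass the threshold, while the ambiguous and dissimilar candidates are rejected. The selected masks would then be passed to the tracker together with the first-frame mask, as in the second stage of Figure 1.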