ReferDINO-Plus: 2nd Solution for 4th PVUW MeViS Challenge at CVPR 2025

Tianming Liang, Haichao Jiang, Wei-Shi Zheng, Jian-Fang Hu

TL;DR

To effectively balance performance between single-object and multi-object scenarios, this report introduces a conditional mask fusion strategy that adaptively fuses the masks from ReferDINO and SAM2.

Abstract

Referring Video Object Segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This task has attracted increasing attention in the field of computer vision due to its promising applications in video editing and human-agent interaction. Recently, ReferDINO has demonstrated strong performance on this task by adapting object-level vision-language knowledge from pretrained foundational image models. In this report, we further enhance its capabilities by incorporating the advantages of SAM2 in mask quality and object consistency. In addition, to effectively balance performance between single-object and multi-object scenarios, we introduce a conditional mask fusion strategy that adaptively fuses the masks from ReferDINO and SAM2. Our solution, termed ReferDINO-Plus, achieves 60.43 \(\mathcal{J}\&\mathcal{F}\) on the MeViS test set, securing 2nd place in the MeViS PVUW challenge at CVPR 2025. The code is available at: https://github.com/iSEE-Laboratory/ReferDINO-Plus.

Paper Structure

This paper contains 15 sections, 1 equation, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of our solution, ReferDINO-Plus. Each video-description pair is input into ReferDINO to derive the object masks $M_r$ and the corresponding scores $S_r$ across the frames. We then select the mask with the highest score as the prompt for SAM2, producing refined masks $M_s$. Finally, we fuse the two series of masks through the conditional mask fusion strategy. Best viewed in color.
  • Figure 2: Visualization results of our solution on the MeViS test set.
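
The pipeline described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It assumes the ReferDINO masks $M_r$ and the SAM2 masks $M_s$ are boolean arrays of shape (T, H, W) and the scores $S_r$ form a length-T vector. The function names, the `area_ratio` parameter, and the area-based fusion condition are hypothetical placeholders: the caption does not state the actual condition, and the SAM2 call itself is omitted here.

```python
import numpy as np


def select_prompt_frame(masks_r, scores_r):
    """Pick the frame whose ReferDINO mask has the highest score.

    Returns the frame index and that frame's mask, which would serve
    as the mask prompt handed to SAM2.
    """
    t = int(np.argmax(scores_r))
    return t, masks_r[t]


def conditional_fuse(masks_r, masks_s, area_ratio=1.5):
    """Fuse two mask series frame by frame under a condition.

    ASSUMPTION (not taken from the report): when the ReferDINO mask is
    much larger than the SAM2 mask, the frame is treated as a possible
    multi-object case and the union is kept; otherwise the SAM2 mask
    is kept unchanged.
    """
    num_frames = masks_r.shape[0]
    area_r = masks_r.reshape(num_frames, -1).sum(axis=1)
    area_s = masks_s.reshape(num_frames, -1).sum(axis=1)
    use_union = area_r > area_ratio * area_s
    # Broadcast the per-frame decision over the spatial dimensions.
    return np.where(use_union[:, None, None], masks_r | masks_s, masks_s)


# Toy inputs: 3 frames of 4x4 boolean masks.
demo_masks_r = np.zeros((3, 4, 4), dtype=bool)
demo_masks_r[0, :2, :2] = True
demo_masks_r[1, :, :] = True
demo_masks_r[2, :1, :1] = True
demo_scores_r = np.array([0.2, 0.9, 0.5])

demo_masks_s = np.zeros((3, 4, 4), dtype=bool)
demo_masks_s[:, :2, :2] = True

prompt_index, prompt_mask = select_prompt_frame(demo_masks_r, demo_scores_r)
demo_fused = conditional_fuse(demo_masks_r, demo_masks_s)
```

Only the frame where the ReferDINO mask is substantially larger than the SAM2 mask falls back to the union; any real threshold or condition should be taken from the report or its released code.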