1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation

Mingqi Gao, Jingnan Luo, Jinyu Yang, Jungong Han, Feng Zheng

TL;DR

This technical report investigates and validates the effectiveness of static-dominant data and frame sampling for Motion Expression guided Video Segmentation (MeViS). The solution achieves a J&F score of 0.5447 in the competition phase and ranks 1st in the MeViS track of the PVUW Challenge.

Abstract

Motion Expression guided Video Segmentation (MeViS) is an emerging task that poses many new challenges to the field of referring video object segmentation (RVOS). In this technical report, we investigate and validate the effectiveness of static-dominant data and frame sampling in this challenging setting. Our solution achieves a J&F score of 0.5447 in the competition phase and ranks 1st in the MeViS track of the PVUW Challenge. The code is available at: https://github.com/Tapall-AI/MeViS_Track_Solution_2024.

Paper Structure (13 sections, 5 figures, 3 tables)


Figures (5)

  • Figure 1: Overview of our solution. Given an input video, we divide all frames into $N$ subsets via non-continuous sampling. Two subsets are shown here as an example, marked with blue and green boxes. Each subset is segmented individually, guided by the input text, and the per-subset results are combined into the final prediction.
  • Figure 2: Difference between (a) non-continuous sampling, (b) continuous sampling, and (c) no sampling. Each box denotes one frame, and boxes of the same colour are sampled together as one pseudo video for referring segmentation. Note that the best object trajectory is selected individually within each sampled video. For 'no sampling', we still divide videos into subsets and predict masks on them, but during selection we feed all object queries into the temporal modules and use the resulting probabilities to select the best mask trajectory.
  • Figure 3: Qualitative predictions of our solution on the MeViS valid set. Orange and green masks are the predictions guided by the texts of the same colour. The percentage indicates the position of the corresponding frame in the video.
  • Figure 4: Qualitative ablations on training data. The percentage indicates the position of the corresponding frame in the video.
  • Figure 5: Qualitative ablations on subset video length. The percentage indicates the position of the corresponding frame in the video.
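
The sampling schemes contrasted in Figures 1 and 2 can be illustrated with a minimal Python sketch. It is not taken from the released code: the function names are hypothetical, and it assumes that non-continuous sampling means interleaving frames with a stride of $N$ and that continuous sampling means splitting the video into $N$ contiguous chunks. The captions do not state the exact scheme, so treat both as assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def non_continuous_subsets(num_frames: int, n: int) -> List[List[int]]:
    """Assumed interleaved scheme: subset k holds frames k, k+n, k+2n, ..."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [list(range(k, num_frames, n)) for k in range(n)]


def continuous_subsets(num_frames: int, n: int) -> List[List[int]]:
    """Assumed contiguous scheme: n consecutive chunks of near-equal length."""
    if n < 1:
        raise ValueError("n must be at least 1")
    base, extra = divmod(num_frames, n)
    subsets, start = [], 0
    for k in range(n):
        # The first `extra` chunks get one additional frame.
        length = base + (1 if k < extra else 0)
        subsets.append(list(range(start, start + length)))
        start += length
    return subsets


def merge_by_frame(
    subsets: Sequence[Sequence[int]],
    per_subset_outputs: Sequence[Sequence[T]],
) -> List[T]:
    """Put per-subset outputs (e.g. masks) back into original frame order."""
    merged = {}
    for indices, outputs in zip(subsets, per_subset_outputs):
        for frame_idx, out in zip(indices, outputs):
            merged[frame_idx] = out
    return [merged[i] for i in sorted(merged)]


if __name__ == "__main__":
    subsets = non_continuous_subsets(6, 2)
    print(subsets)
    # Stand-in for per-subset segmentation: label each frame with its index.
    fake_masks = [[f"mask_{i}" for i in s] for s in subsets]
    print(merge_by_frame(subsets, fake_masks))
```

Under the interleaved assumption, every subset spans the full duration of the video, whereas each contiguous chunk only sees a local time window; `merge_by_frame` corresponds to the step in Figure 1 where the per-subset results are combined.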