2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation

Biao Wu, Diankai Zhang, Si Gao, Chengjian Zheng, Shaoli Liu, Ning Wang

TL;DR

This paper tackles Video Panoptic Segmentation by leveraging a robust, decoupled VPS framework (DVIS++) and augmenting it with an image semantic segmentation backbone (ViT-Adapter) to strengthen semantic labeling. The authors employ a three-stage training pipeline and an ensemble strategy to align sequence labels across frames, achieving VPQ scores of 56.36 (development) and 57.12 (test), ranking 2nd in PVUW 2024. Key contributions include integrating DVIS++ with strong semantic segmentation, and demonstrating performance gains in both segmentation and long-term tracking on real-world VPS benchmarks. The approach has practical impact for video understanding tasks in domains like autonomous driving and video analytics, where reliable panoptic labeling across time is crucial.

Abstract

Video Panoptic Segmentation (VPS) is a challenging task that extends image panoptic segmentation to video. VPS aims to simultaneously classify, track, and segment all objects in a video, including both things and stuff. It has wide application in downstream tasks such as video understanding, video editing, and autonomous driving. To handle video panoptic segmentation in the wild, we propose a robust integrated video panoptic segmentation solution. We use the DVIS++ framework as our baseline to generate the initial masks. Then, we add an additional image semantic segmentation model to further improve performance on semantic classes. Our method achieves state-of-the-art performance with VPQ scores of 56.36 and 57.12 in the development and test phases, respectively, and ultimately ranked 2nd in the VPS track of the PVUW Challenge at CVPR 2024.

Paper Structure

This paper contains 11 sections, 1 equation, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Architecture of DVIS++ [zhang2023dvisplus].
  • Figure 2: Architecture of ViT-Adapter [chen2022vitadapter].
  • Figure 3: Qualitative results of our method on the VIPSeg test set.