
Semi-supervised Video Semantic Segmentation Using Unreliable Pseudo Labels for PVUW2024

Biao Wu, Diankai Zhang, Si Gao, Chengjian Zheng, Shaoli Liu, Ning Wang

TL;DR

This work tackles pixel-level video semantic segmentation under semi-supervised settings by leveraging unreliable pseudo labels. It employs a transformer-based teacher–student framework (One-peace as teacher, ViT-Adapter as student) with entropy-based pseudo-label filtering and a contrastive loss that uses uncertain pixels as negatives. With multi-scale testing, ensemble pseudo-label generation, and data augmentation, the method reaches mIoU scores of 63.71% on the development test and 67.83% on the final test of PVUW2024, securing 1st place in the VSS track. The approach shows how careful handling of unreliable predictions, combined with model ensembles, can reduce labeling requirements while delivering robust video segmentation performance.

Abstract

Pixel-level scene understanding is one of the fundamental problems in computer vision; it aims to recognize the object class, mask, and semantics of each pixel in a given image. Compared with image scene parsing, video scene parsing introduces temporal information, which can effectively improve the consistency and accuracy of predictions, because the real world is video-based rather than static. In this paper, we adopt a semi-supervised video semantic segmentation method based on unreliable pseudo labels. We then ensemble the teacher network with the student network to generate pseudo labels and retrain the student network. Our method achieves mIoU scores of 63.71% and 67.83% on the development test and final test, respectively. We obtained 1st place in the Video Scene Parsing in the Wild Challenge at CVPR 2024.
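
As a rough illustration of the ideas summarized above (ensembling teacher and student predictions, then separating reliable from unreliable pixels by entropy), the following is a minimal sketch and not the authors' implementation. The function names, the equal-weight averaging, and the fixed entropy threshold are all assumptions made for this example; the summary only states that the models are ensembled and that filtering is entropy-based.

```python
import math


def pixel_entropy(probs):
    """Shannon entropy (in nats) of one pixel's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ensemble_probs(teacher, student, teacher_weight=0.5):
    """Weighted average of per-pixel class distributions.

    Assumption: a simple weighted mean; the source does not specify
    how the teacher and student outputs are combined.
    """
    w = teacher_weight
    return [
        [w * t + (1.0 - w) * s for t, s in zip(t_probs, s_probs)]
        for t_probs, s_probs in zip(teacher, student)
    ]


def split_pseudo_labels(probs_per_pixel, entropy_threshold):
    """Return (labels, reliable) for a flat list of pixels.

    labels[i] is the argmax class of pixel i; reliable[i] is True when
    the pixel's entropy is below the (hypothetical) threshold. Pixels
    flagged unreliable would not be used as hard labels; in the method
    described above they serve as negatives in a contrastive loss.
    """
    labels, reliable = [], []
    for probs in probs_per_pixel:
        labels.append(max(range(len(probs)), key=probs.__getitem__))
        reliable.append(pixel_entropy(probs) < entropy_threshold)
    return labels, reliable


# Two pixels, three classes: one confident, one ambiguous.
teacher = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25]]
student = [[0.80, 0.10, 0.10], [0.30, 0.40, 0.30]]
labels, reliable = split_pseudo_labels(ensemble_probs(teacher, student), 0.8)
print(labels, reliable)
```

In this toy input the first pixel has low entropy and is kept as a reliable pseudo label, while the second is close to uniform and is flagged as unreliable. A fixed threshold is used here only for brevity; a real pipeline might instead derive it from a percentile of the entropy distribution.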

Paper Structure

This paper contains 14 sections, 6 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: The overall architecture of our method.
  • Figure 2: Qualitative results of our method on the VSPW (Miao et al., 2021) test set.