Exploiting Minority Pseudo-Labels for Semi-Supervised Fine-grained Road Scene Understanding

Yuting Hong, Yongkang Wu, Hui Xiao, Huazheng Hao, Xiaojie Qiu, Baochen Yao, Chengbin Peng

TL;DR

The paper tackles semi-supervised semantic segmentation for fine-grained road scenes, where long-tailed class distributions leave minority classes underrepresented. It introduces STPG, a framework that combines a professional module focused on minority pseudo-labels with a general module that learns from all pseudo-labels, augmented by anchor-based contrastive learning that distributes class representations evenly in the feature space. A mismatch-score-driven selection of minority pseudo-labels further improves learning of hard classes, while cross-guidance between the two modules reduces model coupling. Empirically, STPG yields strong gains on Cityscapes, CamVid, and PASCAL VOC 2012, with substantial improvements for tail classes and robust performance under limited labeled data, indicating practical value for autonomous driving perception.

Abstract

In fine-grained road scene understanding, semantic segmentation plays a crucial role in enabling vehicles to perceive and comprehend their surroundings. By assigning a class label to each pixel in an image, it allows precise identification and localization of detailed road features, which is vital for high-quality scene understanding and downstream perception tasks. A key challenge in this domain is improving the recognition of minority classes while mitigating the dominance of majority classes, which is essential for balanced and robust overall performance. However, traditional semi-supervised learning methods often overlook the imbalance between classes during training. To address this issue, we first propose a general training module that learns from all pseudo-labels without a conventional filtering strategy. Second, we propose a professional training module that learns specifically from reliable minority-class pseudo-labels identified by a novel mismatch score metric. The two modules supervise each other, which reduces model coupling, an essential property for semi-supervised learning. During contrastive learning, to prevent the majority classes from dominating the feature space, we propose a strategy that assigns evenly distributed anchors to the different classes. Experimental results on multiple public benchmarks show that our method surpasses traditional approaches in recognizing tail classes.

Paper Structure

This paper contains 13 sections, 16 equations, 12 figures, and 5 tables.

Figures (12)

  • Figure 1: Number of pixels in each class in Cityscapes dataset. The Cityscapes dataset exhibits a significant class imbalance, as reflected in the distribution of finely annotated pixels across different classes. The number of pixels per class, represented on the y-axis, varies widely, with certain dominant classes occupying the majority of the dataset's pixel annotations, while other less prevalent classes account for only a small fraction.
  • Figure 2: Overview of our framework. There are two training parts, professional and general, illustrated in parts (a) and (b). In the professional training module, Pro-Student, which focuses on minority classes, is supervised by selected pseudo-labels; in the general training module, Gen-Student is supervised by all pseudo-labels. Teachers are updated via Exponential Moving Average (EMA) [ke2019dual] to maintain a smoothed version of past student model parameters. Anchor contrastive learning is designed to foster more evenly distributed decision boundaries when training labels from different classes are imbalanced. The supervised learning from labeled data is omitted for simplicity. The network architecture of each student or teacher model is illustrated in part (c).
  • Figure 3: Illustration of the pixel selection strategy. $\mathbf{I}(\cdot)[i,j]$ represents the predicted class for the pixel located at coordinates $(i,j)$. (a) When the two predictions for a pixel are consistent, the pseudo-label is considered high-quality and is used to train Pro-Student. (b) When the two predictions for a pixel are inconsistent, and the mismatch score of the class predicted by Gen-Teacher is larger than that of the class predicted by Pro-Student, the pixel usually contains more minority-class information and can be used for training Pro-Student. (c) Otherwise, the pseudo-label is not used for training Pro-Student.
  • Figure 4: Illustration of the confusion matrix. For each batch, we compute the confusion matrix between the predictions of Gen-Teacher and those of Pro-Student to obtain a mismatch score for each class. For example, $m_{p,q}$ is the number of pixels where Pro-Student's prediction is Class $p$ and Gen-Teacher's prediction is Class $q$ ($p, q \in \{1, 2, \ldots, C\}$). The proportions of mismatched predictions in the orange boxes and yellow boxes indicate the mismatch scores for Class $p$ and Class $q$, respectively.
  • Figure 5: Illustration of anchor contrastive learning. We generate anchors that are evenly distributed in the feature space and perform one-to-one matching with class prototypes. During training, we sample a large number of features near the anchors and combine them with the anchors for contrastive learning.
  • ...and 7 more figures
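
The pixel selection described in the captions of Figures 3 and 4 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The captions do not give the exact mismatch score formula, so the sketch assumes that the score of class $c$ is the share of pixels predicted as $c$ by either model (row $c$ and column $c$ of the confusion matrix) on which the two models disagree. The function names `confusion_matrix`, `mismatch_scores`, and `select_pixels` are hypothetical, and classes are indexed from 0 rather than 1.

```python
import numpy as np


def confusion_matrix(pro_pred, gen_pred, num_classes):
    """m[p, q] = number of pixels where Pro-Student predicts p and Gen-Teacher predicts q."""
    # Encode each (p, q) pair as a single integer, then count occurrences.
    idx = pro_pred.ravel() * num_classes + gen_pred.ravel()
    counts = np.bincount(idx, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)


def mismatch_scores(m):
    """ASSUMED definition: fraction of pixels involving class c on which the models disagree."""
    diag = np.diag(m)
    # Pixels in row c or column c; the diagonal entry is counted once.
    involved = m.sum(axis=0) + m.sum(axis=1) - diag
    disagree = involved - diag
    scores = np.zeros(len(diag), dtype=float)
    # Classes absent from the batch keep a score of 0.
    np.divide(disagree, involved, out=scores, where=involved > 0)
    return scores


def select_pixels(pro_pred, gen_pred, scores):
    """Boolean mask of pixels whose Gen-Teacher pseudo-label is used to train Pro-Student."""
    agree = pro_pred == gen_pred                        # case (a): consistent predictions
    higher = scores[gen_pred] > scores[pro_pred]        # case (b): Gen-Teacher class mismatches more
    return agree | (~agree & higher)                    # case (c): everything else is dropped


# Toy batch: one 2x3 prediction map per model, three classes.
pro = np.array([[0, 0, 1], [0, 2, 1]])
gen = np.array([[0, 1, 1], [0, 2, 0]])
m = confusion_matrix(pro, gen, num_classes=3)
scores = mismatch_scores(m)
mask = select_pixels(pro, gen, scores)
```

In the toy batch, the two models disagree on two pixels. Only the one where Gen-Teacher predicts the class with the higher mismatch score is kept, which is how case (b) favors classes the models tend to confuse. With real networks, `pro` and `gen` would be the per-pixel argmax of each model's output on the same unlabeled batch.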