PAT: Pixel-wise Adaptive Training for Long-tailed Segmentation

Khoi Do, Duong Nguyen, Nguyen H. Tran, Viet Dung Nguyen

TL;DR

The proposed Pixel-wise Adaptive Training technique tackles the detrimental impact of both rare classes within the long-tailed distribution and inaccurate predictions from previous training stages by encouraging learning classes with low prediction confidence and guarding against forgetting classes with high confidence.

Abstract

Beyond class frequency, we recognize the impact of class-wise relationships among various class-specific predictions and the imbalance in label masks on long-tailed segmentation learning. To address these challenges, we propose an innovative Pixel-wise Adaptive Training (PAT) technique tailored for long-tailed segmentation. PAT has two key features: 1) class-wise gradient magnitude homogenization, and 2) pixel-wise class-specific loss adaptation (PCLA). First, the class-wise gradient magnitude homogenization helps alleviate the imbalance among label masks by ensuring equal consideration of the class-wise impact on model updates. Second, PCLA tackles the detrimental impact of both rare classes within the long-tailed distribution and inaccurate predictions from previous training stages by encouraging learning classes with low prediction confidence and guarding against forgetting classes with high confidence. This combined approach fosters robust learning while preventing the model from forgetting previously learned knowledge. PAT exhibits significant performance improvements, surpassing the current state-of-the-art by 2.2% on the NYU dataset. Moreover, it enhances overall pixel-wise accuracy by 2.85% and intersection over union (IoU) by 2.07%, with a particularly notable decline of 0.39% in detecting rare classes compared to Balance Logits Variation, as demonstrated on three popular datasets, i.e., OxfordPetIII, CityScapes, and NYU.
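
To make the "imbalance in label masks" concrete, the following is a minimal sketch of how mask sizes could be measured by counting pixels per class in a ground-truth label mask. It is not taken from the paper: the helper names `mask_sizes` and `imbalance_ratio`, the toy mask, and the class ids (0 = road, 1 = vegetation, 2 = car) are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def mask_sizes(label_mask):
    """Count pixels per class id in a 2-D label mask given as a list of rows."""
    counts = Counter()
    for row in label_mask:
        counts.update(row)
    return dict(counts)


def imbalance_ratio(sizes):
    """Ratio of the largest mask size to the smallest non-empty mask size."""
    present = [n for n in sizes.values() if n > 0]
    return max(present) / min(present)


# Hypothetical 4x6 label mask: 0 = road, 1 = vegetation, 2 = car.
toy_mask = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 2, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

sizes = mask_sizes(toy_mask)
ratio = imbalance_ratio(sizes)
```

In this toy mask the road class covers twelve times as many pixels as the car class, so an unweighted pixel-wise loss would be dominated by road pixels. This is the kind of imbalance the class-wise homogenization is meant to counter.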

Paper Structure (18 sections, 5 equations, 5 figures, 6 tables)

This paper contains 18 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Quantitative analysis of the imbalance in mask size among classes. The vertical axis shows the mask size, measured as the total number of pixels. The horizontal axis shows the different masks that can appear in the ground truth. (a) While road and vegetation masks take 50,000 and 60,000 pixels, respectively, cars account for around 1,000 pixels.
  • Figure 2: Overall methodology. 1) Training procedure: an image $x_i$ is fed into an encoder-decoder architecture to produce logits $\hat{x}_i$. Subsequently, $\hat{x}_i$ is adjusted to create a weight tensor of the same size as $\hat{x}_i$. The normalized $\hat{x}_i$ is multiplied by the weight tensor to equalize the contribution of each logit to the PAT loss function. 2) Logits adjustment: the logits vector is normalized by softmax, shifted by a tensor of $-1$, and scaled by the exponential function to obtain the inverse dominant coefficients $\beta_{i,j}$. Then, $\beta_{i,j}$ is normalized to $[0, 1]$ to form the weight tensor.
  • Figure 3: The logit-adjustment panel illustrates how PAT adjusts the logit values to tackle the imbalance in dominant probability caused by categories with large masks. The adaptive-gradient-scaling panel shows the process of adaptive gradient scaling in PAT: channels with no mask can easily be adapted, so the problem of adaptive gradient scaling reduces to two cases. The comparison panel shows the difference in scaling coefficient between PAT and Focal loss: PAT (smooth lines) (i) puts a higher weight on low-confidence pixels and (ii) keeps low scaling coefficients for high-confidence pixels. In contrast, Focal loss (dashed line) applies zero scaling to well-classified pixels, which may cause forgetting of frequent or large-mask categories.
  • Figure 4: Segmentation visualization of models trained by the PAT and other baselines on the CityScapes dataset.
  • Figure 5: Performance comparison between the baselines and our proposed method on three datasets: OxfordPetIII, CityScapes, and NYU. The metrics are training time (seconds/epoch), average memory acquisition in gigabytes (GB), and GPU utilization (%).
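
The captions of Figures 2 and 3 describe a per-pixel weight built from softmax probabilities and contrast it with the Focal loss scaling. The sketch below illustrates that contrast for a single pixel. It rests on assumptions not confirmed by the captions: they do not fix the sign convention or the normalization of $\beta_{i,j}$, so `pat_style_weight` assumes $\beta = \exp(1 - p)$ (the reciprocal of $\exp(p - 1)$) divided by $e$, which keeps weights in $[1/e, 1]$. All function names are hypothetical. `focal_factor` is the standard Focal loss modulating factor $(1 - p)^{\gamma}$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_factor(p, gamma=2.0):
    """Standard Focal loss modulating factor; reaches 0 when p == 1."""
    return (1.0 - p) ** gamma


def pat_style_weight(p):
    """ASSUMED form of the PAT weight, not the authors' exact formula.

    Takes the reciprocal of exp(p - 1) and divides by e, so the weight
    decreases with confidence p but never drops below 1/e.
    """
    return math.exp(1.0 - p) / math.e


def pixel_weights(logits):
    """Per-class weights for one pixel under the assumed PAT-style rule."""
    return [pat_style_weight(p) for p in softmax(logits)]


# Hypothetical logits for one pixel with a confidently predicted first class.
weights = pixel_weights([4.0, 0.0, 0.0])
```

Under these assumptions the confident class receives the smallest weight, yet that weight stays above zero, whereas `focal_factor(1.0)` is exactly zero. This mirrors the behavior described for Figure 3, where well-classified pixels keep a small but non-vanishing contribution.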