Soft labelling for semantic segmentation: Bringing coherence to label down-sampling

Roberto Alcover-Couso, Marcos Escudero-Vinolo, Juan C. SanMiguel, Jose M. Martinez

TL;DR

The paper tackles the problem of down-sampling in semantic segmentation causing misalignment between colour and label information, especially at high down-sampling factors. It introduces soft-labels for label down-sampling and pairs them with the colour sampling to preserve information and align distributions, formalized through one-hot label encodings and region-based soft-label computation. The authors present extensive experiments across Cityscapes, Mapillary, and ADE20K showing that paired soft-label down-sampling yields higher mean IoU than standard baselines while using far fewer resources, often matching or exceeding state-of-the-art results on constrained hardware. This approach enables competitive semantic segmentation in budget-constrained settings and opens pathways for further improvements via soft-colour encoding and broader dataset evaluation.

Abstract

In semantic segmentation, training data is commonly down-sampled because of limited resources, to adapt the image size to the model input, or to improve data augmentation. This down-sampling typically employs different strategies for the image data and the annotated labels. This discrepancy leads to mismatches between the down-sampled color and label images, so training performance decreases significantly as the down-sampling factor increases. In this paper, we bring together the down-sampling strategies for the image data and the training labels. To that aim, we propose a novel framework for label down-sampling via soft-labeling that better conserves label information after down-sampling. The soft-labels are thereby fully aligned with the image data, which preserves the distribution of the sampled pixels. The proposal also produces reliable annotations for under-represented semantic classes. Altogether, it allows training competitive models at lower resolutions. Experiments show that the proposal outperforms other down-sampling strategies. Moreover, state-of-the-art performance is achieved on reference benchmarks while employing significantly fewer computational resources than leading approaches. This proposal enables competitive research on semantic segmentation under resource constraints.
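
To make the idea of label down-sampling via soft-labeling concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a soft-label is the per-class fraction of one-hot encoded pixels inside each non-overlapping `factor x factor` region (plain area averaging); the paper pairs the label sampling with the colour sampling, which may use a different kernel. The names `IGNORE`, `one_hot`, `soft_label_downsample` and `nearest_downsample` are hypothetical.

```python
import numpy as np

IGNORE = 255  # assumed id for unannotated pixels (hypothetical choice)


def one_hot(labels, num_classes, ignore_index=IGNORE):
    """Encode an (H, W) integer label map as (H, W, C); ignored pixels are all-zero."""
    valid = labels != ignore_index
    safe = np.where(valid, labels, 0)
    encoded = np.eye(num_classes, dtype=np.float32)[safe]
    encoded[~valid] = 0.0
    return encoded


def soft_label_downsample(labels, num_classes, factor, ignore_index=IGNORE):
    """Average one-hot labels over non-overlapping factor x factor regions."""
    h, w = labels.shape
    assert h % factor == 0 and w % factor == 0, "sketch assumes divisible sizes"
    enc = one_hot(labels, num_classes, ignore_index)
    blocks = enc.reshape(h // factor, factor, w // factor, factor, num_classes)
    counts = blocks.sum(axis=(1, 3))            # per-class pixel counts per region
    total = counts.sum(axis=-1, keepdims=True)  # annotated pixels per region
    # Regions with no annotated pixel stay all-zero.
    return np.divide(counts, total, out=np.zeros_like(counts), where=total > 0)


def nearest_downsample(labels, factor):
    """Nearest Neighbour baseline: keep one label per region."""
    return labels[::factor, ::factor]


labels = np.array([[0, 0, 1, 1],
                   [0, 2, 1, 1],
                   [0, 0, 1, 1],
                   [0, 0, 1, 2]])
soft = soft_label_downsample(labels, num_classes=3, factor=2)
hard = nearest_downsample(labels, factor=2)
```

In this toy example the small class `2` occupies one pixel in two of the regions. The Nearest Neighbour result `hard` drops it entirely, whereas `soft` keeps a weight of 0.25 for it in those regions, which illustrates how soft-labels can retain information about under-represented classes.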

Paper Structure

This paper contains 25 sections, 5 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Graphical comparison of performance and resolution of prevalent models trained on the Cityscapes dataset Cordts2016Cityscapes. The size of each circle represents the GPU memory required for training. Dashed lines connect identically configured models trained at different resolutions. For Deeplab and HRNet we consider multiple frameworks, which are referenced by their bibliography index. Alternative architectures are specified by name.
  • Figure 2: Per-pixel cross-entropy loss maps for the HRNetV2 model trained with Nearest Neighbour down-sampling 9052469 (top). Loss is normalised to the [0,1] range. During learning, missing an entire structure is penalised the same as shifting an edge, as can be observed in the highlighted areas by comparing the loss values where the output (bottom, left) and the ground truth (bottom, right) segmentation maps differ. Note that black in the ground truth marks pixels without annotation, which are discarded during training.
  • Figure 3: Per-class visual comparison between Nearest Neighbour (blue) and our soft-label (blue and yellow) after a $\frac{1}{8}$ down-sampling of the label image in Figure 2. Note the creation of jagged edges (step-like borders of blue areas) and gaps (discontinuities in blue areas).
  • Figure 4: Visual comparison of the information conserved by down-sampling strategies for a Cityscapes image where $C=19$ classes and the down-sampling factor is $\gamma = (1/32, 1/32)$. Top-row image shows the original-resolution labels. The middle row represents the down-sampled labels for our proposal (right) and the Nearest Neighbour (left). The bottom row depicts the distribution of present semantic classes of the highlighted region $\Omega$ in the original label image (blue) and the ones obtained for the down-sampling of $\Omega$ with Nearest Neighbour (green) and the proposed one (red). Note that $\Omega$ is resized to a single pixel for the down-sampled versions (highlighted in white).
  • Figure 5: Qualitative comparison of our models trained on different input resolutions. First and second rows represent the colour and label images. The third, fourth and fifth rows illustrate the outputs of HRNetV2 models trained with input resolutions of $\frac{1}{2},\frac{1}{4},\frac{1}{8}$ respectively.
  • ...and 6 more figures
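
The per-pixel cross-entropy loss maps described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' code: targets are given as per-class weights of shape `(H, W, C)` (one-hot targets reduce it to the standard cross-entropy), unannotated pixels are assumed to be encoded as all-zero target vectors, and the function names `soft_cross_entropy_map` and `normalise` are hypothetical.

```python
import numpy as np


def soft_cross_entropy_map(logits, targets):
    """Per-pixel cross-entropy between (H, W, C) logits and (H, W, C) targets.

    Returns the (H, W) loss map and a boolean mask of annotated pixels.
    Pixels whose target vector is all-zero are treated as unannotated
    and receive zero loss.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    loss = -(targets * log_probs).sum(axis=-1)
    valid = targets.sum(axis=-1) > 0
    return np.where(valid, loss, 0.0), valid


def normalise(loss_map, valid):
    """Scale the loss map so that the largest annotated-pixel loss is 1."""
    peak = loss_map[valid].max() if valid.any() else 0.0
    return loss_map / peak if peak > 0 else loss_map


# Toy 1 x 3 image with C = 3 classes.
logits = np.array([[[2.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0]]])
targets = np.array([[[1.0, 0.0, 0.0],     # hard label, class 0
                     [0.5, 0.5, 0.0],     # soft label on a class boundary
                     [0.0, 0.0, 0.0]]])   # unannotated pixel
loss_map, valid = soft_cross_entropy_map(logits, targets)
loss_vis = normalise(loss_map, valid)
```

Because the loss is computed independently per pixel, a hard target at a class boundary is penalised as heavily as a hard target inside a missed structure; a soft target spreads the weight over the classes present in the region instead.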