Bayesian Active Learning for Semantic Segmentation

Sima Didari, Wenjun Hu, Jae Oh Woo, Heng Hao, Hankyu Moon, Seungjai Min

TL;DR

The paper tackles the high annotation cost of semantic segmentation by proposing BalEntAcq, a Bayesian active learning framework that uses sparse pixel annotations. It introduces a pixel-wise BalEnt uncertainty measure, approximated via a Beta distribution from last-layer dropout, and an acquisition rule that promotes informative, diverse samples along entropy contours. Empirical results across Cityscapes, CamVid, ADE20K, and VOC2012 show BalEntAcq reaching mIoU close to that of fully supervised training with only a small fraction of labeled pixels, outperforming prior AL methods on multiple backbones. The approach is backbone-agnostic and scalable, and it balances model, data, and posterior uncertainties to enable data-efficient segmentation.

Abstract

Fully supervised training of semantic segmentation models is costly and challenging because each pixel within an image needs to be labeled. Therefore, sparse pixel-level annotation methods have been introduced to train models with a subset of pixels within each image. We introduce a Bayesian active learning framework based on sparse pixel-level annotation that utilizes a pixel-level Bayesian uncertainty measure based on Balanced Entropy (BalEnt) [84]. BalEnt captures the information between the model's predicted marginalized probability distribution and the pixel labels. BalEnt scales linearly, has a closed analytical form, and can be calculated independently per pixel without relational computations with other pixels. We train our proposed active learning framework on the Cityscapes, CamVid, ADE20K and VOC2012 benchmark datasets and show that it reaches supervised levels of mIoU using only a fraction of labeled pixels while outperforming the previous state-of-the-art active learning models by a large margin.

Paper Structure (18 sections, 5 equations, 7 figures, 2 tables, 1 algorithm)


Figures (7)

  • Figure 1: (1) Bayesian semantic segmentation model with last-layer dropout and pixel-based loss. (2) Model's inference output of shape $[W, H, C, m]$, where $C$ and $m$ are the numbers of classes and Monte Carlo samples. Per-pixel probability distributions obtained from the model's Monte Carlo forward passes with dropout are shown as histograms. (3) BalEnt acquisition function. Per-pixel probability distributions are approximated by a Beta distribution, shown as red curves. Per-pixel BalEnt values are calculated, and the $n$ pixels (here $n$=4) with the largest BalEnt values are selected, shown as red dots. (4) Click-based annotation tool.
  • Figure 2: Top row, left to right: Input image, its ground truth, supervised training prediction, BalEntAcq AL prediction, DeepLab, $n$=5. Bottom row: Uncertainty maps from left to right: BalEnt, pBALD, BALD, & $P_{marg}$. Brighter intensity represents higher uncertainty values.
  • Figure 3: Comparison of BalEntAcq AL with existing acquisition functions on ADE20K, VOC2012, Cityscapes, & CamVid (from top to right). DeepLabV3+ MobileNetv2, $n$=10 for ADE20K, $n$=5 for the remaining datasets.
  • Figure 4: The BalEntAcq AL comparison to existing acquisition functions, Cityscapes, CamVid & VOC2012 (from left to right), FPN ResNet50, $n$=5.
  • Figure 5: Normalized epistemic, aleatoric & posterior uncertainties with their values at the first cycle versus the validation dataset mIoU for Cityscapes, CamVid & VOC2012 (from left to right), DeepLabV3+ MobileNetv2, $n$=5, each point represents an AL cycle.
  • ...and 2 more figures
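
The acquisition steps described in the Figure 1 caption (Monte Carlo output of shape $[W, H, C, m]$, a per-pixel Beta approximation, a per-pixel score, and selection of the $n$ highest-scoring pixels) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the Beta fit uses the method of moments (the text above does not say how the approximation is obtained), and because the closed form of BalEnt is not given here, the score is replaced by the entropy of the Monte Carlo averaged class probabilities as a stand-in.

```python
import numpy as np

def beta_moment_fit(samples, eps=1e-6):
    """Method-of-moments Beta fit along the last (Monte Carlo) axis.

    ASSUMPTION: the fitting procedure is illustrative; the source only
    states that per-pixel distributions are approximated by a Beta.
    """
    mean = samples.mean(axis=-1).clip(eps, 1 - eps)
    var = samples.var(axis=-1).clip(eps * eps, None)
    # For Beta(a, b): mean = a / (a + b), var = mean * (1 - mean) / (a + b + 1)
    total = np.maximum(mean * (1 - mean) / var - 1, eps)  # estimate of a + b
    return mean * total, (1 - mean) * total

def marginal_entropy(samples, eps=1e-12):
    """STAND-IN score (not BalEnt): entropy of MC-averaged probabilities."""
    p = samples.mean(axis=-1)                    # shape [W, H, C]
    return -(p * np.log(p + eps)).sum(axis=-1)   # shape [W, H]

def select_top_pixels(score, n):
    """Return (row, col) coordinates of the n highest-scoring pixels."""
    flat_score = score.ravel()
    idx = np.argpartition(flat_score, -n)[-n:]     # unordered top-n
    idx = idx[np.argsort(flat_score[idx])[::-1]]   # highest score first
    return np.stack(np.unravel_index(idx, score.shape), axis=1)

# Synthetic stand-in for m dropout forward passes of a segmentation model.
rng = np.random.default_rng(0)
W, H, C, m = 8, 6, 3, 20
logits = rng.normal(size=(W, H, C, m))
probs = np.exp(logits) / np.exp(logits).sum(axis=2, keepdims=True)

alpha, beta = beta_moment_fit(probs)    # per pixel and class, shape [W, H, C]
score = marginal_entropy(probs)         # per pixel, shape [W, H]
picked = select_top_pixels(score, n=4)  # 4 pixel coordinates, as in Figure 1
```

Each pixel's score depends only on that pixel's own Monte Carlo samples, which mirrors the abstract's point that the measure is computed independently per pixel. Swapping `marginal_entropy` for an actual BalEnt implementation would leave the selection step unchanged.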