A Boundary-Metric Evaluation Protocol for Whiteboard Stroke Segmentation Under Extreme Imbalance

Nicholas Korcynski

TL;DR

This paper contributes an evaluation protocol that jointly examines region metrics, boundary metrics (BF1, B-IoU), a core/thin-subset equity analysis, and per-image robustness statistics (median, IQR, worst-case) under seeded, multi-run training with non-parametric significance testing.

Abstract

Binary segmentation of whiteboard strokes is hindered by extreme class imbalance: stroke pixels constitute only $1.79\%$ of the image on average, and the thin-stroke subset averages only $1.14\% \pm 0.41\%$ foreground. Standard region metrics (F1, IoU) can mask thin-stroke failures because the overwhelming background majority dominates the score. In contrast, adding boundary-aware metrics and a thin-subset equity analysis changes how loss functions rank and exposes hidden trade-offs. We contribute an evaluation protocol that jointly examines region metrics, boundary metrics (BF1, B-IoU), a core/thin-subset equity analysis, and per-image robustness statistics (median, IQR, worst-case) under seeded, multi-run training with non-parametric significance testing. Five losses (cross-entropy, focal, Dice, Dice+focal, and Tversky) are each trained three times on a DeepLabV3-MobileNetV3 model and evaluated on 12 held-out images split into core and thin subsets. Overlap-based losses improve F1 by more than 20 points over cross-entropy ($0.663$ vs $0.438$, $p < 0.001$), and the boundary metrics confirm that the gain extends to contour precision. Adaptive thresholding and Sauvola binarization at native resolution achieve a higher mean F1 ($0.787$ for Sauvola) but substantially worse worst-case performance (F1 $= 0.452$ vs $0.565$ for Tversky), exposing a consistency-accuracy trade-off: classical baselines lead on mean F1, while the learned model delivers higher worst-case reliability. Doubling the training resolution further increases F1 by 12.7 points.
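As an illustration of the per-image robustness statistics named above, the following is a minimal sketch, not the paper's implementation. The function names `f1_score` and `robustness_summary` are hypothetical, the quartile method (`inclusive`) is an assumption because the abstract does not specify one, and the two score lists are made-up numbers chosen only to show how a method can lead on mean F1 while trailing on worst-case F1.

```python
from statistics import mean, median, quantiles


def f1_score(pred, truth):
    """Pixel-level F1 for two flat binary sequences (1 = stroke, 0 = background)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if denom == 0 else 2 * tp / denom


def robustness_summary(per_image_scores):
    """Summarise per-image scores by mean, median, IQR, and worst case."""
    # The 'inclusive' quartile method is an assumption; other methods
    # give slightly different IQR values on small samples.
    q1, _, q3 = quantiles(per_image_scores, n=4, method="inclusive")
    return {
        "mean": mean(per_image_scores),
        "median": median(per_image_scores),
        "iqr": q3 - q1,
        "worst": min(per_image_scores),
    }


# Made-up per-image F1 scores for two hypothetical methods.
steady = [0.60, 0.62, 0.64, 0.66]   # lower mean, tight spread
spiky = [0.30, 0.80, 0.85, 0.90]    # higher mean, one severe failure

summary_steady = robustness_summary(steady)
summary_spiky = robustness_summary(spiky)

for name, s in (("steady", summary_steady), ("spiky", summary_spiky)):
    print(name, {k: round(v, 3) for k, v in s.items()})
```

Reporting the worst-case and IQR alongside the mean is what makes a single badly segmented board visible; a mean-only comparison of these two lists would favour the less reliable method.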

Paper Structure

This paper contains 63 sections, 5 equations, 13 figures, and 7 tables.

Figures (13)

  • Figure 1: Region and boundary metrics per loss function. Dice-based objectives form a clearly separated, higher-performance cluster.
  • Figure 2: Core vs. thin F1 per loss. Dice-family losses narrow the gap, with Tversky showing the most balanced performance.
  • Figure 3: Core-thin F1 gap per loss. A smaller gap indicates more equitable treatment of thin strokes.
  • Figure 4: Resolution comparison for Dice+Focal. Boundary metrics benefit more than region metrics from higher resolution.
  • Figure 5: Classical baseline failure cases. Rows are the three images where Sauvola binarization scores lowest. Columns: original image, ground truth, Sauvola prediction, Tversky prediction, and error overlays (green = TP, red = FN, blue = FP). The deep model provides more consistent segmentation on challenging boards.
  • ...and 8 more figures