A Boundary-Metric Evaluation Protocol for Whiteboard Stroke Segmentation Under Extreme Imbalance

Nicholas Korcynski

TL;DR

We contribute an evaluation protocol that jointly examines region metrics, boundary metrics (BF1, B-IoU), a core/thin-subset equity analysis, and per-image robustness statistics (median, IQR, worst-case) under seeded, multi-run training with non-parametric significance testing.

Abstract

Binary segmentation of whiteboard strokes is hindered by extreme class imbalance: stroke pixels constitute only $1.79\%$ of the image on average, and in the thin-stroke subset the foreground averages just $1.14\% \pm 0.41\%$. Standard region metrics (F1, IoU) can mask thin-stroke failures because the background, which makes up the vast majority of pixels, dominates the score. In contrast, adding boundary-aware metrics and a thin-subset equity analysis changes how loss functions rank and exposes hidden trade-offs. We contribute an evaluation protocol that jointly examines region metrics, boundary metrics (BF1, B-IoU), a core/thin-subset equity analysis, and per-image robustness statistics (median, IQR, worst-case) under seeded, multi-run training with non-parametric significance testing. Five losses -- cross-entropy, focal, Dice, Dice+focal, and Tversky -- are each trained three times on a DeepLabV3-MobileNetV3 model and evaluated on 12 held-out images split into core and thin subsets. Overlap-based losses improve F1 by more than 20 points over cross-entropy ($0.663$ vs. $0.438$, $p < 0.001$), and the boundary metrics confirm that the gain extends to contour precision. Adaptive thresholding and Sauvola binarization at native resolution achieve a higher mean F1 ($0.787$ for Sauvola) but substantially worse worst-case performance (F1 $= 0.452$ vs. $0.565$ for Tversky), exposing a consistency-accuracy trade-off: classical baselines lead on mean F1, while the learned model delivers higher worst-case reliability. Doubling the training resolution further increases F1 by 12.7 points.
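
To make the distinction between region and boundary scoring concrete, the following is a minimal NumPy sketch of a tolerance-based boundary F1 together with the per-image robustness summary (median, IQR, worst case) named above. It is not the paper's implementation: the 4-neighbour boundary definition, the square tolerance window, the default `tol=2`, and all function names are assumptions made for illustration only.

```python
import numpy as np


def region_f1(pred, gt):
    """Pixel-wise F1 (Dice) between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, gt).sum() / denom


def boundary(mask):
    """Foreground pixels with at least one background 4-neighbour (assumed definition)."""
    mask = mask.astype(bool)
    m = np.pad(mask, 1, constant_values=False)
    interior = (m[1:-1, 1:-1] & m[:-2, 1:-1] & m[2:, 1:-1]
                & m[1:-1, :-2] & m[1:-1, 2:])
    return mask & ~interior


def dilate(mask, radius):
    """Dilate a binary mask with a (2*radius+1)-wide square window."""
    mask = mask.astype(bool)
    h, w = mask.shape
    m = np.pad(mask, radius, constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= m[dy:dy + h, dx:dx + w]
    return out


def boundary_f1(pred, gt, tol=2):
    """F1 over boundary pixels, counting a match within `tol` pixels (assumed tolerance)."""
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 and bg.sum() == 0:
        return 1.0
    if bp.sum() == 0 or bg.sum() == 0:
        return 0.0
    precision = (bp & dilate(bg, tol)).sum() / bp.sum()
    recall = (bg & dilate(bp, tol)).sum() / bg.sum()
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def robustness_summary(scores):
    """Mean, median, interquartile range, and worst case of per-image scores."""
    s = np.asarray(scores, dtype=float)
    q1, med, q3 = np.percentile(s, [25, 50, 75])
    return {"mean": float(s.mean()), "median": float(med),
            "iqr": float(q3 - q1), "worst": float(s.min())}


# Toy example: a one-pixel-thick stroke predicted one row too low.
gt = np.zeros((12, 12), dtype=bool)
gt[5, 2:9] = True
pred = np.zeros((12, 12), dtype=bool)
pred[6, 2:9] = True
shifted_region = region_f1(pred, gt)
shifted_boundary = boundary_f1(pred, gt, tol=2)
```

In this toy case the region score collapses because the two thin strokes share no pixels, while the tolerance-based boundary score still credits the near miss. That gap between the two is why the scores behave so differently on thin structures.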

Paper Structure (63 sections, 5 equations, 13 figures, 7 tables)

Introduction
Extreme class imbalance.
Thin-structure failures.
Standard vs. boundary metrics.
Contributions.
Related Work
Document and Whiteboard Segmentation
Thin-Structure Segmentation
Loss Functions for Class Imbalance
Modern Architectures
Boundary-Aware Evaluation
Boundary-Aware and Topology-Aware Losses
Method
Dataset
Offline augmentation.
...and 48 more sections

Figures (13)

Figure 1: Region and boundary metrics per loss function. Dice-based objectives form a clearly separated, higher-performance cluster.
Figure 2: Core vs. thin F1 per loss. Dice-family losses narrow the gap, with Tversky showing the most balanced performance.
Figure 3: Core--thin F1 gap per loss. A smaller gap indicates more equitable treatment of thin strokes.
Figure 4: Resolution comparison for Dice+Focal. Boundary metrics benefit more than region metrics from higher resolution.
Figure 5: Classical baseline failure cases. Rows are the three images where Sauvola binarization scores lowest. Columns: original image, ground truth, Sauvola prediction, Tversky prediction, and error overlays (green = TP, red = FN, blue = FP). The deep model provides more consistent segmentation on challenging boards.
...and 8 more figures
