SA-UNetv2: Rethinking Spatial Attention U-Net for Retinal Vessel Segmentation
Changlu Guo, Anders Nymark Christensen, Anders Bjorholm Dahl, Yugen Yi, Morten Rieger Hannemose
TL;DR
The paper addresses retinal vessel segmentation under severe class imbalance, where vessel pixels typically make up $<10\%$ of the image. It adds Cross-scale Spatial Attention (CSA) to all skip connections and trains with a differentiable MCC loss combined with BCE: $\mathcal{L}_{\mathrm{total}}=\lambda_1\mathcal{L}_{\mathrm{BCE}}+\lambda_2\mathcal{L}_{\mathrm{MCC}}$, where $\mathcal{L}_{\mathrm{BCE}}=-\frac{1}{N}\sum_i[y_i\log p_i+(1-y_i)\log(1-p_i)]$ for predicted probability $p_i$ and label $y_i$ at pixel $i$, and $\mathcal{L}_{\mathrm{MCC}}=1-\frac{\mathrm{TP}\cdot\mathrm{TN}-\mathrm{FP}\cdot\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}+\epsilon}$ with $\epsilon=10^{-7}$. On DRIVE and STARE, SA-UNetv2 achieves state-of-the-art accuracy with a lightweight footprint of about $0.26$M parameters and roughly $21$ GFLOPs, which allows CPU inference in about one second on $592\times592\times3$ images and practical deployment in resource-constrained settings. The approach improves the delineation of fine vessels and generalizes across datasets, offering a favorable accuracy-efficiency trade-off for clinical retinal image analysis.
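The combined loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the confusion-matrix terms are computed as "soft" counts (sums of predicted probabilities) so that the MCC term stays differentiable, and the function names, the clipping constant, and the default weights `lam1 = lam2 = 1.0` are placeholders rather than values taken from the paper.

```python
import numpy as np


def bce_loss(p, y, clip=1e-7):
    """Mean binary cross-entropy over all pixels.

    p: predicted vessel probabilities in [0, 1]; y: binary labels.
    Clipping (an assumption of this sketch) avoids log(0).
    """
    p = np.clip(p, clip, 1.0 - clip)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def mcc_loss(p, y, eps=1e-7):
    """1 - MCC, with TP/TN/FP/FN as soft counts (assumed formulation)."""
    tp = np.sum(p * y)
    tn = np.sum((1.0 - p) * (1.0 - y))
    fp = np.sum(p * (1.0 - y))
    fn = np.sum((1.0 - p) * y)
    numerator = tp * tn - fp * fn
    denominator = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps
    return float(1.0 - numerator / denominator)


def total_loss(p, y, lam1=1.0, lam2=1.0):
    """Weighted sum lam1 * BCE + lam2 * MCC loss; weights are placeholders."""
    return lam1 * bce_loss(p, y) + lam2 * mcc_loss(p, y)


if __name__ == "__main__":
    # Toy imbalanced example: 1 vessel pixel out of 10.
    y = np.array([1.0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    p = np.array([0.8, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
    print("BCE:", bce_loss(p, y))
    print("MCC loss:", mcc_loss(p, y))
    print("total:", total_loss(p, y))
```

Under this formulation the MCC term ranges from roughly 0 (perfect prediction) to roughly 2 (fully inverted prediction). Predicting background everywhere gives an MCC loss of 1, so the trivial all-background solution is penalized even when vessels are rare.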
Abstract
Retinal vessel segmentation is essential for early diagnosis of diseases such as diabetic retinopathy, hypertension, and neurodegenerative disorders. Although SA-UNet introduces spatial attention in the bottleneck, it underuses attention in skip connections and does not address the severe foreground-background imbalance. We propose SA-UNetv2, a lightweight model that injects cross-scale spatial attention into all skip connections to strengthen multi-scale feature fusion and adopts a weighted Binary Cross-Entropy (BCE) plus Matthews Correlation Coefficient (MCC) loss to improve robustness to class imbalance. On the public DRIVE and STARE datasets, SA-UNetv2 achieves state-of-the-art performance with only 1.2 MB of memory and 0.26M parameters (less than 50% of SA-UNet), and 1-second CPU inference on $592 \times 592 \times 3$ images, demonstrating strong efficiency and deployability in resource-constrained, CPU-only settings.
