Assessing the generalization performance of SAM for ureteroscopy scene understanding
Martin Villagrana, Francisco Lopez-Tiro, Clement Larose, Gilberto Ochoa-Ruiz, Christian Daul
TL;DR
This work addresses the challenge of robust kidney stone segmentation in ureteroscopy images by evaluating the Segment Anything Model (SAM) against traditional U‑Net variants. Using four diverse datasets and two- and three-class configurations, SAM is trained on a single distribution and tested on both in-distribution and out-of-distribution data, demonstrating competitive in-distribution performance ($\text{Accuracy}=97.68\pm3.04$, $\text{Dice}=97.78\pm2.47$, $\text{IoU}=95.76\pm4.18$) while delivering significantly stronger generalization on unseen data (outperforming all U‑Net variants by up to $23\%$ IoU). A two-phase SAM setup enables effective three-class segmentation (stone, laser fiber, tissue) without retraining, achieving high cross-domain accuracy. Overall, SAM exhibits precise, artifact-free segmentations and robust cross-dataset transfer, signaling strong potential for scalable clinical image analysis in variable ureteroscopic environments.
Abstract
The segmentation of kidney stones is regarded as a critical preliminary step to enable the identification of urinary stone types through machine- or deep-learning-based approaches. In urology, manual segmentation is considered tedious and impractical due to the typically large scale of image databases and the continuous generation of new data. In this study, the potential of the Segment Anything Model (SAM), a state-of-the-art deep learning framework, is investigated for the automation of kidney stone segmentation. The performance of SAM is evaluated in comparison to traditional models, including U-Net, Residual U-Net, and Attention U-Net, which, despite their efficiency, frequently exhibit limitations in generalizing to unseen datasets. The findings highlight SAM's superior adaptability and efficiency. While SAM achieves comparable performance to U-Net on in-distribution data (Accuracy: 97.68 ± 3.04; Dice: 97.78 ± 2.47; IoU: 95.76 ± 4.18), it demonstrates significantly enhanced generalization capabilities on out-of-distribution data, surpassing all U-Net variants by margins of up to 23 percent.
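For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how pixel accuracy, Dice, and IoU are commonly computed for a single binary mask (e.g. stone vs. background). It uses the standard textbook definitions and is not the authors' evaluation code; the function name `segmentation_metrics`, the toy masks, and the convention of scoring two empty masks as 1.0 are assumptions made for illustration. The abstract does not state how per-image scores are aggregated into the reported mean and spread, so that step is left out.

```python
import numpy as np


def segmentation_metrics(pred, target):
    """Pixel accuracy, Dice, and IoU for one pair of binary masks.

    Hypothetical helper for illustration; standard definitions only.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")

    # Confusion-matrix counts over all pixels.
    tp = int(np.logical_and(pred, target).sum())    # foreground in both
    fp = int(np.logical_and(pred, ~target).sum())   # predicted only
    fn = int(np.logical_and(~pred, target).sum())   # ground truth only
    tn = int(np.logical_and(~pred, ~target).sum())  # background in both

    accuracy = (tp + tn) / (tp + fp + fn + tn)

    # Assumption: two empty masks count as a perfect match (score 1.0).
    dice_denom = 2 * tp + fp + fn
    dice = 2 * tp / dice_denom if dice_denom else 1.0

    union = tp + fp + fn
    iou = tp / union if union else 1.0

    return {"accuracy": accuracy, "dice": dice, "iou": iou}


if __name__ == "__main__":
    # Toy 2x3 masks: 2 true positives, 1 false positive, 1 false negative.
    pred = [[1, 1, 0], [0, 1, 0]]
    target = [[1, 0, 0], [0, 1, 1]]
    print(segmentation_metrics(pred, target))
```

For a single mask pair, Dice and IoU are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. This relation holds per image and need not hold exactly for averaged scores.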
