Towards Clinician-Preferred Segmentation: Leveraging Human-in-the-Loop for Test Time Adaptation in Medical Image Segmentation
Shishuai Hu, Zehui Liao, Zeyou Liu, Yong Xia
TL;DR
This work tackles cross-center distribution shifts in medical image segmentation by introducing HiTTA, a Human-in-the-loop Test Time Adaptation framework that combines a divergence loss on Batch Normalization (BN) parameters with clinician-corrected feedback. It operates in three stages: (1) pre-inference, which uses style augmentation to adapt BN parameters via the divergence loss $\mathcal{L}_{div}$; (2) inference, in which a clinician corrects the prediction $\hat{y}_i^t$ to $y_i^t$; and (3) post-inference, in which a preference head $\mathcal{H}_{\theta_i^h}$ is trained using $\mathcal{L}_{seg}$ and weighted by $1+\mathcal{M}_{div}$ to reflect human feedback. Evaluated on RIGA+, a cross-domain, multi-annotator optic disc/optic cup (OD/OC) segmentation dataset, with the Dice Similarity Coefficient as the metric, HiTTA outperforms eight baselines, and ablation studies confirm the critical roles of both the divergence loss and the human-in-the-loop late-stage optimization. The results demonstrate that incorporating clinician feedback into TTA improves clinical alignment and generalization across medical centers, offering a path toward more practical, human-aware AI-assisted diagnostic tools in ophthalmic imaging and beyond.
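To make the evaluation metric and the $1+\mathcal{M}_{div}$ weighting concrete, the following is a minimal sketch rather than the authors' implementation. The Dice Similarity Coefficient follows its standard definition. The weighted loss rests on assumptions not stated in this summary: that $\mathcal{L}_{seg}$ is a pixel-wise binary cross-entropy, that the weight $1+\mathcal{M}_{div}$ multiplies the per-pixel loss, and that $\mathcal{M}_{div}$ is a per-pixel map with values in $[0, 1]$. The function names `dice_coefficient` and `weighted_seg_loss` are hypothetical.

```python
import math
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int], eps: float = 1e-8) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for flattened binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # eps keeps the ratio defined when both masks are empty
    return (2.0 * intersection + eps) / (total + eps)


def weighted_seg_loss(
    prob: Sequence[float],
    target: Sequence[int],
    m_div: Sequence[float],
    eps: float = 1e-7,
) -> float:
    """Mean binary cross-entropy with per-pixel weight (1 + m_div).

    Assumption: L_seg is taken to be binary cross-entropy and m_div is a
    per-pixel divergence map in [0, 1]; the paper may define both differently.
    """
    weighted_sum = 0.0
    for p, t, m in zip(prob, target, m_div):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        bce = -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
        weighted_sum += (1.0 + m) * bce
    return weighted_sum / len(prob)


# Clinician-corrected mask vs. a thresholded prediction (toy values)
corrected = [1, 1, 0, 0]
predicted = [1, 0, 0, 0]
print(dice_coefficient(predicted, corrected))

# A pixel with a high divergence weight contributes more to the loss
print(weighted_seg_loss([0.9, 0.2], [1, 1], [0.0, 0.0]))
print(weighted_seg_loss([0.9, 0.2], [1, 1], [0.0, 1.0]))
```

Under these assumptions, a pixel with $\mathcal{M}_{div}=0$ is penalized exactly as in the unweighted loss, while a pixel with $\mathcal{M}_{div}=1$ counts twice, so regions flagged by the divergence map dominate the update.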
Abstract
Deep learning-based medical image segmentation models often suffer performance degradation when deployed across different medical centers, largely due to discrepancies in data distribution. Test Time Adaptation (TTA) methods, which adapt pre-trained models to test data, have been employed to mitigate such discrepancies. However, existing TTA methods primarily focus on manipulating Batch Normalization (BN) layers or employing prompt and adversarial learning, which may not effectively rectify the inconsistencies arising from divergent data distributions. In this paper, we propose a novel Human-in-the-loop TTA (HiTTA) framework that stands out in two significant ways. First, it capitalizes on the largely overlooked potential of clinician-corrected predictions, integrating these corrections into the TTA process to steer the model towards predictions that align more closely with clinical annotation preferences. Second, our framework introduces a divergence loss, designed specifically to reduce the prediction divergence caused by domain disparities through careful calibration of BN parameters. HiTTA is thus able both to adapt to the distribution of the test data and to keep the model's predictions aligned with clinical expectations, thereby enhancing its relevance in a medical context. Extensive experiments on the public RIGA+ dataset demonstrate the superiority of HiTTA over existing TTA methods, highlighting the advantages of integrating human feedback and our divergence loss for improving the model's performance and adaptability across diverse medical centers.
