MedSapiens: Taking a Pose to Rethink Medical Imaging Landmark Detection
Marawan Elbatel, Anbang Wang, Keyuan Liu, Kaouther Mouheb, Enrique Almar-Munoz, Lizhuo Lin, Yanqi Yang, Karim Lekadir, Xiaomeng Li
TL;DR
This work investigates adapting a human-centric foundation model trained for pose estimation to medical imaging landmark detection. MedSapiens pairs a Sapiens vision transformer backbone, fine-tuned with LoRA on a harmonized, multi-dataset collection of anatomical landmarks, with a heatmap-based decoding head that localizes the landmarks in each image. The approach outperforms both generalist and specialist baselines across multiple datasets and shows notable few-shot generalization on a dental landmark task, indicating strong cross-task adaptability. By preserving spatial priors from large-scale pretraining while efficiently adapting to medical data, MedSapiens offers a practical path toward robust, data-efficient anatomical landmark detection in clinical settings.
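To make the heatmap-based decoding step concrete, the following is a minimal sketch of how one landmark coordinate can be read from each predicted heatmap by taking its peak. It is an illustration only: the function name `decode_heatmaps` and the plain argmax rule are assumptions, not the MedSapiens implementation, and practical pipelines often add sub-pixel refinement and rescaling to the input resolution.

```python
def decode_heatmaps(heatmaps):
    """Return one (x, y) peak location per landmark heatmap.

    `heatmaps` is a list of 2D lists (rows of scores), one per landmark.
    Hypothetical helper: plain argmax decoding, no sub-pixel refinement.
    """
    coords = []
    for hm in heatmaps:
        # Enumerate every pixel and keep the one with the highest score.
        best = max(
            ((x, y) for y, row in enumerate(hm) for x in range(len(row))),
            key=lambda xy: hm[xy[1]][xy[0]],
        )
        coords.append(best)
    return coords


# Two toy 2x2 heatmaps, one per landmark.
toy_heatmaps = [
    [[0.0, 0.1],
     [0.9, 0.2]],
    [[0.3, 0.8],
     [0.0, 0.0]],
]
peaks = decode_heatmaps(toy_heatmaps)
```

Coordinates are returned in (x, y) order, that is, column first and row second, which is an arbitrary choice made for this sketch.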
Abstract
This paper does not introduce a novel architecture; instead, it revisits a fundamental yet overlooked baseline: adapting human-centric foundation models for anatomical landmark detection in medical imaging. While landmark detection has traditionally relied on domain-specific models, the emergence of large-scale pre-trained vision models presents new opportunities. In this study, we investigate the adaptation of Sapiens, a human-centric foundation model designed for pose estimation, to medical imaging through multi-dataset pretraining, establishing a new state of the art across multiple datasets. Our proposed model, MedSapiens, demonstrates that human-centric foundation models, inherently optimized for spatial pose localization, provide strong priors for anatomical landmark detection, yet this potential has remained largely untapped. We benchmark MedSapiens against existing state-of-the-art models, achieving up to 5.26% improvement over generalist models and up to 21.81% improvement over specialist models in the average success detection rate (SDR). To further assess MedSapiens' adaptability to novel downstream tasks with few annotations, we evaluate its performance in limited-data settings, achieving 2.69% improvement over the few-shot state of the art in SDR. Code and model weights are available at https://github.com/xmed-lab/MedSapiens .
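For readers unfamiliar with the metric, the sketch below shows one common way SDR is computed: the percentage of landmarks whose radial (Euclidean) error falls within a distance threshold, averaged over several thresholds. The function names, the inclusive comparison, the isotropic `pixel_spacing` parameter, and the example thresholds are all assumptions for illustration; the abstract does not specify the exact evaluation protocol.

```python
import math


def radial_errors(pred, gt, pixel_spacing=1.0):
    """Euclidean distance per landmark, scaled by an assumed isotropic spacing."""
    return [math.dist(p, g) * pixel_spacing for p, g in zip(pred, gt)]


def sdr(pred, gt, threshold, pixel_spacing=1.0):
    """Percentage of landmarks with radial error <= threshold (assumed inclusive)."""
    errors = radial_errors(pred, gt, pixel_spacing)
    hits = sum(e <= threshold for e in errors)
    return 100.0 * hits / len(errors)


def average_sdr(pred, gt, thresholds, pixel_spacing=1.0):
    """Mean SDR over a list of thresholds."""
    return sum(sdr(pred, gt, t, pixel_spacing) for t in thresholds) / len(thresholds)


# Toy example: four landmarks with radial errors of 1, 3, 5, and 0 units.
pred = [(10, 10), (20, 23), (30, 30), (40, 40)]
gt = [(10, 11), (20, 20), (35, 30), (40, 40)]
result = average_sdr(pred, gt, thresholds=[2.0, 4.0])
```

In this toy case two of four landmarks fall within 2 units and three of four within 4 units, so the per-threshold rates are 50% and 75%.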
