MedSapiens: Taking a Pose to Rethink Medical Imaging Landmark Detection

Marawan Elbatel, Anbang Wang, Keyuan Liu, Kaouther Mouheb, Enrique Almar-Munoz, Lizhuo Lin, Yanqi Yang, Karim Lekadir, Xiaomeng Li

TL;DR

This work investigates adapting a human-centric foundation model trained for pose estimation to medical imaging landmark detection. MedSapiens uses a Sapiens vision transformer backbone with LoRA-based fine-tuning on a harmonized, multi-dataset collection of anatomical landmarks and a heatmap-based decoding head to localize up to N landmarks per image. The approach achieves state-of-the-art results compared to both generalist and specialist baselines across multiple datasets and demonstrates notable few-shot generalization on a dental landmark task, indicating strong cross-task adaptability. By preserving spatial priors from large-scale pretraining while efficiently adapting to medical data, MedSapiens offers a practical path toward robust, data-efficient anatomical landmark detection in clinical settings.
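As an illustration of what heatmap-based decoding typically involves, the following is a minimal sketch, not the authors' implementation. It assumes the head outputs one heatmap per landmark and that each landmark is read off as the location of the heatmap maximum. The function names `decode_heatmaps` and `gaussian_heatmap`, the array shapes, and the value of `sigma` are all hypothetical choices made for this example.

```python
import numpy as np


def decode_heatmaps(heatmaps):
    """Return (x, y) peak locations and peak scores for K heatmaps.

    heatmaps: array of shape (K, H, W), one channel per landmark.
    Assumption: simple argmax decoding; real systems often add
    sub-pixel refinement, which is omitted here.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)                 # flat index of each peak
    ys, xs = np.unravel_index(idx, (h, w))    # back to row/column
    coords = np.stack([xs, ys], axis=1).astype(np.float64)
    scores = flat[np.arange(k), idx]          # peak value as confidence
    return coords, scores


def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Synthetic target heatmap: a 2D Gaussian centred on (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))


# Two synthetic landmarks on a 64 x 48 (H x W) grid.
demo = np.stack([
    gaussian_heatmap(64, 48, cx=10, cy=20),
    gaussian_heatmap(64, 48, cx=30, cy=5),
])
coords, scores = decode_heatmaps(demo)
```

Here `coords` holds one (x, y) pair per landmark in heatmap pixel units; mapping back to the original image resolution would require rescaling by the ratio of image size to heatmap size.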

Abstract

This paper does not introduce a novel architecture; instead, it revisits a fundamental yet overlooked baseline: adapting human-centric foundation models for anatomical landmark detection in medical imaging. While landmark detection has traditionally relied on domain-specific models, the emergence of large-scale pre-trained vision models presents new opportunities. In this study, we investigate the adaptation of Sapiens, a human-centric foundation model designed for pose estimation, to medical imaging through multi-dataset pretraining, establishing a new state of the art across multiple datasets. Our proposed model, MedSapiens, demonstrates that human-centric foundation models, inherently optimized for spatial pose localization, provide strong priors for anatomical landmark detection, yet this potential has remained largely untapped. We benchmark MedSapiens against existing state-of-the-art models, achieving up to 5.26% improvement over generalist models and up to 21.81% improvement over specialist models in the average success detection rate (SDR). To further assess the adaptability of MedSapiens to novel downstream tasks with few annotations, we evaluate its performance in limited-data settings, achieving 2.69% improvement over the few-shot state of the art in SDR. Code and model weights are available at https://github.com/xmed-lab/MedSapiens.
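For readers unfamiliar with the SDR metric, the sketch below shows one common way it is computed: the percentage of predicted landmarks whose distance to the ground truth falls within a threshold. This is an illustrative assumption, not the paper's evaluation code. The function name `success_detection_rate`, the `spacing` parameter, the threshold values, and the use of `<=` rather than `<` are all choices made for this example.

```python
import numpy as np


def success_detection_rate(pred, gt, thresholds=(2.0, 2.5, 3.0, 4.0), spacing=1.0):
    """Percentage of landmarks within each distance threshold.

    pred, gt: arrays of shape (N, 2) with landmark coordinates in pixels.
    spacing:  physical size of one pixel (assumed isotropic), so that
              distances and thresholds share the same unit.
    Assumption: a landmark counts as detected when its radial error is
    less than or equal to the threshold.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    dist = np.linalg.norm((pred - gt) * spacing, axis=-1)
    return {t: float((dist <= t).mean() * 100.0) for t in thresholds}


# Four hypothetical predictions with radial errors 1, 2.2, 3 and 5.
gt_points = np.zeros((4, 2))
pred_points = np.array([[1.0, 0.0], [0.0, 2.2], [3.0, 0.0], [5.0, 0.0]])
sdr = success_detection_rate(pred_points, gt_points)
```

With these made-up points, one of four landmarks lies within a threshold of 2.0, two within 2.5, and three within 3.0 and 4.0. An average SDR, as reported in the abstract, would then be a mean over thresholds or datasets, depending on the evaluation protocol.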

Paper Structure

This paper contains 8 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: (a) Sapiens [khirodkar2024sapiens], a foundation model devised for vision-centric tasks such as pose estimation. (b) Anatomical landmark detection shares a hierarchical structure with human-centric tasks. (c) By adapting Sapiens for anatomical landmark detection, MedSapiens surpasses the existing SOTA generalist model UniverDetect [univerdetect2024_landmark].
  • Figure 2: Overall framework for MedSapiens.
  • Figure 3: Qualitative results of NFDP and MedSapiens on the four testing datasets, where red points indicate predicted landmarks and green points represent ground-truth labels.
  • Figure 4: Qualitative comparison of our MedSapiens method with existing baselines: (a) Results on dental images, where red points indicate predicted landmarks and green points represent ground-truth labels. Our MedSapiens achieves superior alignment with the ground truth compared to other methods. (b) Convergence analysis of MedSapiens vs. Sapiens in terms of end-point error across epochs. Our method exhibits faster and more stable convergence with lower error.