FM-OSD: Foundation Model-Enabled One-Shot Detection of Anatomical Landmarks

Juzheng Miao, Cheng Chen, Keli Zhang, Jie Chuai, Quanzheng Li, Pheng-Ann Heng

TL;DR

This work tackles anatomical landmark detection when labeled data are scarce by leveraging frozen visual foundation-model encoders. It introduces FM-OSD, a coarse-to-fine framework that adds lightweight global and local decoders trained with a distance-aware similarity loss, together with a bidirectional matching strategy, to robustly localize landmarks from a single template image. The approach achieves state-of-the-art performance on two public X-ray datasets, reducing mean radial error by over 16% relative to strong one-shot baselines without using any unlabeled data. By removing the reliance on large unlabeled datasets, FM-OSD makes one-shot landmark detection more practical in clinical settings; future work aims to extend it to 3D data and cross-modality scenarios.
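
The mean radial error mentioned above is conventionally the average Euclidean distance between predicted and ground-truth landmarks. The sketch below illustrates that standard definition only; the function name, the toy coordinates, and the `spacing` parameter (physical units per pixel) are assumptions made for illustration and are not taken from the paper.

```python
import math

def mean_radial_error(pred, truth, spacing=1.0):
    """Average Euclidean distance between paired (x, y) landmarks.

    `spacing` is a hypothetical pixel-to-physical-unit factor (e.g. mm per pixel).
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and the same length")
    dists = [
        math.hypot(px - tx, py - ty) * spacing
        for (px, py), (tx, ty) in zip(pred, truth)
    ]
    return sum(dists) / len(dists)

# Toy landmarks (illustrative values only): radial errors of 5 px and 2 px.
pred = [(3.0, 4.0), (10.0, 10.0)]
truth = [(0.0, 0.0), (10.0, 12.0)]
mre = mean_radial_error(pred, truth, spacing=0.1)  # approximately 0.35
```

A relative improvement such as the one quoted in the TL;DR would then be computed as (baseline MRE − new MRE) / baseline MRE.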

Abstract

One-shot detection of anatomical landmarks is gaining significant attention for its efficiency in using minimal labeled data to produce promising results. However, the success of current methods relies heavily on extensive unlabeled data to pre-train an effective feature extractor, which limits their applicability in scenarios where a substantial amount of unlabeled data is unavailable. In this paper, we propose the first foundation model-enabled one-shot landmark detection (FM-OSD) framework for accurate landmark detection in medical images, using only a single template image without any additional unlabeled data. Specifically, we use the frozen image encoder of visual foundation models as the feature extractor and introduce dual-branch global and local feature decoders to increase the resolution of the extracted features in a coarse-to-fine manner. The feature decoders are efficiently trained with a distance-aware similarity learning loss to incorporate domain knowledge from the single template image. Moreover, a novel bidirectional matching strategy is developed to improve both the robustness and accuracy of landmark detection when the similarity maps obtained from foundation models are scattered. We validate our method on two public anatomical landmark detection datasets. Using only a single template image, our method demonstrates significant superiority over strong state-of-the-art one-shot landmark detection methods.
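
The abstract does not spell out how the bidirectional matching works, so the following is only a minimal sketch of one plausible reading: a forward-backward consistency check. It assumes that the template landmark's feature is matched to the top-k most similar query positions (forward), that each candidate is matched back to the template (backward), and that the candidate whose back-match lands closest to the original landmark is kept. All function names, the cosine similarity measure, the value of `k`, and the toy features are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)

def best_matches(vec, feat_map, k=1):
    """Return the k positions in feat_map whose features are most similar to vec."""
    ranked = sorted(feat_map, key=lambda pos: cosine(vec, feat_map[pos]), reverse=True)
    return ranked[:k]

def bidirectional_match(template, query, landmark, k=3):
    """Pick the forward candidate whose backward match is nearest the landmark."""
    # Forward pass: template landmark -> top-k candidate positions in the query.
    candidates = best_matches(template[landmark], query, k)

    # Backward pass: map each candidate back to the template and measure
    # how far the back-match lands from the original landmark.
    def back_error(cand):
        back = best_matches(query[cand], template, 1)[0]
        return math.dist(back, landmark)

    # Ties resolve to the candidate with the higher forward similarity.
    return min(candidates, key=back_error)

def unit(deg):
    """Toy 2-D unit feature pointing at the given angle (illustration only)."""
    return [math.cos(math.radians(deg)), math.sin(math.radians(deg))]

# Toy feature maps: position -> feature vector.
template = {(0, 0): unit(0), (2, 3): unit(50), (5, 5): unit(60)}
query = {(0, 0): unit(5), (2, 4): unit(43), (6, 5): unit(56), (7, 7): unit(120)}
landmark = (2, 3)

forward_only = best_matches(template[landmark], query, 1)[0]   # (6, 5), a distractor
matched = bidirectional_match(template, query, landmark, k=2)  # (2, 4)
```

In this toy setup the forward-only match picks a distractor whose feature is actually closer to a different template position; the backward check rejects it. This is the kind of failure on a scattered similarity map that the abstract says the bidirectional strategy is meant to address.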

Paper Structure (7 sections, 2 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Overview of our proposed method. In training, two lightweight decoders are updated using distance-aware similarity learning while the foundation-model encoder is kept frozen. A bidirectional matching strategy applied to the combination of global and local features is then used to find robust landmark predictions on query images.
  • Figure 2: Visualizations of different methods on the head and hand datasets. Red and green points indicate predicted landmarks and ground-truth labels, respectively.
  • Figure 2: Ablation studies of different components of our method on the hand dataset.
  • Figure 3: Effects of various losses for training $\mathcal{D}_G$.
  • Figure 4: Zero-shot performance using various (a) models, (b) heads, and (c) layers.