Adaptive Query Prompting for Multi-Domain Landmark Detection

Yuhui Li, Qiusen Wei, Guoheng Huang, Xiaochen Yuan, Xuhang Chen, Guo Zhong, Jianwen Huang, Jiajie Huang

TL;DR

This work tackles the challenge of cross-domain medical landmark detection by proposing Adaptive Query Prompting (AQP), a prompting framework that conditions a frozen vision-transformer backbone with a memory prompt pool and instance-aware prompts. Coupled with Light-MLD, a lightweight decoder, the approach enables a single model to handle head, hand, and chest landmark detection with limited parameter updates. The method combines a prompt pool, a query mechanism based on cosine similarity, and adaptors to generate per-layer prompts, optimized together with heatmap regression losses; empirical results on three X-ray datasets demonstrate competitive or state-of-the-art performance while reducing training cost. The findings suggest that prompt-based conditioning can generalize well across domains and tasks, offering a scalable path toward universal medical landmark detection.

Abstract

Medical landmark detection is crucial in various medical imaging modalities and procedures. Although deep learning-based methods have achieved promising performance, they are mostly designed for specific anatomical regions or tasks. In this work, we propose a universal model for multi-domain landmark detection by leveraging the transformer architecture and developing a prompting component named Adaptive Query Prompting (AQP). Instead of embedding additional modules in the backbone network, we design a separate module to generate prompts that can be effectively extended to any other transformer network. In our proposed AQP, prompts are learnable parameters maintained in a memory space called the prompt pool. The central idea is to keep the backbone frozen and optimize the prompts to instruct the model inference process. Furthermore, we employ a lightweight decoder, namely Light-MLD, to decode landmarks from the extracted features. Thanks to the lightweight nature of the decoder and AQP, we can handle multiple datasets by sharing the backbone encoder and performing only partial parameter tuning, without incurring much additional cost. The approach has the potential to be extended to more landmark detection tasks. We conduct experiments on three widely used X-ray datasets for different medical landmark detection tasks. Our proposed Light-MLD coupled with AQP achieves SOTA performance on many metrics even without the use of elaborate structural designs or complex frameworks.

Paper Structure

This paper contains 26 sections, 6 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Overview of the AQP framework. Compared with typical task-specific methods and existing universal models, which adapt entire model weights to deal with new tasks, AQP uses a single frozen backbone model and learns a prompt pool to instruct the model conditionally.
  • Figure 2: The framework of Light-MLD. The backbone consists of several feature embedding layers and vision transformer blocks, followed by several decoder layers.
  • Figure 3: The architecture of the proposed AQP, where the operator $\boxplus$ denotes concatenation and $\bigoplus$ denotes element-wise addition. The goal of the Adaptor is to merge the selected prompts and align them with the input.
  • Figure 4: Qualitative results on the head, hand, and chest datasets. All images are randomly selected. The red points $\bullet$ are the landmarks predicted by our model, while the green points $\bullet$ are the ground truth labels.
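
The prompt selection step summarized above (a prompt pool queried by cosine similarity, Figure 3) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each prompt in the pool is paired with a learnable key vector and that the top-k best-matching prompts are selected for a given query feature; the key/prompt pairing, the function names `cosine_similarity` and `select_prompts`, the parameter `top_k`, and all toy values are hypothetical. The Adaptor that merges the selected prompts is not modeled here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_prompts(query, keys, prompts, top_k=2):
    """Return the indices and prompts whose keys best match the query.

    Assumption: keys[i] is the key associated with prompts[i]. The source
    only states that the pool is queried with cosine similarity.
    """
    scores = [cosine_similarity(query, key) for key in keys]
    # Rank pool entries from most to least similar to the query.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:top_k]
    return chosen, [prompts[i] for i in chosen]


# Toy pool with three entries (hypothetical values, for illustration only).
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
prompts = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]

# A query feature, standing in for an image-level feature from the frozen backbone.
query = [0.9, 0.1]

indices, selected = select_prompts(query, keys, prompts, top_k=2)
print(indices)   # indices of the two best-matching pool entries
print(selected)  # the corresponding prompt vectors
```

In a trainable setting the keys and prompts would be learnable tensors while the backbone stays frozen; plain Python lists are used here only to keep the sketch dependency-free.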