HYATT-Net is Grand: A Hybrid Attention Network for Performant Anatomical Landmark Detection

Xiaoqian Zhou, Zhen Huang, Heqin Zhu, Qingsong Yao, S. Kevin Zhou

TL;DR

HYATT-Net tackles anatomical landmark detection by marrying CNNs with Transformers through a dynamic sparse BiFormer module and an Attention Residual Module (ARM). A Feature Fusion Correction Module (FFCM) and deep supervision further integrate global context with fine-grained local details, delivering accurate, robust ALD across high-resolution medical images. Extensive experiments on head, hand, and pelvic X-ray datasets demonstrate state-of-the-art mean radial error and detection rates, highlighting both accuracy and efficiency benefits. The approach offers practical implications for image-guided procedures and lays groundwork for future 3D extensions and broader clinical deployment.

Abstract

Anatomical landmark detection (ALD) from a medical image is crucial for a wide array of clinical applications. While existing methods achieve considerable success in ALD, they often struggle to balance global context with computational efficiency, particularly with high-resolution images, which raises a natural question: where is the performance limit of ALD? In this paper, we aim to forge performant ALD by proposing a HYbrid ATTention Network (HYATT-Net) with the following designs: (i) A novel hybrid architecture that integrates CNNs and Transformers. Its core is the BiFormer module, which uses Bi-Level Routing Attention to attend efficiently to relevant image regions. Combined with the Attention Residual Module (ARM), this enables precise local feature refinement guided by the global context. (ii) A Feature Fusion Correction Module that aggregates multi-scale features and thus mitigates resolution loss. Deep supervision with a mean-square error loss on multi-resolution heatmaps optimizes the model. Experiments on five diverse datasets demonstrate state-of-the-art performance, surpassing existing methods in accuracy, robustness, and efficiency. HYATT-Net provides a promising solution for accurate and efficient ALD in complex medical images. Our code and data are released at: https://github.com/ECNUACRush/HYATT-Net.
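To make the deep supervision idea in the abstract concrete, the following is a minimal NumPy sketch of a mean-square error loss summed over heatmaps predicted at several resolutions. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian target heatmaps, the fixed `sigma`, the per-scale `weights`, and the function names `gaussian_heatmap` and `deep_supervision_mse` are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def gaussian_heatmap(h, w, cx, cy, sigma):
    """Target heatmap of size (h, w) with a Gaussian peak at (cx, cy).

    Assumption: Gaussian targets are a common choice for heatmap
    regression; the abstract does not state the target shape.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def deep_supervision_mse(preds, landmarks, full_size, sigma=2.0, weights=None):
    """Weighted sum of per-scale MSE losses.

    preds:     list of arrays shaped (N, h_s, w_s), one per resolution
    landmarks: array shaped (N, 2) holding (x, y) in full-resolution pixels
    full_size: (H, W) of the full-resolution image
    weights:   optional per-scale weights (hypothetical, default 1.0 each)
    """
    full_h, full_w = full_size
    if weights is None:
        weights = [1.0] * len(preds)
    total = 0.0
    for pred, weight in zip(preds, weights):
        _, h, w = pred.shape
        # Rescale landmark coordinates to this heatmap's resolution.
        sx, sy = w / full_w, h / full_h
        target = np.stack(
            [gaussian_heatmap(h, w, x * sx, y * sy, sigma) for x, y in landmarks]
        )
        total += weight * np.mean((pred - target) ** 2)
    return total
```

With one landmark at `(8, 4)` in a 16×16 image, passing predictions at 16×16 and 8×8 that equal the rescaled Gaussian targets yields a loss of zero, while all-zero predictions yield a positive loss. Every scale contributes its own term, so intermediate heatmaps receive a training signal directly rather than only through the final output.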

Paper Structure

This paper contains 22 sections, 9 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Overview of the proposed Hybrid Attention Network (HYATT-Net). BiFormer is a module based on bi-level routing attention, and ARM stands for an Attention Residual Block. Further details will be discussed later. $N$ denotes the number of landmarks.
  • Figure 2: (a) Architecture of the BiFormer Block. (b) Architecture of the proposed Attention Residual Block. (c) Overview of the Convolutional Block Attention Module (CBAM).
  • Figure 3: Illustration of region-to-region routing and token-to-token attention. Our approach leverages sparsity by gathering key-value pairs from the top-$k$ related windows, bypassing irrelevant computations and focusing on GPU-friendly dense matrix multiplications for improved efficiency.
  • Figure 4: Visualizations of various methods on the ISBI2015 dataset. The red points represent the predicted landmarks, while the green points correspond to the ground truth labels. Local details are provided below for a clearer comparison of the results. The MRE value is shown in the top-left corner for reference.
  • Figure 5: Visualizations of various methods on the ISBI2023 dataset.
  • ...and 3 more figures
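The region-to-region routing and token-to-token attention described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a simplified, single-head illustration and not the paper's code: it assumes the tokens are already grouped into regions, uses the mean token of each region as its region-level descriptor, and omits linear projections and any other terms the full BiFormer block may include. The names `bilevel_routing_attention` and `topk` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bilevel_routing_attention(q, k, v, topk):
    """Sparse attention with region-level routing (simplified sketch).

    q, k, v: arrays shaped (R, T, C) for R regions, T tokens per region,
             and C channels.
    topk:    number of regions each query region attends to.
    Returns the attended features (R, T, C) and the routing indices (R, topk).
    """
    num_regions, tokens, channels = q.shape

    # Region-to-region routing: one descriptor per region (assumed: token mean).
    q_region = q.mean(axis=1)            # (R, C)
    k_region = k.mean(axis=1)            # (R, C)
    affinity = q_region @ k_region.T     # (R, R)

    # Keep the top-k most related regions for each query region.
    idx = np.argsort(-affinity, axis=1)[:, :topk]   # (R, topk)

    # Gather key/value tokens from the routed regions into dense blocks.
    k_gathered = k[idx].reshape(num_regions, topk * tokens, channels)
    v_gathered = v[idx].reshape(num_regions, topk * tokens, channels)

    # Token-to-token attention as dense batched matrix multiplications.
    scores = q @ k_gathered.transpose(0, 2, 1) / np.sqrt(channels)
    attn = softmax(scores, axis=-1)      # (R, T, topk * T)
    return attn @ v_gathered, idx
```

Setting `topk` equal to the number of regions recovers ordinary full attention over all tokens, which is a convenient sanity check. Smaller values of `topk` shrink the key/value set per query region from `R * T` to `topk * T` tokens while keeping the computation as dense matrix products.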