H3DE-Net: Efficient and Accurate 3D Landmark Detection in Medical Imaging

Zhen Huang, Tao Tang, Ronghao Xu, Yangbo Wei, Wenkai Yang, Suhua Wang, Xiaoxin Sun, Han Li, Qingsong Yao

TL;DR

The paper tackles 3D landmark detection in medical imaging, where preserving fine-grained local features while modeling global spatial context is computationally challenging. It proposes H3DE-Net, a CNN–Transformer hybrid that uses Volumetric Bi‑Routing Attention (V‑BRA) to capture global dependencies with reduced cost, complemented by a Super-Resolution Block and a Feature Fusion Module for precise localization. The work introduces two architectures—Anchor-Based and Anchor-Free—and provides detailed loss formulations for each. Experiments on a public skull CT dataset show state-of-the-art accuracy and robustness, including scenarios with missing landmarks, demonstrating potential for reliable clinical deployment. The method achieves notable improvements in mean radial error and detection rates across complete, incomplete, and all-case datasets, highlighting its practical value for 3D anatomical localization.
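The two metrics named above, mean radial error and detection rate, can be sketched as follows. This is a minimal illustration of the conventional definitions, not the authors' evaluation code: the function names, the `spacing` default, the `present` mask for missing landmarks, and the threshold value are all assumptions made for the example.

```python
import numpy as np

def radial_errors(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """Per-landmark Euclidean distance in physical units.

    pred, gt: arrays of shape (num_landmarks, 3) in voxel coordinates.
    spacing:  assumed voxel size along each axis (hypothetical default).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    diff = (pred - gt) * np.asarray(spacing, dtype=float)
    return np.linalg.norm(diff, axis=-1)

def mean_radial_error(pred, gt, present=None, spacing=(1.0, 1.0, 1.0)):
    """Mean radial error over the landmarks flagged as present."""
    err = radial_errors(pred, gt, spacing)
    if present is not None:
        # Assumption: landmarks missing from the scan are excluded.
        err = err[np.asarray(present, dtype=bool)]
    return float(err.mean())

def detection_rate(pred, gt, threshold, present=None, spacing=(1.0, 1.0, 1.0)):
    """Fraction of present landmarks whose radial error is within `threshold`."""
    err = radial_errors(pred, gt, spacing)
    if present is not None:
        err = err[np.asarray(present, dtype=bool)]
    return float((err <= threshold).mean())

# Toy data: three landmarks, the third one missing from the scan.
pred = [[10, 10, 10], [20, 20, 20], [0, 0, 0]]
gt = [[10, 10, 13], [20, 24, 23], [5, 5, 5]]
present = [True, True, False]

print(mean_radial_error(pred, gt, present))    # errors 3 and 5 -> 4.0
print(detection_rate(pred, gt, 4.0, present))  # one of two within 4 -> 0.5
```

How missing landmarks enter the reported numbers is not specified in this summary; the masking shown here is only one plausible convention.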

Abstract

3D landmark detection is a critical task in medical image analysis, and accurately detecting anatomical landmarks is essential for subsequent medical imaging tasks. However, mainstream deep learning methods in this field struggle to capture fine-grained local features and model global spatial relationships at the same time while balancing accuracy against computational efficiency. Local feature extraction requires capturing fine-grained anatomical details, whereas global modeling requires understanding the spatial relationships within complex anatomical structures. The high dimensionality of 3D volumes further exacerbates these challenges: landmarks are sparsely distributed, leading to significant computational costs. Therefore, achieving efficient and precise 3D landmark detection remains a pressing challenge in medical image analysis. In this work, we propose the Hybrid 3D DEtection Net (H3DE-Net), a novel framework that combines CNNs for local feature extraction with a lightweight attention mechanism designed to efficiently capture global dependencies in 3D volumetric data. This mechanism employs a hierarchical routing strategy to reduce computational cost while maintaining global context modeling. To our knowledge, H3DE-Net is the first 3D landmark detection model that integrates such a lightweight attention mechanism with CNNs. Additionally, integrating multi-scale feature fusion further enhances detection accuracy and robustness. Experimental results on a public CT dataset demonstrate that H3DE-Net achieves state-of-the-art (SOTA) performance, significantly improving accuracy and robustness, particularly in scenarios with missing landmarks or complex anatomical variations. We have open-sourced our project, including code, data, and model weights.
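To give a rough sense of why a routing strategy reduces cost, the sketch below counts query-key pairs for dense attention versus attention restricted to the top-k routed regions. It is a back-of-the-envelope count under stated assumptions, not a figure from the paper: the feature-map size, region grid, and `top_k` are hypothetical, and linear projections and other layers are ignored.

```python
def dense_pairs(num_tokens):
    """Query-key pairs when every token attends to every token."""
    return num_tokens * num_tokens

def routed_pairs(num_tokens, num_regions, top_k):
    """Query-key pairs when each token attends only to top_k routed regions.

    Assumes equal-sized regions; adds the region-to-region affinity matrix.
    """
    tokens_per_region = num_tokens // num_regions
    routing = num_regions * num_regions                  # coarse region affinities
    attention = num_tokens * top_k * tokens_per_region   # fine token-level pairs
    return routing + attention

# Hypothetical setting: a 32x32x32 feature volume split into a 4x4x4 region grid.
num_tokens = 32 * 32 * 32   # 32768 tokens
num_regions = 4 * 4 * 4     # 64 regions of 512 tokens each
top_k = 4

print(dense_pairs(num_tokens))                       # 1073741824
print(routed_pairs(num_tokens, num_regions, top_k))  # 67112960
```

In this toy setting the routed variant evaluates roughly one sixteenth of the pairs, which is the kind of saving the hierarchical routing is meant to provide.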

Paper Structure

This paper contains 13 sections, 17 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: (a) Architecture of the original BiFormer block. (b) Illustration of region-to-region routing and token-to-token attention. Our method exploits sparsity by selecting key-value pairs only from the top-$k$ most relevant windows, avoiding unnecessary computation.
  • Figure 2: Overview of the proposed Hybrid-3D Network (H3DE-Net): Anchor-Based Architectures.
  • Figure 3: (a) shows a single-scale anchor design where $r = 0.5u$ and the number of anchors $n_a = 1$, while (b) shows a multi-scale anchor design where $r = 1u$ and $n_a = 3$. The multi-scale design enhances the coverage of regions surrounding partially missing landmarks, making the model more robust in handling irregular landmark distributions and incomplete data.
  • Figure 4: Landmark detection performance of H3DE-Net trained and tested on all datasets. The prefix 'A-' denotes the Anchor-Based method; anchors are added in the same way as shown in Fig. \ref{fig: network2}.
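
The region-to-region routing and token-to-token attention described in the Figure 1 caption can be sketched in NumPy as below. This is a simplified illustration of the general idea, not the paper's V-BRA module: it uses a single head, omits learned query/key/value projections, and the function names, region size, and `top_k` are assumptions chosen for the example.

```python
import numpy as np

def partition_volume(feat, region):
    """Split a (D, H, W, C) feature volume into non-overlapping 3D regions.

    Returns an array of shape (num_regions, tokens_per_region, C).
    """
    D, H, W, C = feat.shape
    d, h, w = region
    assert D % d == 0 and H % h == 0 and W % w == 0
    x = feat.reshape(D // d, d, H // h, h, W // w, w, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # region indices first, then in-region offsets
    return x.reshape(-1, d * h * w, C)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def routed_attention(tokens, top_k):
    """Attend from each region's tokens to the tokens of its top_k routed regions."""
    R, T, C = tokens.shape
    q = k = v = tokens  # simplification: no learned projections

    # Region-to-region routing: mean-pooled descriptors, then top-k by affinity.
    region_desc = tokens.mean(axis=1)                   # (R, C)
    affinity = region_desc @ region_desc.T              # (R, R)
    routes = np.argsort(-affinity, axis=1)[:, :top_k]   # (R, top_k)

    # Gather keys and values from the routed regions only.
    k_sel = k[routes].reshape(R, top_k * T, C)
    v_sel = v[routes].reshape(R, top_k * T, C)

    # Token-to-token attention within the gathered set.
    scores = q @ k_sel.transpose(0, 2, 1) / np.sqrt(C)  # (R, T, top_k * T)
    return softmax(scores) @ v_sel, routes

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 8, 16))   # toy (D, H, W, C) feature volume
tokens = partition_volume(feat, (4, 4, 4))  # 8 regions of 64 tokens each
out, routes = routed_attention(tokens, top_k=2)
print(tokens.shape, out.shape, routes.shape)
```

Setting `top_k` equal to the number of regions recovers dense attention over all tokens, which makes for a convenient sanity check on the gather step.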