DenseSeg: Joint Learning for Semantic Segmentation and Landmark Detection Using Dense Image-to-Shape Representation

Ron Keuth; Lasse Hansen; Maren Balks; Ronja Jäger; Anne-Nele Schröder; Ludger Tüshaus; Mattias Heinrich

TL;DR

DenseSeg addresses the challenge of simultaneously performing semantic segmentation and landmark detection in medical images by introducing a dense image-to-shape representation based on $uv$-maps. It uses a two-head UNet to jointly predict segmentation and $uv$-maps, optimizing a multi-term loss composed of $L_\text{BCE}$, $L_\varphi$, $L_\text{LM}$, and $L_\text{TV}$. This yields explicit anatomical correspondences without requiring landmark-specific training. The method achieves competitive landmark accuracy on the JSRT thorax dataset and superior accuracy on the Graz pediatric wrist dataset, and it allows new landmarks to be added without retraining. These results illustrate the value of a dense geometric representation for challenging landmark detection tasks and suggest potential for extension to additional anatomical structures and clinical applications.
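To make the four-term objective more concrete, here is a minimal NumPy sketch of how such a loss could be assembled. The summary above only names the terms, so every definition here is an assumption: `L_BCE` as binary cross-entropy on the segmentation, `L_phi` as a masked L1 error on the $uv$-map, `L_LM` as the mean Euclidean landmark error, `L_TV` as an anisotropic total-variation penalty on the predicted $uv$-map, and equal weights. The function names are hypothetical.

```python
import numpy as np

def bce(prob, target, eps=1e-7):
    # Binary cross-entropy averaged over all pixels (assumed form of L_BCE).
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))

def total_variation(uv):
    # Mean absolute difference between neighbouring pixels of a (2, H, W) uv-map
    # (assumed form of L_TV).
    dy = np.abs(uv[:, 1:, :] - uv[:, :-1, :]).mean()
    dx = np.abs(uv[:, :, 1:] - uv[:, :, :-1]).mean()
    return float(dx + dy)

def joint_loss(seg_prob, seg_gt, uv_pred, uv_gt, lm_pred, lm_gt,
               weights=(1.0, 1.0, 1.0, 1.0)):
    # seg_prob, seg_gt: (H, W); uv_pred, uv_gt: (2, H, W); lm_pred, lm_gt: (N, 2).
    # The weights are placeholders, not values from the paper.
    mask = seg_gt.astype(bool)
    l_bce = bce(seg_prob, seg_gt)
    # L1 uv error restricted to pixels inside the structure (assumed form of L_phi).
    l_phi = float(np.abs(uv_pred - uv_gt)[:, mask].mean()) if mask.any() else 0.0
    # Mean Euclidean distance between landmark sets (assumed form of L_LM).
    l_lm = float(np.linalg.norm(lm_pred - lm_gt, axis=1).mean())
    l_tv = total_variation(uv_pred)
    w_bce, w_phi, w_lm, w_tv = weights
    return w_bce * l_bce + w_phi * l_phi + w_lm * l_lm + w_tv * l_tv
```

In this sketch the total-variation term is the only one that needs no ground truth: it only rewards a smooth predicted $uv$-map.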

Abstract

Purpose: Semantic segmentation and landmark detection are fundamental tasks of medical image processing, facilitating further analysis of anatomical objects. Although deep learning-based pixel-wise classification has set a new state of the art for segmentation, it falls short in landmark detection, a strength of shape-based approaches. Methods: In this work, we propose a dense image-to-shape representation that enables the joint learning of landmarks and semantic segmentation by employing a fully convolutional architecture. Because it represents anatomical correspondences, our method allows arbitrary landmarks to be extracted. We benchmark our method against the state of the art for semantic segmentation (nnUNet), a shape-based approach employing geometric deep learning, and a convolutional neural network-based method for landmark detection. Results: We evaluate our method on two medical datasets: one common benchmark featuring the lungs, heart, and clavicle from thorax X-rays, and another with 17 different bones in the paediatric wrist. While our method is on par with the landmark detection baseline in the thorax setting (error in mm of $2.6\pm0.9$ vs $2.7\pm0.9$), it substantially surpasses it in the more complex wrist setting ($1.1\pm0.6$ vs $1.9\pm0.5$). Conclusion: We demonstrate that a dense geometric shape representation is beneficial for challenging landmark detection tasks and outperforms the previous state of the art, which uses heatmap regression. It also does not require explicit training on the landmarks themselves, so new landmarks can be added without retraining.
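The claim that arbitrary landmarks can be read out of a dense correspondence map can be illustrated with a small lookup. The sketch below assumes a predicted $uv$-map stored as a `(2, H, W)` array, a binary mask for the structure, and a landmark defined by its $uv$ coordinate on the template; it returns the pixel whose $uv$ value is closest to the query. The function name `extract_landmark` and the nearest-value rule are assumptions for illustration, not necessarily the extraction procedure used in the paper.

```python
import numpy as np

def extract_landmark(uv_map, mask, uv_query):
    """Return (row, col) of the masked pixel whose uv value is nearest to uv_query."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        raise ValueError("mask is empty, no landmark can be extracted")
    values = uv_map[:, rows, cols].T              # (K, 2) uv values of masked pixels
    sq_dist = np.sum((values - uv_query) ** 2, axis=1)
    best = int(np.argmin(sq_dist))
    return int(rows[best]), int(cols[best])

# Toy uv-map: u grows with the column index, v with the row index.
H, W = 8, 8
v, u = np.meshgrid(np.arange(H) / (H - 1), np.arange(W) / (W - 1), indexing="ij")
uv_map = np.stack([u, v])                          # (2, H, W)
mask = np.ones((H, W), dtype=bool)

# A landmark placed at template coordinate (u, v) = (3/7, 5/7).
print(extract_landmark(uv_map, mask, np.array([3 / 7, 5 / 7])))  # (5, 3)
```

Since the query is just a coordinate in template space, adding a new landmark in this sketch means adding a new `uv_query`, with no change to the predicted map.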

Paper Structure (24 sections, 5 equations, 5 figures, 2 tables)

Introduction and Related Work
Landmark Detection
Semantic Segmentation
Dense Representation for Landmark Detection
Methods
Problem Definition
Generation of $uv$-Maps
Landmark Extraction from $uv$-Maps
Loss Function
Experiments
Datasets
Bone Age Regression with Landmarks
Addition of New Landmarks without Retraining
Network Architecture
Training and Hyperparameter Search
...and 9 more sections

Figures (5)

Figure 1: $uv$-Map generation: the corresponding landmarks $\mathbf{L}_{\omega_n}$ of an anatomical structure $\omega_n$ are aligned to a template $\mathbf{T}_{\omega_n}$, yielding a sparse displacement field $\mathbf{\Theta}_{\omega_n}$, which is interpolated to a dense one $\varphi_{\omega_n}$. The final step involves warping the template's $uv$-map $\mathbf{U}_{\omega_n}\circ\varphi_{\omega_n}$ to generate the $uv$-map $\mathbf{U}_{\omega_n}'$ that will be used for supervision during training. $\textcolor{orange}{\blacklozenge}$ markers in $\mathbf{T}$ correspond to unknown landmarks not used in training (more details in Sec. \ref{sec:unknown_landmarks}).
Figure 2: Two-head UNet for joint semantic segmentation $f_s$ and $uv$-mapping $f_\varphi$. A blue box symbolizes a residual block that contains two residual units with the label indicating the number of its output channels. The UNet extracts image features, thereby creating a canonical feature space both heads can utilize for segmentation and $uv$-mapping, respectively.
Figure 3: Qualitative result on JSRT with the best, median and worst test case (from left to right). ASD and TRE are provided in mm. Various colors are used to distinguish different anatomical structures, with small red dots indicating the ground truth.
Figure 4: Qualitative result on Graz with the best, median and worst test case (from left to right). ASD and TRE are provided in pixels. Various colors are used to distinguish different anatomical structures, with small red dots indicating the ground truth.
Figure 5: Results for bone age regression with landmarks provided by different methods. Left: the error plot when using landmarks generated by our method. Right: mean absolute age estimation error (MAE) and $R^2$ correlation score.
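The pipeline in the Figure 1 caption (sparse displacements at landmarks, interpolation to a dense field, warping of the template $uv$-map) can be sketched in a few lines of NumPy. The caption does not say which interpolation is used, so this sketch swaps in inverse-distance weighting and nearest-neighbour sampling as simple stand-ins. The function names and the `(row, col)` coordinate convention are assumptions.

```python
import numpy as np

def dense_displacement(landmarks, template, shape, power=2.0, eps=1e-8):
    # Sparse displacements point from each image landmark to its template position.
    sparse = template - landmarks                               # (N, 2), (row, col)
    rr, cc = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]), indexing="ij")
    grid = np.stack([rr, cc], axis=-1).astype(float)            # (H, W, 2)
    dist = np.linalg.norm(grid[:, :, None, :] - landmarks[None, None], axis=-1)
    weights = 1.0 / (dist ** power + eps)                       # (H, W, N)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Inverse-distance-weighted average of the sparse displacements (a stand-in).
    return grid, weights @ sparse                               # (H, W, 2)

def warp_uv(template_uv, landmarks, template, shape):
    # Sample the template uv-map at phi(x) = x + d(x) for every image pixel x.
    grid, disp = dense_displacement(landmarks, template, shape)
    src = np.rint(grid + disp).astype(int)
    r = np.clip(src[..., 0], 0, template_uv.shape[1] - 1)
    c = np.clip(src[..., 1], 0, template_uv.shape[2] - 1)
    return template_uv[:, r, c]                                 # (2, H, W)

# Toy example: the image landmarks are the template landmarks shifted 2 columns right.
H, W = 10, 10
v, u = np.meshgrid(np.arange(H) / (H - 1), np.arange(W) / (W - 1), indexing="ij")
template_uv = np.stack([u, v])
template = np.array([[2.0, 2.0], [2.0, 6.0], [6.0, 4.0]])
landmarks = template + np.array([0.0, 2.0])
warped = warp_uv(template_uv, landmarks, template, (H, W))
```

In this toy case the displacement is a pure translation, so `warped[:, 4, 5]` equals `template_uv[:, 4, 3]`: the image pixel inherits the $uv$ value of the template position it corresponds to.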
