SA$^{2}$Net: Scale-Adaptive Structure-Affinity Transformation for Spine Segmentation from Ultrasound Volume Projection Imaging

Hao Xie, Zixun Huang, Yushen Zuo, Yakun Ju, Frank H. F. Leung, N. F. Law, Kin-Man Lam, Yong-Ping Zheng, Sai Ho Ling

TL;DR

This work addresses the challenge of accurate spine segmentation from radiation-free ultrasound volume projection imaging by introducing SA$^{2}$Net, a scale-adaptive framework that fuses cross-dimensional channel-spatial attention with a structure-aware Transformer decoder. The model combines a scale-adaptive channel-spatial attention module (SACSAM) to capture long-range dependencies with a structure-affinity transformation (via a Transformer-based structure-aware module, SAM) that imposes class-specific anatomical priors on semantic features. A feature mixing loss aggregation further strengthens training by supervising multiple predictions and encouraging robust, structure-conscious segmentation. Empirical results on a spine ultrasound VPI dataset demonstrate state-of-the-art performance across CNN and Transformer backbones, with notable improvements in boundary delineation and inter-class separability, suggesting strong potential for clinical automated scoliosis diagnosis and monitoring.

Abstract

Spine segmentation based on ultrasound volume projection imaging (VPI) plays a vital role in intelligent scoliosis diagnosis in clinical applications. However, this task faces several significant challenges. First, the global contextual knowledge of spines may not be well learned if the high spatial correlation of different bone features is neglected. Second, the spine bones contain rich structural knowledge regarding their shapes and positions, which deserves to be encoded into the segmentation process. To address these challenges, we propose a novel scale-adaptive structure-aware network (SA$^{2}$Net) for effective spine segmentation. First, we propose a scale-adaptive complementary strategy to learn the cross-dimensional long-distance correlation features for spinal images. Second, motivated by the consistency between multi-head self-attention in Transformers and semantic-level affinity, we propose a structure-affinity transformation that transforms semantic features with class-specific affinity, and we combine it with a Transformer decoder for structure-aware reasoning. In addition, we adopt a feature mixing loss aggregation method to enhance model training, which improves the robustness and accuracy of the segmentation process. The experimental results demonstrate that our SA$^{2}$Net achieves superior segmentation performance compared to other state-of-the-art methods. Moreover, the adaptability of SA$^{2}$Net to various backbones enhances its potential as a promising tool for advanced scoliosis diagnosis using intelligent spinal image analysis. The code and experimental demo are available at https://github.com/taetiseo09/SA2Net.
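
The summary above says the loss aggregation supervises multiple predictions during training. As a rough illustration of that general idea only, the sketch below sums a per-pixel cross-entropy over several prediction maps that share one ground-truth mask. The function names, the uniform default weights, and the choice of cross-entropy are assumptions made for this example; the paper's actual feature mixing formulation is not described here and may differ.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def pixel_cross_entropy(logits, target):
    """Mean cross-entropy for logits of shape (H, W, K) and integer labels (H, W)."""
    probs = softmax(logits, axis=-1)
    h, w = target.shape
    # Pick the predicted probability of the true class at every pixel.
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], target]
    return float(-np.log(picked + 1e-12).mean())

def aggregate_losses(predictions, target, weights=None):
    """Weighted sum of per-prediction losses (hypothetical aggregation rule)."""
    if weights is None:
        weights = [1.0] * len(predictions)  # assumption: equal weighting
    return sum(w * pixel_cross_entropy(p, target)
               for w, p in zip(weights, predictions))

# Two uninformative predictions over 4 classes on an 8x8 mask.
target = np.zeros((8, 8), dtype=int)
preds = [np.zeros((8, 8, 4)), np.zeros((8, 8, 4))]
total = aggregate_losses(preds, target)
```

With uniform logits each term equals $\ln 4$, so `total` is about $2\ln 4$; any prediction that moves probability toward the true class lowers its term.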

Paper Structure

This paper contains 27 sections, 19 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: An illustration of spine segmentation from ultrasound VPI images. (a) Spinal 3D ultrasound volume data, which are in the form of an ultrasound sequence of 2D slices; (b) One extracted ultrasound VPI image based on the projection on the 2D coronal plane; and (c) Different bone features in the spinal image. The segmented rib and thoracic process are painted red and green, respectively. The lump, which is formed by the combined shadow of the partial bilateral inferior articular process, laminae, and the superior articular process of the inferior vertebrae, is painted blue.
  • Figure 2: (a) An overview of the proposed SA$^{2}$Net. $x$ represents an input spinal image, while $\hat{y}$ denotes the predicted segmentation result. $\mathcal{L}_{total}$ is the loss optimized during training; (b) An illustration of the specially designed Transformer decoder with cross-attention. The query features are input into this Transformer decoder and are updated with multi-scale image features, generating semantic class masks.
  • Figure 3: Details of the scale-adaptive channel-spatial attention module (SACSAM). $B$ denotes the batch size, $C$ represents the number of channels, and $H$ and $W$ correspond to the height and width of the input feature map $X$, respectively.
  • Figure 4: The overall structure of the Structure-Affinity Transformation module. The key is to calculate structure-affinity attention weights for structure-aware features, with pixel-level classification confidence applied to class-specific affinity.
  • Figure 5: Qualitative spine bone segmentation comparisons on ultrasound VPI images based on different methods. The segmented rib, thoracic process, and lump are annotated in red, green, and blue, respectively. The yellow rectangular box highlights the area around the boundary between the thoracic and lumbar regions, while the red rectangular box highlights a part of the lumbar vertebra. The orange circle marks the defective parts of the segmentation results.
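
The Figure 4 caption describes attention weights built from pixel-level classification confidence and class-specific affinity. One simple way to picture this is sketched below: pixels that are confidently assigned to the same class receive high mutual affinity, and each pixel's feature is replaced by an affinity-weighted average of all pixel features. This is a minimal sketch under stated assumptions, not the authors' module: the function name, the outer-product affinity, and the row normalization are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def structure_affinity_transform(features, class_logits):
    """Hypothetical affinity-weighted feature transform.

    features:     (N, C) flattened pixel features
    class_logits: (N, K) per-pixel class scores
    returns:      (N, C) transformed features
    """
    confidence = softmax(class_logits, axis=-1)   # (N, K) pixel-level confidence
    affinity = confidence @ confidence.T          # (N, N) same-class agreement
    weights = affinity / affinity.sum(axis=1, keepdims=True)  # rows sum to 1
    return weights @ features                     # affinity-weighted average

# Four pixels: the first two confidently class 0, the last two class 1.
logits = np.array([[50.0, 0.0], [50.0, 0.0], [0.0, 50.0], [0.0, 50.0]])
feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 6.0]])
out = structure_affinity_transform(feats, logits)
```

In this toy case each pixel ends up with the mean feature of its own class, which is the sense in which such a transform makes features more consistent within a structure and more separable across structures.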