Swin-X2S: Reconstructing 3D Shape from 2D Biplanar X-ray with Swin Transformers

Kuan Liu, Zongyuan Ying, Jie Jin, Dongyan Li, Ping Huang, Wenjian Wu, Zhe Chen, Jin Qi, Yong Lu, Lianfu Deng, Bo Chen

TL;DR

This paper addresses the ill-posed problem of reconstructing 3D bone shapes from 2D biplanar X-rays by introducing Swin-X2S, an end-to-end architecture that uses a 2D Swin Transformer encoder, a dimension-expanding bridge, and a 3D U-Net decoder with cross-view attention to produce 3D segmentation and labeling. It is evaluated across nine public datasets covering four anatomies and 54 categories. Because real 2D–3D paired datasets are scarce, the paired X-ray inputs are generated as digitally reconstructed radiographs (DRRs). The method shows state-of-the-art performance on segmentation, labeling, and clinically relevant morphometry metrics. Ablation studies demonstrate the critical role of cross-view fusion, skip connections, cross losses, pre-training, and data augmentation, and examine how the number of DRR views affects performance. The results suggest that Swin-X2S offers a practical, scalable option for anatomy-aware 3D reconstruction in clinical workflows, with potential for real-time visualization and extension to non-osseous structures, provided larger, more diverse paired datasets become available.

Abstract

The conversion from 2D X-ray to 3D shape holds significant potential for improving diagnostic efficiency and safety. However, existing reconstruction methods often rely on hand-crafted features, manual intervention, and prior knowledge, resulting in unstable shape errors and additional processing costs. In this paper, we introduce Swin-X2S, an end-to-end deep learning method for directly reconstructing 3D segmentation and labeling from 2D biplanar orthogonal X-ray images. Swin-X2S employs an encoder-decoder architecture: the encoder leverages a 2D Swin Transformer for X-ray information extraction, while the decoder employs 3D convolution with cross-attention to integrate structural features from orthogonal views. A dimension-expanding module is introduced to bridge the encoder and decoder, ensuring a smooth conversion from 2D pixels to 3D voxels. We evaluate the proposed method through extensive qualitative and quantitative experiments across nine publicly available datasets covering four anatomies (femur, hip, spine, and rib), with a total of 54 categories. Significant improvements over previous methods are observed not only in the segmentation and labeling metrics but also in the clinically relevant parameters that are of primary concern in practical applications, which demonstrates the promise of Swin-X2S as an effective option for anatomical shape reconstruction in clinical scenarios. Code implementation is available at: https://github.com/liukuan5625/Swin-X2S.
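To make the "2D pixels to 3D voxels" step more concrete, here is a toy sketch of one naive way to lift two orthogonal 2D maps into a shared 3D grid. It is not the paper's dimension-expanding module, which is learned; the function names, the `[z][y][x]` axis convention, and the simple averaging fusion are all assumptions made for illustration.

```python
# Toy sketch only: the Swin-X2S dimension-expanding module is learned.
# Here each 2D map is simply repeated along its missing axis, and the two
# resulting volumes are averaged. All names and conventions are hypothetical.

def expand_coronal(feat, depth):
    """Lift a coronal map feat[z][x] to vol[z][y][x] by repeating along y."""
    return [[list(row) for _ in range(depth)] for row in feat]

def expand_sagittal(feat, width):
    """Lift a sagittal map feat[z][y] to vol[z][y][x] by repeating along x."""
    return [[[value] * width for value in row] for row in feat]

def fuse(vol_a, vol_b):
    """Average two equally shaped volumes voxel by voxel."""
    return [
        [[(a + b) / 2 for a, b in zip(row_a, row_b)]
         for row_a, row_b in zip(plane_a, plane_b)]
        for plane_a, plane_b in zip(vol_a, vol_b)
    ]

coronal = [[1.0, 2.0], [3.0, 4.0]]                      # indexed [z][x]
sagittal = [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]]     # indexed [z][y]

# Both views are expanded to the same 2 x 3 x 2 grid before fusion.
volume = fuse(expand_coronal(coronal, depth=3),
              expand_sagittal(sagittal, width=2))
```

In the architecture described above, this fixed averaging is replaced by learned upscaling and cross-attention between the two views; the sketch only shows why both views must be brought onto a common voxel grid before they can be combined.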

Paper Structure (23 sections, 9 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: The overall architecture of the proposed Swin-X2S network for CT segmentation and labeling from biplanar X-ray images ($N=1$). Swin-X2S takes paired inputs: the coronal-view and sagittal-view X-ray images. The network generates 2D pyramid features via transformers, which are passed to dimension expansion modules (right dashed box) for upscaling and then reconstructed through a 3D U-shaped convolutional network.
  • Figure 2: An example of the coronal group and sagittal group generated with the DRR method, where the number of projection views $N$ is set to 2.
  • Figure 3: Quantitative reconstruction results of the proposed Swin-X2S-Base network with biplanar inputs. Top-left panel: results on the five TotalSegmentator subsets (femur, pelvis, spine, rib, and all) for a single test sample. The first and second rows show the coronal and sagittal views, respectively. The first and last columns denote the biplanar inputs and ground truth; the other columns illustrate the segmentation results for different anatomies. The blue, orange, green, and purple numbers represent Dice, HD, L-error, and ID-rate, respectively. Top-right panel: result on the RibSeg v2 dataset for a single test sample. Bottom-left panel: result on CTPelvic1K. Bottom-center panel: result on CTSpine1K. Bottom-right panel: result on VerSe'19.
  • Figure 4: Reconstruction failure samples of the Swin-X2S network on VerSe'19. The first, middle, and last columns denote the biplanar inputs, prediction results, and ground truth, respectively.
  • Figure 5: Comparison of different methods on the CTPelvic1K, CTSpine1K, and TotalSegmentator-All datasets with biplanar inputs. For each dataset, the first and last columns denote the biplanar inputs and ground truth; the other columns illustrate the prediction results of the different methods.
  • ...and 3 more figures
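
The coronal and sagittal DRR inputs mentioned in the captions above can be illustrated with a heavily simplified sketch. It assumes a parallel-beam projection that just sums voxel values along one axis; real DRR generation casts rays through the CT volume with an attenuation model, and the paper's own pipeline may differ. The `project` helper and the `[z][y][x]` layout (y as the anterior-posterior axis, x as the left-right axis) are hypothetical.

```python
# Simplified parallel-projection stand-in for a DRR (assumption: no ray
# casting, no attenuation model, just a sum of voxel values along one axis).

def project(volume, axis):
    """Sum a volume indexed [z][y][x] along axis 1 (y) or axis 2 (x)."""
    if axis == 1:
        # Collapse y: coronal-style image indexed [z][x].
        return [[sum(column) for column in zip(*plane)] for plane in volume]
    if axis == 2:
        # Collapse x: sagittal-style image indexed [z][y].
        return [[sum(row) for row in plane] for plane in volume]
    raise ValueError("axis must be 1 (y) or 2 (x)")

# Tiny hypothetical CT volume with shape 2 x 2 x 3 (z, y, x).
ct = [
    [[1, 2, 3], [4, 5, 6]],
    [[0, 0, 1], [0, 2, 0]],
]

coronal_drr = project(ct, axis=1)    # shape 2 x 3
sagittal_drr = project(ct, axis=2)   # shape 2 x 2
```

Each projection discards one spatial axis, which is why a single view leaves the 3D shape underdetermined and why the method pairs two orthogonal views.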