Swin-X2S: Reconstructing 3D Shape from 2D Biplanar X-ray with Swin Transformers
Kuan Liu, Zongyuan Ying, Jie Jin, Dongyan Li, Ping Huang, Wenjian Wu, Zhe Chen, Jin Qi, Yong Lu, Lianfu Deng, Bo Chen
TL;DR
This paper addresses the ill-posed problem of reconstructing 3D bone shapes from 2D biplanar X-rays by introducing Swin-X2S, an end-to-end architecture that uses a 2D Swin Transformer encoder, a dimension-expanding bridge, and a 3D U-Net decoder with cross-view attention to produce 3D segmentation and labeling. Because real 2D–3D paired datasets are limited, the method is trained and evaluated on paired X-ray data generated as digitally reconstructed radiographs (DRRs), across nine public datasets covering four anatomies and 54 categories. It shows state-of-the-art performance on segmentation, labeling, and clinically relevant morphometry metrics. Ablation studies demonstrate the critical role of cross-view fusion, skip connections, cross losses, pre-training, and data augmentation, and examine how the number of DRR views affects performance. The results suggest that Swin-X2S offers a practical, scalable option for anatomy-aware 3D reconstruction in clinical workflows, with potential for real-time visualization and extension to non-osseous structures, provided larger, more diverse paired datasets become available.
Abstract
The conversion from 2D X-ray to 3D shape holds significant potential for improving diagnostic efficiency and safety. However, existing reconstruction methods often rely on hand-crafted features, manual intervention, and prior knowledge, resulting in unstable shape errors and additional processing costs. In this paper, we introduce Swin-X2S, an end-to-end deep learning method for directly reconstructing 3D segmentation and labeling from 2D biplanar orthogonal X-ray images. Swin-X2S employs an encoder-decoder architecture: the encoder leverages a 2D Swin Transformer for X-ray information extraction, while the decoder employs 3D convolution with cross-attention to integrate structural features from orthogonal views. A dimension-expanding module is introduced to bridge the encoder and decoder, ensuring a smooth conversion from 2D pixels to 3D voxels. We evaluate the proposed method through extensive qualitative and quantitative experiments across nine publicly available datasets covering four anatomies (femur, hip, spine, and rib), with a total of 54 categories. Significant improvements over previous methods are observed not only in the segmentation and labeling metrics but also in the clinically relevant parameters that are of primary concern in practical applications, which demonstrates the promise of Swin-X2S as an effective option for anatomical shape reconstruction in clinical scenarios. Code implementation is available at: https://github.com/liukuan5625/Swin-X2S.
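To make the 2D-to-3D step described above more concrete, the following is a minimal NumPy sketch of one way a per-view 2D feature map could be expanded into a volume and then fused with the orthogonal view through cross-attention. It is an illustration under stated assumptions, not the authors' implementation: the function names `expand_to_3d` and `cross_attention`, the tiling-based expansion, the single-head attention, the residual sum, and the `(C, D, H, W)` layout are all hypothetical choices made here for clarity. The actual modules are in the linked repository.

```python
import numpy as np

def expand_to_3d(feat_2d, size, axis):
    """Tile a (C, A, B) feature map along a new spatial axis.

    Assumption: the missing dimension is filled by simple repetition,
    standing in for whatever learned expansion the real model uses.
    """
    vol = np.expand_dims(feat_2d, axis=axis)
    return np.repeat(vol, size, axis=axis)

def cross_attention(query_vol, context_vol):
    """Single-head scaled dot-product attention between two volumes.

    Both inputs have shape (C, D, H, W); every voxel is one token.
    Queries come from one view, keys and values from the other.
    """
    c = query_vol.shape[0]
    q = query_vol.reshape(c, -1).T            # (N, C) query tokens
    k = context_vol.reshape(c, -1).T          # (N, C) key tokens
    v = k                                     # values share the key tokens
    scores = q @ k.T / np.sqrt(c)             # (N, N) similarity
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over keys
    fused = weights @ v                       # (N, C)
    return fused.T.reshape(query_vol.shape)

# Toy features standing in for encoder outputs (random, for shape checks only).
rng = np.random.default_rng(0)
C, D, H, W = 4, 4, 4, 4
frontal_feat = rng.normal(size=(C, H, W))     # frontal view sees height x width
lateral_feat = rng.normal(size=(C, D, H))     # lateral view sees depth x height

frontal_vol = expand_to_3d(frontal_feat, D, axis=1)   # (C, D, H, W)
lateral_vol = expand_to_3d(lateral_feat, W, axis=3)   # (C, D, H, W)

# Residual fusion: frontal features attend to lateral features.
fused_vol = frontal_vol + cross_attention(frontal_vol, lateral_vol)
print(fused_vol.shape)
```

The sketch treats every voxel as a token, which is only feasible at toy resolution because the attention matrix grows quadratically with the number of voxels. A practical decoder would restrict or downsample the attention.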
