MaSkel: A Model for Human Whole-body X-rays Generation from Human Masking Images

Yingjie Xi; Boyuan Cheng; Jingyao Cai; Jian Jun Zhang; Xiaosong Yang

TL;DR

MaSkel tackles safe, noninvasive generation of whole-body X-ray references by learning a mapping from human masking images to pseudo-X-ray images. The method combines a two-stage training framework with MAE-based latent encoding and VQ-VAE decoding, guided by diffusion-augmented synthetic data to achieve anatomically coherent X-rays aligned with input poses. Quantitative metrics and clinician assessments demonstrate high structural fidelity and perceptual realism, while real-world clothed-image tests reveal current limitations and avenues for generalization. This work provides a scalable, noninvasive data source for medical education, digital anatomy, and ergonomic design, with plans to integrate a 3D skeleton model for expanded capabilities.

Abstract

Whole-body human X-rays could offer a valuable reference for various applications, including medical diagnostics, digital animation modeling, and ergonomic design. The traditional way of obtaining X-ray information requires CT (Computed Tomography) scanners, which emit potentially harmful radiation, so it lacks the adaptability and safety needed for practical applications. In this work, we propose a new method that directly generates 2D whole-body human X-rays from human masking images. The predicted images resemble real ones in both image style and anatomical structure. We adopt a data-driven strategy. By leveraging advanced generative techniques, our model MaSkel (Masking image to Skeleton X-rays) can generate a high-quality X-ray image from a human masking image without invasive and harmful radiation exposure, which not only provides a new path to generating anatomically detailed, customized data but also reduces health risks. To our knowledge, MaSkel is the first work on predicting whole-body X-rays. This paper covers two parts of the work. First, to address the data limitation problem, we use diffusion-based techniques for data augmentation, which provides two synthetic datasets for preliminary pretraining. Second, we design a two-stage training strategy to train MaSkel. Finally, we evaluate the generated X-rays qualitatively and quantitatively, and we invite professional doctors to assess the predicted data. These evaluations demonstrate MaSkel's superior ability to generate anatomically faithful X-rays from human masking images. The related code and dataset links are available at https://github.com/2022yingjie/MaSkel.

Paper Structure (15 sections, 9 equations, 8 figures, 1 table)

This paper contains 15 sections, 9 equations, 8 figures, 1 table.

Introduction
Related Work
Generating Human Skeleton
Diffusion-based Image Generation
Diffusion-based Image Super Resolution
MAE and VQ-VAEs
Method
Two-stage Training
The Structure of MaSkel
Experiments
Images Augmentation
The First-stage Training
The Second-stage Training
Evaluation on the Real-World Data
Conclusion

Figures (8)

Figure 1: Paired X-ray images (left), human soft-tissue images (middle), and masking images (right).
Figure 2: The left part shows the details of the first stage; the right part shows the structure of the second stage. In the first stage, we use the MAE strategy to train an encoder that compresses X-rays. In the second stage, the MAE encoder takes X-rays and outputs their latent representations as the ground-truth features, while a second encoder maps the masking images into a similar feature space. The masking features are then fed into the decoder, which generates X-ray images using the VQ-VAE method.
Figure 3: X-ray generation at $64\times64$ resolution. The upper row shows generated data and the lower row shows real images. The generated images have the same style and structure as the real images.
Figure 4: X-ray generation at $256\times256$ resolution after super-resolution augmentation. The upper row shows low-resolution data and the lower row shows the higher-resolution data. The high-resolution images have the same joints and bone connections as the low-resolution ones but with far more detail.
Figure 5: The first row shows the reconstructed data and the second row shows the original data. The reconstructed images are very close to the originals, which indicates that the encoder achieves low-loss compression.
...and 3 more figures
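
The second stage described in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: the feature vectors, the codebook, the choice of mean squared error as the alignment loss, and the helper names `mse` and `quantize` are all assumptions made for illustration. The sketch only shows the general idea of pulling a mask-encoder feature toward a ground-truth X-ray feature and then applying the standard VQ-VAE nearest-codebook lookup before decoding.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def quantize(feature, codebook):
    """Return (index, entry) of the codebook vector nearest to `feature`
    under squared Euclidean distance (standard VQ-VAE lookup)."""
    dists = [
        sum((f - c) ** 2 for f, c in zip(feature, entry))
        for entry in codebook
    ]
    idx = min(range(len(codebook)), key=dists.__getitem__)
    return idx, codebook[idx]


# Hypothetical toy values; real features would be high-dimensional tensors.
xray_feature = [0.9, 0.1, -0.4]  # stand-in for the MAE encoder output (target)
mask_feature = [0.7, 0.2, -0.5]  # stand-in for the mask encoder output
codebook = [
    [1.0, 0.0, -0.5],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.5],
]

# Assumed alignment objective: make the mask feature match the X-ray feature.
align_loss = mse(mask_feature, xray_feature)

# Discretize the mask feature; a decoder would consume `code` to render an X-ray.
idx, code = quantize(mask_feature, codebook)
```

In a real model, the alignment loss would be minimized over the mask encoder's parameters, and the selected codebook entries would be passed to a learned decoder. Whether the paper uses exactly this loss or lookup is not stated in the caption.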
