Feasibility Study of a Diffusion-Based Model for Cross-Modal Generation of Knee MRI from X-ray: Integrating Radiographic Feature Information

Zhe Wang, Yung Hsin Chen, Aladine Chetouani, Fabian Bauer, Yuhua Ru, Fang Chen, Liping Zhang, Rachid Jennane, Mohamed Jarraya

TL;DR

This study tackles the gap between knee X-ray accessibility and MRI's soft-tissue diagnostic detail by proposing a diffusion-based cross-modal framework to synthesize knee MRI volumes from X-ray inputs. The approach combines a classical conditional latent diffusion model with an AutoencoderKL-based latent space and a guidance module that injects target depth and patient-specific radiographic features, enabling 3D MRI volume generation from 2D inputs. Results show MRI slices generated by the method are visually closer to real MRI and exhibit improved region-specific fidelity (e.g., KOA-related features) compared with state-of-the-art diffusion baselines; ablations confirm the added value of radiographic guidance, and increasing inference steps improves inter-slice continuity. The work demonstrates a feasible, data-driven bridge between X-ray and MRI modalities, with potential to enhance access to MRI-like insights in resource-limited settings, while acknowledging that the generated MRI is not a clinical replacement and relies on large paired datasets and substantial compute.

Abstract

Knee osteoarthritis (KOA) is a prevalent musculoskeletal disorder, often diagnosed using X-rays because of their cost-effectiveness. While Magnetic Resonance Imaging (MRI) provides superior soft tissue visualization and serves as a valuable supplementary diagnostic tool, its high cost and limited accessibility significantly restrict its use. To explore the feasibility of bridging this imaging gap, we conducted a study leveraging a diffusion-based model that uses an X-ray image as conditional input, alongside target depth and additional patient-specific feature information, to generate the corresponding MRI sequences. Our findings demonstrate that the MRI volumes generated by our approach are visually closer to real MRI scans. Moreover, increasing the number of inference steps enhances the continuity and smoothness of the synthesized MRI sequences. Through ablation studies, we further validate that integrating supplementary patient-specific information, beyond what X-rays alone can provide, enhances the accuracy and clinical relevance of the generated MRI, which underscores the potential of leveraging external patient-specific information to improve MRI generation. This study is available at https://zwang78.github.io/.

Paper Structure

This paper contains 27 sections, 10 equations, 6 figures, 4 tables, 2 algorithms.

Figures (6)

  • Figure 1: The structure of the classical conditional latent diffusion model.
  • Figure 2: Flowchart of the proposed approach. The process starts by extracting a slice at position $d$ from a real MRI sequence $x$, denoted as $x_d$. This slice is encoded by the encoder $\mathcal{E}_1$ to produce a latent representation $z_0$, which is progressively noised following the distribution $q(z_t | z_{t-1})$, ultimately yielding the noised latent representation $z_T$. The corresponding X-ray image $y$ is used as a conditional input: it is encoded by $\mathcal{E}_2$ and concatenated with $z_T$ to form the initial input to the downsampling network. During denoising, the target depth $d$ and the patient-specific radiographic feature information $r$ are jointly embedded at various stages of the downsampling and upsampling phases, collectively providing denoising guidance. The denoising step is repeated for $T-1$ iterations to progressively refine the latent representation, resulting in the denoised latent $\hat{z}_0$, which is decoded by $\mathcal{D}$ to reconstruct the MRI slice $\hat{x}_d$ at depth $d$. During inference, the process begins with $z_T$ and is repeated $S$ times, once for each of the $S$ slices (channels) of the original MRI sequence. Finally, the full generated MRI sequence $\hat{x}$ is obtained by stacking all generated MRI slices in sequential order (i.e., $\hat{x} = \{\hat{x}_d\}_{d=0}^{S-1}$).
  • Figure 3: A standard knee plain radiograph with the identified knee joint highlighted in the red box (panel plainXray), and the identified knee joint (panel kneeROI).
  • Figure 4: Distribution of the number of slices for each type of MRI sequence (panel Distribution), and grey value distribution at each proportional position of slices within the T1-weighted MRI sequences (panel Distribution_).
  • Figure 5: Box plots of the performance of the evaluated approaches in terms of PSNR (panel PSNRR) and SSIM/RSSIM (panel SSIM_RSSIM).
  • ...and 1 more figure
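
The slice-wise inference procedure described in the Figure 2 caption can be sketched in a few lines of Python. This is only a minimal illustration of the control flow (sample $z_T$, concatenate it with the encoded X-ray, denoise under guidance from $d$ and $r$, decode, then stack the $S$ slices); it is not the authors' implementation. The functions `encode_xray`, `denoise_step`, and `decode` are hypothetical NumPy stand-ins for the trained networks $\mathcal{E}_2$, the denoising U-Net, and $\mathcal{D}$, and the image size, latent size, feature length, and step count are all assumed values.

```python
import numpy as np

# Hypothetical stand-ins for the trained networks; none of this is the paper's code.

def encode_xray(y):
    """Stand-in for E_2: 4x average pooling of the X-ray into a 1-channel latent."""
    h, w = y.shape
    return y.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))[None]

def denoise_step(z_in, t, d, r):
    """Stand-in for one guided denoising step.

    z_in is the noisy latent (channel 0) concatenated with the X-ray latent
    (channel 1); d is the target depth and r the radiographic feature vector.
    The step index t is unused by this toy update.
    """
    z_t, cond = z_in[:1], z_in[1:]
    guidance = 0.01 * d + 0.01 * r.mean()  # toy embedding of (d, r)
    return 0.9 * z_t + 0.1 * (cond + guidance)

def decode(z):
    """Stand-in for D: nearest-neighbour upsampling back to image resolution."""
    return np.repeat(np.repeat(z[0], 4, axis=0), 4, axis=1)

def generate_volume(y, r, num_slices, num_steps, rng):
    """Generate one slice per depth d and stack them into a volume."""
    cond = encode_xray(y)  # the X-ray is encoded once and reused for every slice
    slices = []
    for d in range(num_slices):
        z = rng.standard_normal(cond.shape)  # z_T sampled from N(0, I)
        for t in reversed(range(num_steps)):
            z = denoise_step(np.concatenate([z, cond], axis=0), t, d, r)
        slices.append(decode(z))  # x_hat_d
    return np.stack(slices, axis=0)  # x_hat = {x_hat_d}, d = 0..S-1

rng = np.random.default_rng(0)
y = rng.random((64, 64))   # assumed 64x64 X-ray
r = rng.random(8)          # assumed 8 radiographic features
volume = generate_volume(y, r, num_slices=5, num_steps=10, rng=rng)
print(volume.shape)  # (5, 64, 64)
```

Because each slice is generated independently from its own noise sample, inter-slice continuity is not enforced by construction in this sketch; the depth $d$ and the shared X-ray condition are the only links between neighbouring slices.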