SV-DRR: High-Fidelity Novel View X-Ray Synthesis Using Diffusion Model

Chun Xie, Yuichi Yoshii, Itaru Kitahara

TL;DR

This work tackles multi-view X-ray synthesis from a single view to reduce radiation and streamline workflows. It introduces SV-DRR, a view-conditioned Diffusion Transformer operating in the VAE latent space, with a weak-to-strong training regime and two conditioning streams to preserve anatomical fidelity across large viewpoint changes. The approach achieves superior quantitative performance over state-of-the-art methods and demonstrates realism levels indistinguishable from diffusion-based simulations in expert assessments. The densely sampled LIDC-IDRI-DRR dataset and the proposed conditioning framework enable robust high-resolution multi-view X-ray generation with practical implications for clinical education, data augmentation, and sparse-view imaging research.

Abstract

X-ray imaging is a rapid and cost-effective tool for visualizing internal human anatomy. While multi-view X-ray imaging provides complementary information that enhances diagnosis, intervention, and education, acquiring images from multiple angles increases radiation exposure and complicates clinical workflows. To address these challenges, we propose a novel view-conditioned diffusion model for synthesizing multi-view X-ray images from a single view. Unlike prior methods, which are limited in angular range, resolution, and image quality, our approach leverages the Diffusion Transformer to preserve fine details and employs a weak-to-strong training strategy for stable high-resolution image generation. Experimental results demonstrate that our method generates higher-resolution outputs with improved control over viewing angles. This capability has significant implications not only for clinical applications but also for medical education and data extension, enabling the creation of diverse, high-quality datasets for training and analysis. Our code is available at https://github.com/xiechun298/SV-DRR.

Paper Structure

This paper contains 15 sections, 2 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Overview of SV-DRR. Given a source X-ray image $I^S$ and relative target views $v^T$, SV-DRR synthesizes realistic X-ray projections $\hat{I}^T$ corresponding to $v^T$ from Gaussian noise $z_t$ via a denoising DiT. The denoising process takes place in the latent space of a VAE, where $\mathcal{E}$ and $\mathcal{D}$ denote the encoder and decoder, respectively.
  • Figure 2: Comparison of X-ray images synthesized by our method at different resolutions (256, 512, 1024) against baselines. Our method achieves superior fidelity in structure and orientation for both simulated DRR (left) and real X-ray (right) inputs. $\theta$ and $\phi$ denote the azimuthal and elevation angles of the synthesized view, respectively. The outputs of our method at different resolutions are normalized for better visualization. Additionally, the brightness and contrast of the XraySyn images were manually adjusted for better comparison, without introducing any negative effects on the results.
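
The data flow described in the Figure 1 caption (encode the source with $\mathcal{E}$, denoise a latent starting from Gaussian noise under source and view conditioning, decode with $\mathcal{D}$) can be sketched in a few lines. This is a toy illustration only, not the authors' implementation: `encode`, `decode`, `toy_target_latent`, and `oracle_denoiser` are hypothetical stand-ins for the VAE and the DiT, the deterministic DDIM-style update and the linear noise schedule are assumptions not stated in the text above, and images are reduced to flat lists of floats.

```python
import math
import random


def encode(image):
    """Toy stand-in for the VAE encoder E: average adjacent pixel pairs."""
    return [(image[i] + image[i + 1]) / 2.0 for i in range(0, len(image) - 1, 2)]


def decode(latent):
    """Toy stand-in for the VAE decoder D: repeat each latent value twice."""
    return [v for v in latent for _ in range(2)]


def toy_target_latent(source_latent, view):
    """Hypothetical 'true' target latent: a circular shift tied to the azimuth.
    A real model has no such closed form; it is learned from data."""
    theta, _phi = view
    k = int(round(theta / 90.0)) % len(source_latent)
    return source_latent[k:] + source_latent[:k]


def oracle_denoiser(z_t, alpha_bar, source_latent, view):
    """Stand-in for the DiT: returns the noise that is consistent with the
    toy target latent, conditioned on the source latent and the view."""
    z0 = toy_target_latent(source_latent, view)
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(zt - a * z) / s for zt, z in zip(z_t, z0)]


def synthesize(source_image, view, steps=10, seed=0):
    """Encode the source, run deterministic DDIM-style denoising, then decode."""
    rng = random.Random(seed)
    source_latent = encode(source_image)

    def alpha_bar(t):
        # Assumed linear schedule: alpha_bar(0) = 1 (clean), small at t = steps.
        return 1.0 - t / (steps + 1)

    # Start from pure Gaussian noise in latent space.
    z = [rng.gauss(0.0, 1.0) for _ in source_latent]
    for t in range(steps, 0, -1):
        ab, ab_prev = alpha_bar(t), alpha_bar(t - 1)
        eps = oracle_denoiser(z, ab, source_latent, view)
        # Predict the clean latent, then step to the previous noise level.
        z0_pred = [(zt - math.sqrt(1.0 - ab) * e) / math.sqrt(ab)
                   for zt, e in zip(z, eps)]
        z = [math.sqrt(ab_prev) * z0 + math.sqrt(1.0 - ab_prev) * e
             for z0, e in zip(z0_pred, eps)]
    return decode(z)


source = [0.0, 2.0, 4.0, 6.0, 1.0, 3.0, 5.0, 7.0]
novel_view = synthesize(source, view=(90.0, 0.0))
```

Because the stand-in denoiser is an oracle, the loop converges to the toy target regardless of the initial noise. The point of the sketch is the order of operations and where the conditioning enters, not the quality of the result.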