Cascaded 3D Diffusion Models for Whole-body 3D 18-F FDG PET/CT synthesis from Demographics

Siyeop Yoon, Sifan Song, Pengfei Jin, Matthew Tivnan, Yujin Oh, Sekeun Kim, Dufan Wu, Xiang Li, Quanzheng Li

TL;DR

This paper tackles the challenge of generating realistic whole-body PET/CT images from demographic information to support digital twins, virtual trials, and data augmentation. It introduces a cascaded 3D diffusion framework that first generates a low-resolution, demographics-conditioned global volume with a score-based diffusion model, then refines it with a super-resolution residual diffusion model trained with patch-wise conditioning. Trained on the AutoPET dataset and evaluated on organ-wise volumes and SUV distributions, the approach agrees closely with real data: most deviations in metabolic uptake stay within 3-5% in subgroup analysis. The results suggest that decoupling global structure from high-frequency detail enables scalable, population-informed synthetic imaging and offers a viable alternative to conventional phantoms.

Abstract

We propose a cascaded 3D diffusion model framework to synthesize high-fidelity 3D PET/CT volumes directly from demographic variables, addressing the growing need for realistic digital twins in oncologic imaging, virtual trials, and AI-driven data augmentation. Unlike deterministic phantoms, which rely on predefined anatomical and metabolic templates, our method employs a two-stage generative process. An initial score-based diffusion model synthesizes low-resolution PET/CT volumes from demographic variables alone, providing global anatomical structures and approximate metabolic activity. This is followed by a super-resolution residual diffusion model that refines spatial resolution. Our framework was trained on 18-F FDG PET/CT scans from the AutoPET dataset and evaluated using organ-wise volume and standardized uptake value (SUV) distributions, comparing synthetic and real data between demographic subgroups. The organ-wise comparison demonstrated strong concordance between synthetic and real images. In particular, most deviations in metabolic uptake values remained within 3-5% of the ground truth in subgroup analysis. These findings highlight the potential of cascaded 3D diffusion models to generate anatomically and metabolically accurate PET/CT images, offering a robust alternative to traditional phantoms and enabling scalable, population-informed synthetic imaging for clinical and research applications.
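
The organ-wise SUV comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes the deviation is the relative difference between cohort-level mean SUVs per organ; the paper's exact definition is not given here, and the function name, organ names, and numbers below are hypothetical placeholders rather than results from the paper.

```python
from statistics import mean


def organ_suv_deviation(real, synthetic):
    """Percent deviation of the synthetic cohort's mean SUV from the real
    cohort's mean SUV, computed separately for each organ.

    Assumption: both arguments map an organ name to a list of per-subject
    mean SUV values for that organ.
    """
    deviations = {}
    for organ, real_values in real.items():
        real_mean = mean(real_values)
        synth_mean = mean(synthetic[organ])
        deviations[organ] = 100.0 * (synth_mean - real_mean) / real_mean
    return deviations


# Made-up illustrative values, NOT measurements from the paper.
real_suv = {"liver": [2.0, 2.2, 2.4], "brain": [6.0, 7.0, 8.0]}
synthetic_suv = {"liver": [2.1, 2.3, 2.2], "brain": [6.3, 7.35, 8.4]}

deviation = organ_suv_deviation(real_suv, synthetic_suv)
for organ, pct in deviation.items():
    print(f"{organ}: {pct:+.1f}%")
```

Running the same function once per demographic subgroup (for example, by sex or age band) would give the kind of subgroup-level comparison the abstract describes.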

Paper Structure

This paper contains 8 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Overview of the cascaded 3D diffusion framework for demographic-driven PET/CT synthesis. (A) During training, demographics guide the global diffusion model to generate a low-resolution 3D PET/CT volume, which is then interpolated and refined by a super-resolution diffusion model to produce high-resolution outputs. (B) For evaluation, demographics matched to the testing set are input to the same cascaded process. The resulting synthetic PET/CT volumes are compared with real data cohorts on organ-wise volume and SUV metrics.
  • Figure 2: Representative examples of 18-F FDG PET/CT generated using cascaded 3D diffusion models. The images show CT, 18-F FDG SUV, and 3D renderings for synthetic subjects of the same age (60 years) with different heights (male 175 cm vs. female 165 cm) and BMIs. The results show plausible anatomical and metabolic differences, including variations in adipose tissue distribution and PET signal heterogeneity.
  • Figure 3: Representative slices from the AutoPET dataset (left) compared with synthetic CT/PET generated by the flow-matching model (middle) and the diffusion model (right).
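
The data flow summarized in the Figure 1 caption (global low-resolution sample, interpolation, residual refinement) can be sketched as follows. This is only a shape-level illustration: both "models" are placeholder functions that return noise or zeros, nearest-neighbour repetition stands in for the interpolation step, and the volume sizes, upsampling factor, patch size, demographic fields, and patch-wise inference loop are all assumptions rather than details taken from the paper.

```python
import numpy as np


def sample_global_lowres(demographics, shape=(2, 16, 16, 32), seed=0):
    """Placeholder for the demographics-conditioned global diffusion model.

    Returns random noise shaped like a 2-channel (CT, PET) low-resolution
    volume; a real model would run a reverse diffusion process here.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape).astype(np.float32)


def upsample(volume, factor):
    """Nearest-neighbour upsampling of the three spatial axes
    (a simple stand-in for the interpolation step)."""
    for axis in (1, 2, 3):
        volume = np.repeat(volume, factor, axis=axis)
    return volume


def sample_residual_patch(conditioning_patch):
    """Placeholder for the super-resolution residual diffusion model.

    Returns a zero residual; a real model would sample high-frequency
    detail conditioned on the interpolated patch.
    """
    return np.zeros_like(conditioning_patch)


def refine_patchwise(upsampled, patch=16):
    """Add a sampled residual to each non-overlapping spatial patch."""
    refined = upsampled.copy()
    _, depth, height, width = upsampled.shape
    for z in range(0, depth, patch):
        for y in range(0, height, patch):
            for x in range(0, width, patch):
                block = upsampled[:, z:z + patch, y:y + patch, x:x + patch]
                refined[:, z:z + patch, y:y + patch, x:x + patch] = (
                    block + sample_residual_patch(block)
                )
    return refined


# Hypothetical demographic encoding, for illustration only.
demographics = {"age": 60, "sex": "male", "height_cm": 175}

low_res = sample_global_lowres(demographics)
interpolated = upsample(low_res, factor=4)
high_res = refine_patchwise(interpolated)
print(low_res.shape, "->", high_res.shape)
```

Predicting a residual on top of the interpolated volume means the second stage only has to model fine detail, which is one way to read the separation between global structure and high-frequency content described in the TL;DR.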