Towards Virtual Clinical Trials of Radiology AI with Conditional Generative Modeling

Benjamin D. Killeen, Bohua Wan, Aditya V. Kulkarni, Nathan Drenkow, Michael Oberst, Paul H. Yi, Mathias Unberath

TL;DR

This work introduces virtual clinical trials (VCTs) for radiology AI by developing a conditional, full-body CT synthesis model based on a latent diffusion framework. The system jointly models $p(\mathbf{X},\mathbf{Y})$ and $p(\mathbf{Z}_{\rm img},\mathbf{Z}_{\rm seg}|\mathbf{a})$ to generate anatomically consistent CT images and segmentations conditioned on demographic attributes $\mathbf{a}$. Through comprehensive evaluation (FID, Dice, organ-volume/centroid correlations, and conditioning fidelity), the authors demonstrate high realism and anatomical plausibility, enabling scalable VCTs for bias auditing and robustness assessment. Applying VCTs to body-fat and muscle-mass estimation tasks, they show that synthetic cohorts recapitulate real-world degradation and biases, outperform conventional weighting in detecting out-of-distribution (OOD) degradation, and reveal the attributes most predictive of errors. The results suggest VCTs can streamline proactive AI validation, help mitigate biases, and support safer deployment of radiology AI, with future work expanding conditioning and emergent properties at scale.
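
Read together with the Figure 2 caption, the sampling path behind these two distributions can be sketched in symbols. This is an illustration under assumed notation: it takes $\mathbf{X}$ to be the CT image and $\mathbf{Y}$ its segmentation, which is how the summary appears to use them, and the decoder names $D_{\rm img}$, $D_{\rm seg}$ and the parameter symbol $\theta$ are labels introduced here rather than symbols from the paper.

$$
[\mathbf{Z}_{\rm img}, \mathbf{Z}_{\rm seg}] \sim p_\theta(\mathbf{Z}_{\rm img}, \mathbf{Z}_{\rm seg} \mid \mathbf{a}), \qquad \hat{\mathbf{X}} = D_{\rm img}(\mathbf{Z}_{\rm img}), \qquad \hat{\mathbf{Y}} = D_{\rm seg}(\mathbf{Z}_{\rm seg})
$$

Because both latents come from a single joint draw, the decoded pair $(\hat{\mathbf{X}}, \hat{\mathbf{Y}})$ is a sample from the model's approximation of $p(\mathbf{X},\mathbf{Y})$ for patients with attributes $\mathbf{a}$.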

Abstract

Artificial intelligence (AI) is poised to transform healthcare by enabling personalized and efficient care through data-driven insights. Although radiology is at the forefront of AI adoption, in practice, the potential of AI models is often overshadowed by severe failures to generalize: AI models can have performance degradation of up to 20% when transitioning from controlled test environments to clinical use by radiologists. This mismatch raises concerns that radiologists will be misled by incorrect AI predictions in practice and/or grow to distrust AI, rendering these promising technologies practically ineffectual. Exhaustive clinical trials of AI models on abundant and diverse data are thus critical to anticipate AI model degradation when encountering varied data samples. Achieving these goals, however, is challenging due to the high costs of collecting diverse data samples and corresponding annotations. To overcome these limitations, we introduce a novel conditional generative AI model designed for virtual clinical trials (VCTs) of radiology AI, capable of realistically synthesizing full-body CT images of patients with specified attributes. By learning the joint distribution of images and anatomical structures, our model enables precise replication of real-world patient populations with unprecedented detail at this scale. We demonstrate meaningful evaluation of radiology AI models through VCTs powered by our synthetic CT study populations, revealing model degradation and facilitating algorithmic auditing for bias-inducing data attributes. Our generative AI approach to VCTs is a promising avenue towards a scalable solution to assess model robustness, mitigate biases, and safeguard patient care by enabling simpler testing and evaluation of AI models in any desired range of diverse patient populations.

Paper Structure

This paper contains 29 sections, 9 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: AI-based medical image analysis algorithms are susceptible to drops in performance when deployed on new populations. (a) The approval pipeline for medical image AI necessitates large cohort selection and costly data collection processes so as to ensure good performance across the given population. Performance may still decline when deployed on new populations \cite{oakden2020hidden}. (b) We propose a novel framework for medical image AI validation, where a conditional generative model provides full-body images with the same distribution of attributes, i.e., demographics or other characteristics, as the target population. This enables in silico clinical trials much earlier in the development pipeline, ensuring high performance on desired populations before real clinical trials.
  • Figure 2: A conditional generative model for full-body CT synthesis. (a) Two autoencoders are responsible for compressing the 3D image and segmentation to latent embeddings $\mathbf{Z}_{\rm img}$ and $\mathbf{Z}_{\rm seg}$, respectively. (b) A denoising diffusion model learns to sample the distribution for paired embeddings $\mathbf{Z} = [\mathbf{Z}_{\rm img}, \mathbf{Z}_{\rm seg}]$, conditioned on patient attributes $\mathbf{a}$. (c) During image synthesis, the diffusion model samples a random latent code $\mathbf{Z}$, which is decoded separately into the synthetic CT and corresponding segmentation.
  • Figure 3: Example outputs from the model. (a) A real image in the training set, in this case from a 66-year-old male measuring 180 cm and 70 kg, with an amputated right leg. (b) The corresponding VQ-VAE reconstruction of the image. (c) A synthetic sample conditioned to align with the same patient attributes (male, 50-60 years old, 170-180 cm, and 60-70 kg). Since missing limbs are not included in conditioning, the synthetic image reflects the general population rather than the corresponding case in the training set. (d) The synthetic segmentation generated alongside (c). An independent segmentation of the synthetic image (c) using TotalSegmentator \cite{wasserthal2023totalsegmentator}, with a corresponding class mapping.
  • Figure 4: Our model's fidelity to the conditioning categories for age, height, and weight. We show the distribution of measured values based on real CT images, the same images reconstructed with the VQ-VAE, and synthetic images sampled from the same conditioning attributes. The boxes show the quartiles, with whiskers extending to include all inliers. Outliers, as determined based on the inter-quartile range, are shown independently. Because all measurements are calculated using the CT image and an independent organ segmentation \cite{wasserthal2023totalsegmentator}, the conditioned and measured attributes may differ, even for real images. Nevertheless, the alignment between the measured values in synthetic and real images shows that our generative model's conditioning faithfully reflects the relevant properties in the real data.
  • Figure 5: Results of the VCT, including absolute error for BFP and MMP. See Table \ref{tab:z-scores} for complete quantitative results.
  • ...and 2 more figures
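
The data flow described in the Figure 2 caption (sample a paired latent code conditioned on attributes, then decode each half separately) can be sketched in a few lines of Python. This is a toy illustration of the pipeline's shape only, not the authors' implementation: the latent shape, the attribute encoding, the function names, and the stand-in "denoiser" and "decoders" are all assumptions made here, and a real system would replace them with a trained diffusion network and the two autoencoder decoders.

```python
import numpy as np

# Hypothetical latent size per modality: (channels, depth, height, width).
LATENT_SHAPE = (4, 8, 8, 8)


def encode_attributes(sex, age_bin, height_bin, weight_bin):
    """Pack attribute categories into a conditioning vector a (assumed encoding)."""
    return np.array([sex, age_bin, height_bin, weight_bin], dtype=np.float32)


def toy_denoise_step(z, a, t, num_steps):
    """Stand-in for one reverse-diffusion step; a trained network would go here."""
    target = np.full_like(z, a.mean() / 10.0)  # placeholder for a learned prediction
    alpha = 1.0 / (num_steps - t + 1)          # later steps move further toward it
    return z + alpha * (target - z)


def sample_paired_latents(a, num_steps=10, seed=0):
    """Draw Z = [Z_img, Z_seg] in one joint sampling run conditioned on a."""
    rng = np.random.default_rng(seed)
    # Stack both latents on the channel axis so they are denoised together.
    shape = (2 * LATENT_SHAPE[0],) + LATENT_SHAPE[1:]
    z = rng.standard_normal(shape).astype(np.float32)
    for t in range(num_steps):
        z = toy_denoise_step(z, a, t, num_steps)
    z_img, z_seg = np.split(z, 2, axis=0)
    return z_img, z_seg


def decode_image(z_img, upscale=4):
    """Stand-in image decoder: average channels, upsample by repetition."""
    volume = z_img.mean(axis=0)
    for axis in range(3):
        volume = np.repeat(volume, upscale, axis=axis)
    return volume


def decode_segmentation(z_seg, upscale=4):
    """Stand-in segmentation decoder: per-voxel argmax over channels gives labels."""
    labels = z_seg.argmax(axis=0)
    for axis in range(3):
        labels = np.repeat(labels, upscale, axis=axis)
    return labels


if __name__ == "__main__":
    a = encode_attributes(sex=1, age_bin=5, height_bin=17, weight_bin=6)
    z_img, z_seg = sample_paired_latents(a)
    ct = decode_image(z_img)
    seg = decode_segmentation(z_seg)
    print(ct.shape, seg.shape)
```

The point of the sketch is the ordering: the image and segmentation latents are sampled in a single joint run and only split at decoding time, which is what lets the synthetic CT and its segmentation stay anatomically paired.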