Evaluating the Clinical Impact of Generative Inpainting on Bone Age Estimation

Felipe Akio Matsuoka, Eduardo Moreno J. M. Farina, Augusto Sarquis Serpa, Soraya Monteiro, Rodrigo Ragazzini, Nitamar Abdala, Marcelo Straus Takahashi, Felipe Campos Kitamura

TL;DR

This study evaluates whether generative inpainting with a GPT-based image editor (gpt-image-1) preserves clinically relevant cues in pediatric hand radiographs. The authors create three inpainted variants per image to remove non-anatomical artifacts, then run bone age and gender predictors on both the original and the inpainted images. Inpainting substantially degrades downstream performance: bone age MAE rises from 6.26 to 30.11 months, and gender classification AUC drops from 0.955 to 0.704. Calibration only partially mitigates the errors, which indicates nonlinear distortions in the anatomical features the predictors rely on. The results highlight significant risks of latent bias and feature alteration when foundation-model inpainting is applied in clinical workflows, and underscore the need for rigorous, task-specific validation and domain-tailored edits.

Abstract

Generative foundation models can remove visual artifacts through realistic image inpainting, but their impact on medical AI performance remains uncertain. Pediatric hand radiographs often contain non-anatomical markers, and it is unclear whether inpainting these regions preserves features needed for bone age and gender prediction. To evaluate the clinical reliability of generative model-based inpainting for artifact removal, we used the RSNA Bone Age Challenge dataset, selecting 200 original radiographs and generating 600 inpainted versions with gpt-image-1 using natural language prompts to target non-anatomical artifacts. Downstream performance was assessed with deep learning ensembles for bone age estimation and gender classification, using mean absolute error (MAE) and area under the ROC curve (AUC) as metrics, and pixel intensity distributions to detect structural alterations. Inpainting markedly degraded model performance: bone age MAE increased from 6.26 to 30.11 months, and gender classification AUC decreased from 0.955 to 0.704. Inpainted images displayed pixel-intensity shifts and inconsistencies, indicating structural modifications not corrected by simple calibration. These findings show that, although visually realistic, foundation model-based inpainting can obscure subtle but clinically relevant features and introduce latent bias even when edits are confined to non-diagnostic regions, underscoring the need for rigorous, task-specific validation before integrating such generative tools into clinical AI workflows.
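The abstract reports MAE and AUC and notes that simple calibration does not fully correct the errors. The sketch below illustrates how those quantities could be computed and why a linear calibration removes a purely linear distortion but leaves a residual when the distortion is nonlinear. It is a minimal illustration, not the authors' evaluation code: the function names, the least-squares calibration, and all toy values are assumptions introduced here.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, in the units of y_true (months here)."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_linear_calibration(y_pred, y_true):
    """Least-squares slope and intercept mapping predictions onto truth.
    (Assumed form of 'simple calibration'; the source does not specify it.)"""
    mx, my = mean(y_pred), mean(y_true)
    sxx = sum((x - mx) ** 2 for x in y_pred)
    sxy = sum((x - mx) * (y - my) for x, y in zip(y_pred, y_true))
    slope = sxy / sxx
    return slope, my - slope * mx


def calibrated_mae(y_true, y_pred):
    """MAE after applying the fitted linear calibration to the predictions."""
    slope, intercept = fit_linear_calibration(y_pred, y_true)
    return mean_absolute_error(y_true, [slope * p + intercept for p in y_pred])


# Hypothetical bone ages in months (toy values, not taken from the dataset).
truth = [24.0, 72.0, 120.0, 168.0]

# Case 1: predictions distorted linearly; calibration can undo this exactly.
linear_pred = [0.5 * t + 60.0 for t in truth]

# Case 2: predictions distorted nonlinearly; a line cannot fully undo this.
nonlinear_pred = [t ** 2 / 168.0 for t in truth]

raw_linear_mae = mean_absolute_error(truth, linear_pred)
cal_linear_mae = calibrated_mae(truth, linear_pred)
cal_nonlinear_mae = calibrated_mae(truth, nonlinear_pred)

# Toy binary labels and classifier scores for the AUC helper.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this toy setup the linearly distorted predictions are recovered exactly after calibration, while the nonlinearly distorted ones retain a nonzero error. That is the pattern the abstract describes when it says the structural modifications are not corrected by simple calibration.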

Paper Structure

This paper contains 8 sections, 4 figures.

Figures (4)

  • Figure 1: Comparison of original and inpainted pediatric hand X-rays. The image generator not only removed artifacts but also made the bones appear more mature.
  • Figure 2: Model predictions on original vs. inpainted images. Left: scatter with regression and identity line. Right: calibrated predictions vs. ground truth. Dashed lines indicate perfect agreement.
  • Figure 3: Confusion matrices for gender classification on the inpainted dataset and on the original dataset.
  • Figure 4: Visual and quantitative comparison of image appearance before and after inpainting. (Left) Pixel intensity distribution of original and inpainted hand radiographs, showing the normalized frequency of grayscale values (0--255) across all images. (Right) Boxplot of per-image pixel intensity standard deviation, confirming a significant decrease in noise after inpainting.
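The two quantities in Figure 4, a pooled normalized histogram of grayscale values and a per-image standard deviation of pixel intensity, can be sketched as follows. This is an illustrative sketch under stated assumptions: images are represented as flat lists of 8-bit values, the helper names are hypothetical, and the toy pixel values are invented to show the mechanics rather than to reproduce the paper's data.

```python
from collections import Counter
from statistics import pstdev


def normalized_histogram(images, levels=256):
    """Pool pixels from all images and return the fraction of pixels
    at each gray level 0..levels-1."""
    counts = Counter(px for img in images for px in img)
    total = sum(counts.values())
    return [counts.get(v, 0) / total for v in range(levels)]


def per_image_std(images):
    """Population standard deviation of pixel values, one value per image."""
    return [pstdev(img) for img in images]


# Toy flattened 8-bit "images" (hypothetical values, not from the dataset).
original = [[10, 200, 30, 220], [0, 255, 0, 255]]
smoothed = [[100, 120, 110, 130], [120, 130, 125, 135]]

hist_original = normalized_histogram(original)
hist_smoothed = normalized_histogram(smoothed)
std_original = per_image_std(original)
std_smoothed = per_image_std(smoothed)
```

A lower per-image standard deviation in the second set corresponds to the reduced pixel-level variation that the caption describes as a decrease in noise. On real radiographs the same helpers would be applied to each image's full pixel array.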