Evaluating the Clinical Impact of Generative Inpainting on Bone Age Estimation
Felipe Akio Matsuoka, Eduardo Moreno J. M. Farina, Augusto Sarquis Serpa, Soraya Monteiro, Rodrigo Ragazzini, Nitamar Abdala, Marcelo Straus Takahashi, Felipe Campos Kitamura
TL;DR
This study evaluates whether generative inpainting with gpt-image-1 preserves clinically relevant cues in pediatric hand radiographs. The authors created three inpainted variants per image (600 from 200 originals) to remove non-anatomical artifacts, then ran bone age and gender predictors on both sets. Inpainting substantially degraded downstream performance: bone age MAE rose from 6.26 to 30.11 months, and gender classification AUC dropped from 0.955 to 0.704. Calibration only partially mitigated the errors, which points to nonlinear distortions in the anatomical features the predictors rely on. The results highlight the risk of latent bias and feature alteration when foundation-model inpainting is applied in clinical workflows, and underscore the need for rigorous, task-specific validation and domain-tailored edits.
Abstract
Generative foundation models can remove visual artifacts through realistic image inpainting, but their impact on medical AI performance remains uncertain. Pediatric hand radiographs often contain non-anatomical markers, and it is unclear whether inpainting these regions preserves features needed for bone age and gender prediction. To evaluate the clinical reliability of generative model-based inpainting for artifact removal, we used the RSNA Bone Age Challenge dataset, selecting 200 original radiographs and generating 600 inpainted versions with gpt-image-1 using natural language prompts to target non-anatomical artifacts. Downstream performance was assessed with deep learning ensembles for bone age estimation and gender classification, using mean absolute error (MAE) and area under the ROC curve (AUC) as metrics, and pixel intensity distributions to detect structural alterations. Inpainting markedly degraded model performance: bone age MAE increased from 6.26 to 30.11 months, and gender classification AUC decreased from 0.955 to 0.704. Inpainted images displayed pixel-intensity shifts and inconsistencies, indicating structural modifications not corrected by simple calibration. These findings show that, although visually realistic, foundation model-based inpainting can obscure subtle but clinically relevant features and introduce latent bias even when edits are confined to non-diagnostic regions, underscoring the need for rigorous, task-specific validation before integrating such generative tools into clinical AI workflows.
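The evaluation described above rests on two standard metrics, MAE and AUC, plus a calibration check. The following is a minimal sketch of how such a comparison could be computed, not the authors' pipeline: the function names are hypothetical, the numbers are toy values invented for illustration, and the linear least-squares calibration is an assumption, since the abstract says only "simple calibration".

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error, e.g. bone age error in months."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney formulation).
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_linear_calibration(y_pred, y_true):
    """Least-squares fit of y_true ~ a * y_pred + b (assumed calibration form)."""
    mx, my = mean(y_pred), mean(y_true)
    sxx = sum((x - mx) ** 2 for x in y_pred)
    sxy = sum((x - mx) * (y - my) for x, y in zip(y_pred, y_true))
    a = sxy / sxx
    return a, my - a * mx


# Toy values only (NOT data from the study): true bone ages in months and
# hypothetical predictions on inpainted images that are not a monotone
# function of the truth, so a linear map cannot fully repair them.
true_age = [60, 96, 120, 150, 180]
pred_inpainted = [100, 130, 118, 160, 145]

raw_mae = mean_absolute_error(true_age, pred_inpainted)
a, b = fit_linear_calibration(pred_inpainted, true_age)
calibrated = [a * p + b for p in pred_inpainted]
calibrated_mae = mean_absolute_error(true_age, calibrated)

# Toy gender labels and classifier scores, again purely illustrative.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])

print(f"raw MAE: {raw_mae:.2f}, calibrated MAE: {calibrated_mae:.2f}, AUC: {toy_auc:.2f}")
```

In this toy case the linear fit lowers the MAE but leaves a residual error, which mirrors the kind of pattern the abstract reports: if the distortion were a pure scale-and-offset shift, calibration would remove it entirely.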
