Subgroup Validity in Machine Learning for Echocardiogram Data
Cynthia Feeney, Shane Williams, Benjamin S. Wessler, Michael C. Hughes
TL;DR
This paper examines subgroup validity in machine learning models for echocardiogram data. It argues that open transthoracic echocardiogram (TTE) datasets underreport patient sociodemographics and lack subgroup-level performance analyses. The authors provide new or improved demographic data for TMED-2 and MIMIC-IV-ECHO and evaluate aortic stenosis (AS) detection models on TMED-2 across subgroups, finding insufficient evidence of subgroup validity because subgroup sizes are small. They advocate for larger, more representative datasets, finer-grained demographic categories (including gender diversity), and explicit subgroup-focused evaluation before deployment, and they stress reproducibility and living data reporting. Together, these results highlight practical barriers to equitable ML deployment in echocardiography and outline a concrete path toward more reliable subgroup validity assessments.
Abstract
Echocardiogram datasets enable training deep learning models to automate interpretation of cardiac ultrasound, thereby expanding access to accurate readings of diagnostically useful images. However, the gender, sex, race, and ethnicity of the patients in these datasets are underreported, and subgroup-specific predictive performance is unevaluated. These reporting deficiencies raise concerns about subgroup validity that must be studied and addressed before model deployment. In this paper, we show that current open echocardiogram datasets are unable to assuage subgroup validity concerns. We improve sociodemographic reporting for two datasets: TMED-2 and MIMIC-IV-ECHO. Analysis of six open datasets reveals no consideration of gender-diverse patients and insufficient patient counts for many racial and ethnic groups. We further perform an exploratory subgroup analysis of two published aortic stenosis detection models on TMED-2. We find insufficient evidence of subgroup validity across sex, racial, and ethnic subgroups. Our findings highlight that more data for underrepresented subgroups, improved demographic reporting, and subgroup-focused analyses are needed to establish subgroup validity in future work.
