Detecting Heart Disease from Multi-View Ultrasound Images via Supervised Attention Multiple Instance Learning
Zhe Huang, Benjamin S. Wessler, Michael C. Hughes
TL;DR
This work tackles automatic screening for aortic stenosis (AS) from multi-view echocardiography by reframing AS diagnosis as a multiple instance learning (MIL) problem over the dozens of images in a study, whose views are unknown. It introduces SAMIL, a supervised-attention MIL architecture that learns to focus on clinically relevant views, namely parasternal long-axis (PLAX) and parasternal short-axis aortic valve (PSAX AoV), while still allowing flexible attention across images. It also introduces a self-supervised learning (SSL) pretraining strategy that learns representations at the bag (study) level rather than per image. The approach yields higher balanced accuracy than prior methods on TMED-2 across splits and competitive external validation results, while reducing model size and improving interpretability through attention that aligns with view relevance. The two key contributions, classifier-guided attention and bag-level SSL, offer a generalizable framework for multi-view medical imaging tasks in which a study-level outcome must be predicted from a heterogeneous collection of images, enabling scalable, portable screening tools.
Abstract
Aortic stenosis (AS) is a degenerative valve condition that causes substantial morbidity and mortality. This condition is under-diagnosed and under-treated. In clinical practice, AS is diagnosed with expert review of transthoracic echocardiography, which produces dozens of ultrasound images of the heart. Only some of these views show the aortic valve. To automate screening for AS, deep networks must learn to mimic a human expert's ability to identify views of the aortic valve and then aggregate across these relevant images to produce a study-level diagnosis. We find that previous approaches to AS detection yield insufficient accuracy because they rely on inflexible averages across images. We further find that off-the-shelf attention-based multiple instance learning (MIL) performs poorly. We contribute a new end-to-end MIL approach with two key methodological innovations. First, a supervised attention technique guides the learned attention mechanism to favor relevant views. Second, a novel self-supervised pretraining strategy applies contrastive learning to the representation of the whole study instead of to individual images, as is commonly done in prior literature. Experiments on an open-access dataset and an external validation set show that our approach yields higher accuracy while reducing model size.
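
To make the idea of supervised attention concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes a standard tanh-based attention pooling over per-image embeddings and a KL-divergence penalty that pulls the attention weights toward a normalized view-relevance distribution. The function names, the shapes, the choice of KL divergence, and the example relevance scores (which could come, for example, from a view classifier) are all illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(H, V, w):
    """Attention-based MIL pooling (assumed tanh form).

    H: (n_images, d) per-image embeddings for one study (the "bag").
    V: (d, k) and w: (k,) are hypothetical learnable attention parameters.
    Returns attention weights a (n_images,) and the study embedding z (d,).
    """
    scores = np.tanh(H @ V) @ w   # one scalar score per image
    a = softmax(scores)           # weights are nonnegative and sum to 1
    z = a @ H                     # attention-weighted average of embeddings
    return a, z

def supervised_attention_loss(a, relevance, eps=1e-12):
    """KL(target || a), with target the normalized view-relevance scores.

    Assumption: relevance[i] >= 0 says how likely image i shows a relevant view.
    """
    target = relevance / relevance.sum()
    mask = target > 0             # terms with zero target contribute nothing
    return float(np.sum(target[mask] * np.log(target[mask] / (a[mask] + eps))))

# Toy study with 5 images and 8-dimensional embeddings (random stand-ins).
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
V = rng.normal(size=(8, 4))
w = rng.normal(size=4)

a, z = attention_pool(H, V, w)
relevance = np.array([0.9, 0.1, 0.8, 0.05, 0.05])  # made-up relevance scores
loss = supervised_attention_loss(a, relevance)
```

In a training loop, a term like `loss` would be added to the study-level diagnosis loss with some weighting, so that attention is encouraged, but not forced, to concentrate on images of relevant views.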
