Detecting Heart Disease from Multi-View Ultrasound Images via Supervised Attention Multiple Instance Learning

Zhe Huang, Benjamin S. Wessler, Michael C. Hughes

TL;DR

This work tackles automatic aortic stenosis (AS) screening from multi-view echocardiography by framing diagnosis as a multiple-instance learning (MIL) problem over dozens of images with unknown views. It introduces SAMIL, a supervised-attention MIL architecture that learns to focus on clinically relevant views (PLAX/PSAX AoV) while allowing flexible attention across images, and a self-supervised pretraining regime that learns representations at the bag (study) level rather than per image. The approach improves balanced accuracy on TMED-2 across splits and gives competitive external validation results, while reducing model size and improving interpretability through attention aligned with view relevance. The two key contributions, classifier-guided attention and bag-level self-supervised learning (SSL), offer a general framework for multi-view medical imaging tasks in which a study-level outcome must be aggregated from a heterogeneous collection of images, enabling scalable, portable screening tools.

Abstract

Aortic stenosis (AS) is a degenerative valve condition that causes substantial morbidity and mortality. This condition is under-diagnosed and under-treated. In clinical practice, AS is diagnosed with expert review of transthoracic echocardiography, which produces dozens of ultrasound images of the heart. Only some of these views show the aortic valve. To automate screening for AS, deep networks must learn to mimic a human expert's ability to identify views of the aortic valve and then aggregate across these relevant images to produce a study-level diagnosis. We find previous approaches to AS detection yield insufficient accuracy due to relying on inflexible averages across images. We further find that off-the-shelf attention-based multiple instance learning (MIL) performs poorly. We contribute a new end-to-end MIL approach with two key methodological innovations. First, a supervised attention technique guides the learned attention mechanism to favor relevant views. Second, a novel self-supervised pretraining strategy applies contrastive learning on the representation of the whole study instead of individual images as commonly done in prior literature. Experiments on an open-access dataset and an external validation set show that our approach yields higher accuracy while reducing model size.
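
To make the contrast between "inflexible averages across images" and attention-based MIL pooling concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy 2-dimensional embeddings, and the hand-picked attention scores are all hypothetical. In an attention-based MIL model the per-image scores would come from a small learned network, which is omitted here.

```python
import math


def softmax(scores):
    """Turn raw per-image scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_pool(embeddings):
    """Uniform pooling: every image in the study counts equally."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / n for d in range(dim)]


def attention_pool(embeddings, scores):
    """Weighted pooling: images with higher scores contribute more.

    `scores` is a hypothetical stand-in for the output of a learned
    attention module; here it is simply passed in by the caller.
    Returns the pooled study embedding and the attention weights.
    """
    weights = softmax(scores)
    dim = len(embeddings[0])
    pooled = [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]
    return pooled, weights


# Toy study ("bag") of three images with 2-dimensional embeddings.
study = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform = average_pool(study)
# Equal scores reproduce the plain average.
same_as_uniform, _ = attention_pool(study, [0.0, 0.0, 0.0])
# A large score on the first image lets it dominate the study representation.
focused, focus_weights = attention_pool(study, [10.0, 0.0, 0.0])
```

With equal scores, attention pooling reduces to the plain average, so averaging is the special case in which no image is preferred. The abstract's point is that a diagnosis model benefits from moving away from that special case toward the images that actually show the aortic valve.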

Paper Structure (54 sections, 8 equations, 6 figures, 13 tables)

Figures (6)

  • Figure 1: Overview of methods for diagnosing aortic valve disease from multiple images of the heart. In our chosen diagnostic problem, the input is multiple ultrasound images representing different canonical view types of the heart's complex anatomy (e.g. PLAX, PSAX, A2C, A4C, and more; see \cite{mitchell2019guidelines} for a taxonomy). The output is a probabilistic prediction of the severity of aortic stenosis (AS), on a 3-level scale of no / early / significant disease. We wish to develop deep learning methods that can solve this problem like expert cardiologists (panel a). Two recent efforts (panel b by others, panel c by our group) made progress using a separately-trained view type classifier and per-image diagnosis classifier, but rely on combining diagnosis probabilities across images via average pooling that cannot learn how to distribute attention non-uniformly among images of relevant views. In this work, we develop flexible attention-based multiple instance learning (MIL, panel d), with crucial contributions of supervised attention (Sec. \ref{sec:methods_SA}) and improved pretraining strategies (Sec. \ref{sec:methods_CL}) that substantially improve performance at our task.
  • Figure 2: Overview of the proposed method: Supervised Attention Multiple Instance Learning (SAMIL). Given a study or "bag" with many images of diverse views of unknown type, a feature extractor processes each image individually into an embedding vector. Two attention modules (one supervised by a view classifier and one without) produce attention weights for each instance. The final study representation is a weighted average of the image embeddings, with weights obtained by combining the two attentions (Eq. \ref{eq:patient_embedding_samil}). A fully-connected (FC) layer maps the study representation to a 3-class diagnosis (no/early/significant AS). Pretraining: SAMIL can be pretrained using bag-level (recommended, Sec. \ref{sec:methods_CL}) or image-level contrastive learning. In either case, a projection head maps representations to a latent space where the contrastive loss is applied \cite{chen2020simple,chen2020improved}. The projection head is discarded after pretraining.
  • Figure 3: Predicted view relevance of top-ranked images by attention (higher is better). Supervised attention (SAMIL, ours) outperforms off-the-shelf ABMIL by a wide margin across all 3 splits. The x-axis indicates the rank position $k$ of an image within an echo study when images are sorted by attention (1 = largest $a_k$, 2 = second largest, etc.). The y-axis indicates the average view relevance (across studies in the test set) assigned by the view classifier $v(x)$ to the image $x$ at rank $k$.
  • Figure A.1: Confusion matrices for patient-level AS diagnosis classification, across three predefined train/test splits of TMED-2.
  • Figure A.2: Receiver operating characteristic (ROC) curves for diagnosis classification. Results are shown across three predefined train/test splits of TMED-2 and three clinically relevant screening tasks.
  • ...and 1 more figures
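
The captions for Figures 2 and 3 describe two computations that a short sketch may clarify: merging two attention vectors into one set of pooling weights, and the rank-wise view-relevance curve. The pure-Python code below is illustrative only. The function names and toy numbers are hypothetical, and the combination rule (elementwise product followed by renormalization) is an assumption made for this example; the Figure 2 caption defers to the paper's own equation for the actual rule. The rank-wise average follows the axis description in the Figure 3 caption.

```python
def combine_attention(supervised, unsupervised):
    """Merge two per-image attention vectors into one that sums to 1.

    ASSUMPTION: elementwise product, then renormalization. The source
    only says the two attentions are combined; this rule is illustrative.
    """
    products = [s * u for s, u in zip(supervised, unsupervised)]
    total = sum(products)
    if total == 0:
        raise ValueError("combined attention is zero for every image")
    return [p / total for p in products]


def mean_relevance_by_rank(studies, max_rank):
    """Average view relevance of the image at each attention rank.

    `studies` is a list of (attention, relevance) pairs, one per study,
    where both entries are per-image lists of equal length. Entry k-1 of
    the result is the mean relevance of the image ranked k by attention,
    averaged over the studies that contain at least k images.
    """
    sums = [0.0] * max_rank
    counts = [0] * max_rank
    for attention, relevance in studies:
        # Image indices sorted from largest to smallest attention weight.
        order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
        for rank, image_index in enumerate(order[:max_rank]):
            sums[rank] += relevance[image_index]
            counts[rank] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


# Toy data: two studies with hypothetical attention weights and
# hypothetical view-classifier relevance scores per image.
toy_studies = [
    ([0.1, 0.7, 0.2], [0.0, 1.0, 0.5]),
    ([0.6, 0.4], [0.8, 0.2]),
]
combined = combine_attention([0.5, 0.5], [0.2, 0.8])
curve = mean_relevance_by_rank(toy_studies, max_rank=3)
```

A curve that stays high at small ranks means the images receiving the most attention are the ones the view classifier considers relevant, which is the behavior Figure 3 reports for supervised attention.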