Anatomically Constrained Transformers for Echocardiogram Analysis

Alexander Thorley, Agis Chartsias, Jordan Strom, Jeremy Slivnick, Dipak Kotecha, Alberto Gomez, Jinming Duan

TL;DR

ViACT tackles spurious correlations in echocardiography by restricting transformer inputs to patches of the deforming myocardium. The patches are centered on a myocardium point set whose points may lie at non-integer pixel locations. It introduces an anatomical MAE pre-training scheme that reconstructs only myocardial patches, reducing compute while steering representations toward diagnostically relevant content. The approach delivers strong performance on cardiac amyloidosis classification and left ventricular EF regression, and enables myocardium point tracking without correlation volumes, with attention maps localized to pathology-relevant regions. This anatomically grounded backbone offers interpretable, task-agnostic utility for diverse echo analysis tasks and a scalable path toward end-to-end myocardium-focused analysis.
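
As a rough illustration of sampling a patch centered at a non-integer point, the sketch below extracts a 3 x 3 patch from a small image. The interpolation scheme is not stated here, so bilinear interpolation with border clamping is an assumption, and the names `bilinear` and `sample_patch` are hypothetical helpers rather than part of ViACT.

```python
import math


def bilinear(image, x, y):
    """Sample `image` (a list of rows) at continuous (x, y).

    Assumption: bilinear interpolation with coordinates clamped to the
    image border; the source does not specify the scheme actually used.
    """
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0][x0] + fx * image[y0][x1]
    bottom = (1 - fx) * image[y1][x0] + fx * image[y1][x1]
    return (1 - fy) * top + fy * bottom


def sample_patch(image, cx, cy, size=3):
    """Return a size x size patch centered at the (possibly fractional) point (cx, cy)."""
    half = (size - 1) / 2.0
    return [
        [bilinear(image, cx + j - half, cy + i - half) for j in range(size)]
        for i in range(size)
    ]


# Toy 5 x 5 image whose intensity is 5 * row + column.
image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]

patch_int = sample_patch(image, 2.0, 2.0)   # center on a pixel
patch_frac = sample_patch(image, 2.5, 1.5)  # center between pixels
```

With an integer center the patch simply copies pixel values. With a fractional center each patch entry is a weighted average of its four neighboring pixels, which lets the patch follow a point that moves by sub-pixel amounts between frames.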

Abstract

Video transformers have recently demonstrated strong potential for echocardiogram (echo) analysis, leveraging self-supervised pre-training and flexible adaptation across diverse tasks. However, like other models operating on videos, they are prone to learning spurious correlations from non-diagnostic regions such as image backgrounds. To overcome this limitation, we propose the Video Anatomically Constrained Transformer (ViACT), a novel framework that integrates anatomical priors directly into the transformer architecture. ViACT represents a deforming anatomical structure as a point set and encodes both its spatial geometry and corresponding image patches into transformer tokens. During pre-training, ViACT follows a masked autoencoding strategy that masks and reconstructs only anatomical patches, enforcing that representation learning is focused on the anatomical region. The pre-trained model can then be fine-tuned for tasks localized to this region. In this work we focus on the myocardium, demonstrating the framework on echo analysis tasks such as left ventricular ejection fraction (EF) regression and cardiac amyloidosis (CA) detection. The anatomical constraint focuses transformer attention within the myocardium, yielding interpretable attention maps aligned with regions of known CA pathology. Moreover, ViACT generalizes to myocardium point tracking without requiring task-specific components such as correlation volumes used in specialized tracking networks.
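
The masked autoencoding strategy described above can be sketched as follows: a random subset of anatomical point tokens is hidden, and the reconstruction error is computed only over the hidden patches. This is a minimal sketch under stated assumptions, not the authors' implementation. The 0.75 mask ratio, the mean-squared-error loss, and the helper names `split_tokens` and `masked_mse` are all hypothetical choices made for illustration.

```python
import random


def split_tokens(num_points, mask_ratio, rng):
    """Randomly split point-token indices into visible and masked sets."""
    num_masked = int(round(num_points * mask_ratio))
    order = list(range(num_points))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def masked_mse(pred, target, masked):
    """Mean squared error over masked patches only (assumed loss)."""
    total, count = 0.0, 0
    for idx in masked:
        for p, t in zip(pred[idx], target[idx]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


rng = random.Random(0)
num_points = 8  # hypothetical number of myocardium points in one frame
visible, masked = split_tokens(num_points, mask_ratio=0.75, rng=rng)

# Toy flattened 3 x 3 patches: one 9-value patch per point.
target = [[float(i)] * 9 for i in range(num_points)]

# Stand-in "reconstruction": off by 0.5 on masked patches and arbitrary
# on visible ones, to show that visible patches do not affect the loss.
masked_set = set(masked)
pred = [
    [v + 0.5 for v in patch] if i in masked_set else [100.0] * 9
    for i, patch in enumerate(target)
]

loss = masked_mse(pred, target, masked)
```

Because every token corresponds to an anatomical patch, no encoder or decoder capacity is spent on background pixels, which is consistent with the reduced pre-training compute noted in the TL;DR.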

Paper Structure

This paper contains 16 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Top: the ViACT model. Bottom left: the tokenizer component of the model embedding a single frame and corresponding points. Bottom right: example $3 \times 3$ patches (red) centered at integer and non-integer points (teal) on a grid of pixels (gray).
  • Figure 2: The anatomical MAE framework. Graphic inspired by He et al. (2022).
  • Figure 3: Example reconstructions from the temporal ViACT. For each sample, the top row shows masked patches, the middle row reconstructed patches, and the bottom row ground-truth patches, for a subset of the 18 processed frames. Masked patches are marked with transparent teal squares.
  • Figure 4: The pipeline for tuning a ViACT model for the point tracking task.
  • Figure 5: Pre-training compute time and memory usage for ViACT and MAE-ST (Feichtenhofer et al., 2022) at varying model sizes, using a batch size of 10.
  • ...and 1 more figures