Echo-E$^3$Net: Efficient Endocardial Spatio-Temporal Network for Ejection Fraction Estimation

Moein Heidari, Afshin Bozorgpour, AmirHossein Zarif-Fakharnia, Wenjin Chen, Dorit Merhof, David J Foran, Jasmine Grewal, Ilker Hacihaliloglu

TL;DR

The paper addresses the need for accurate yet efficient LVEF estimation from echocardiography that is suitable for real-time POCUS. It introduces Echo-E$^3$Net, a lightweight framework that jointly models dual-phase endocardial borders and global spatio-temporal features using an LHUNet-based backbone, a dual-phase Endocardial Border Detector (E$^2$CBD), and an Endocardial Feature Aggregator (E$^2$FA). A Simpson-inspired geometric loss ties EF regression to anatomically meaningful LV geometry, improving robustness in clinically critical low-EF ranges. The approach achieves state-of-the-art performance on EchoNet-Dynamic with only 1.54M parameters and 8.05 GFLOPs while running in real time on a CPU, underscoring its potential for deployment in resource-constrained POCUS environments. The authors also provide ablations and visualizations to corroborate the benefits of the anatomical guidance and efficient design.

Abstract

Left ventricular ejection fraction (LVEF) is a key indicator of cardiac function and is routinely used to diagnose heart failure and guide treatment decisions. Although deep learning has advanced automated LVEF estimation, many existing approaches are computationally demanding and underutilize the joint structure of spatial and temporal information in echocardiography videos, limiting their suitability for real-time clinical deployment. We propose Echo-E$^3$Net, an efficient endocardial spatio-temporal network specifically designed for LVEF estimation from echocardiography videos. Echo-E$^3$Net comprises two complementary modules: (1) a dual-phase Endocardial Border Detector (E$^2$CBD), which uses phase-specific cross-attention to predict ED/ES endocardial landmarks (EBs) and learn phase-aware landmark embeddings (LEs), and (2) an Endocardial Feature Aggregator (E$^2$FA), which fuses these embeddings with global statistical descriptors (mean, maximum, variance) of deep feature maps to refine EF regression. A multi-component loss function, inspired by Simpson's biplane method, jointly supervises EF, volumes, and landmark geometry, thereby aligning optimization with the clinical definition of LVEF and promoting robust spatio-temporal representation learning. Evaluated on the EchoNet-Dynamic dataset, Echo-E$^3$Net achieves an RMSE of 5.20 and an $R^2$ score of 0.82, while using only 1.54M parameters and 8.05 GFLOPs. The model operates without external pre-training, heavy data augmentation, or test-time ensembling, making it highly suitable for real-time point-of-care ultrasound (POCUS) applications. Code is available at https://github.com/UltrAi-lab/Echo-E3Net.
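
For readers unfamiliar with the clinical quantities the abstract refers to, the following is a minimal sketch of the standard definitions, not the authors' implementation. It assumes the textbook method-of-disks form of Simpson's biplane rule (volume as a sum of elliptical disks) and the usual definition of EF as a percentage of end-diastolic volume. The function names and the choice of 20 disks in the usage lines are illustrative assumptions.

```python
import math


def simpson_biplane_volume(diam_a, diam_b, long_axis_length):
    """Approximate LV volume by stacking elliptical disks.

    diam_a, diam_b: per-disk diameters measured in two orthogonal views.
    long_axis_length: LV long-axis length; each disk has height L / n.
    """
    if len(diam_a) != len(diam_b) or not diam_a:
        raise ValueError("diameter lists must be non-empty and equal in length")
    disk_height = long_axis_length / len(diam_a)
    # Area of an ellipse with diameters a and b is (pi / 4) * a * b.
    return sum(math.pi / 4.0 * a * b * disk_height for a, b in zip(diam_a, diam_b))


def ejection_fraction(edv, esv):
    """EF in percent from end-diastolic (EDV) and end-systolic (ESV) volumes."""
    if edv <= 0:
        raise ValueError("EDV must be positive")
    return 100.0 * (edv - esv) / edv


# Illustrative usage with made-up numbers (not data from the paper).
volume = simpson_biplane_volume([4.0] * 20, [4.0] * 20, 8.0)  # a cylinder
ef = ejection_fraction(120.0, 50.0)  # about 58.3 percent
```

Because EF depends only on the ratio of the two volumes, errors in ESV matter most when EDV is small, which is one reason a loss that supervises volumes and landmark geometry jointly can plausibly help in low-EF cases.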

Paper Structure

This paper contains 20 sections, 27 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Overview of the LHUNet encoder architecture [sadegheih2024lhu]. The encoder produces multi-scale spatio-temporal features; we use the deepest feature map for global aggregation and the skip features as inputs to our E2CBD module.
  • Figure 2: Overall architecture of Echo-E3Net. The input echocardiographic video is processed by the LHUNet encoder to produce multi-scale spatio-temporal features. The E2CBD module applies phase-specific cross-attention from ED/ES landmark queries to the multi-scale token set, yielding explicit dual-phase landmark coordinates and corresponding landmark embeddings. The E2FA module aggregates global statistics (average, maximum, variance) from the deepest feature map and fuses them with the landmark descriptor to regress EF and, when available, EDV/ESV.
  • Figure 3: Left ventricular measurements using the biplane Simpson's method (figure adapted from [liu2025think]). The clinical workflow relies on accurate localization of key landmarks (apex and mitral annulus endpoints) to define the LV long axis, followed by diameter measurements at multiple levels and volumetric integration using stacked elliptical disks. Our geometric losses are inspired by these principles.
  • Figure 4: Illustration of Simpson's biplane method (figure adapted from [ecgwaves_ef]). LV volume is computed by stacking elliptical disks with diameters $a$ and $b$ and height $h$. Translating this into differentiable constraints allows us to regularize landmark predictions without explicit volumetric integration at inference time.
  • Figure 5: (Left) The confusion matrix of our top-performing model. (Right) Scatter plot of our model's EF predictions against the ground-truth values.
  • ...and 2 more figures
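
The global aggregation described in the Figure 2 caption (average, maximum, and variance of the deepest feature map, fused with a landmark descriptor) can be sketched roughly as follows. This is a hedged illustration in plain Python, not the E2FA module itself: the flat per-channel layout, the use of population variance, fusion by simple concatenation, and the names `global_stats` and `fuse_descriptors` are all assumptions made for the example.

```python
from statistics import fmean, pvariance


def global_stats(feature_map):
    """Per-channel mean, max, and variance of a feature map.

    feature_map: list of channels, each a flat list of activations
    (assumed here to be flattened over time and space).
    Returns the concatenation [means..., maxes..., variances...].
    """
    means = [fmean(channel) for channel in feature_map]
    maxes = [max(channel) for channel in feature_map]
    # Population variance is an assumption; the source does not specify it.
    variances = [pvariance(channel) for channel in feature_map]
    return means + maxes + variances


def fuse_descriptors(feature_map, landmark_descriptor):
    """Concatenate global statistics with a landmark descriptor (assumed fusion)."""
    return global_stats(feature_map) + list(landmark_descriptor)


# Illustrative usage: two channels and a 2-value landmark descriptor.
fused = fuse_descriptors([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]], [0.5, 0.25])
```

In a real model the fused vector would feed a small regression head that outputs EF; the three statistics give that head a cheap summary of the whole clip without any extra learned parameters.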