Echo-E$^3$Net: Efficient Endocardial Spatio-Temporal Network for Ejection Fraction Estimation
Moein Heidari, Afshin Bozorgpour, AmirHossein Zarif-Fakharnia, Wenjin Chen, Dorit Merhof, David J Foran, Jasmine Grewal, Ilker Hacihaliloglu
TL;DR
The paper addresses the need for accurate yet efficient left ventricular ejection fraction (LVEF) estimation from echocardiography that is suitable for real-time point-of-care ultrasound (POCUS). It introduces Echo-E$^3$Net, a lightweight framework that jointly models dual-phase endocardial borders and global spatio-temporal features using an LHUNet-based backbone, a dual-phase Endocardial Border Detector (E$^2$CBD), and an Endocardial Feature Aggregator (E$^2$FA). A Simpson-inspired geometric loss ties EF regression to anatomically meaningful LV geometry, improving robustness in clinically critical low-EF ranges. The approach achieves state-of-the-art performance on EchoNet-Dynamic with only 1.54M parameters and 8.05 GFLOPs and runs in real time on CPU, underscoring its potential for deployment in resource-constrained POCUS environments. The authors also provide ablations and visualizations that corroborate the benefits of the anatomical guidance and the efficient design.
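To make the geometric quantities behind the Simpson-inspired loss concrete, the following is a minimal sketch of the standard biplane method of disks and the EF definition, not the authors' loss implementation. The function names, the argument layout, and the use of plain Python lists are assumptions made for illustration; when only one apical view is available, passing the same diameters for both views reduces the formula to the single-plane case.

```python
import math


def simpson_biplane_volume(diam_view_a, diam_view_b, length):
    """Method of disks: V = (pi / 4) * sum(a_i * b_i) * (L / n).

    diam_view_a, diam_view_b: per-disk LV diameters from two orthogonal
    views (hypothetical inputs, e.g. derived from predicted landmarks).
    length: long-axis length of the ventricle, in the same unit.
    """
    if not diam_view_a or len(diam_view_a) != len(diam_view_b):
        raise ValueError("both views need the same non-zero number of disks")
    n = len(diam_view_a)
    disk_sum = sum(a * b for a, b in zip(diam_view_a, diam_view_b))
    return math.pi / 4.0 * disk_sum * length / n


def ejection_fraction(edv, esv):
    """EF in percent from end-diastolic and end-systolic volumes."""
    if edv <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv - esv) / edv


# Sanity check: 20 identical disks of diameter 4 and total length 8
# describe a cylinder, whose volume is pi * 2**2 * 8.
cylinder = simpson_biplane_volume([4.0] * 20, [4.0] * 20, 8.0)
ef = ejection_fraction(120.0, 50.0)
```

Because both volumes are differentiable functions of the disk diameters, a loss built on these quantities can, in principle, penalize landmark geometry and EF together, which is the coupling the TL;DR describes.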
Abstract
Left ventricular ejection fraction (LVEF) is a key indicator of cardiac function and is routinely used to diagnose heart failure and guide treatment decisions. Although deep learning has advanced automated LVEF estimation, many existing approaches are computationally demanding and underutilize the joint structure of spatial and temporal information in echocardiography videos, limiting their suitability for real-time clinical deployment. We propose Echo-E$^3$Net, an efficient endocardial spatio-temporal network specifically designed for LVEF estimation from echocardiography videos. Echo-E$^3$Net comprises two complementary modules: (1) a dual-phase Endocardial Border Detector (E$^2$CBD), which uses phase-specific cross-attention to predict end-diastolic (ED) and end-systolic (ES) endocardial border landmarks (EBs) and learn phase-aware landmark embeddings (LEs), and (2) an Endocardial Feature Aggregator (E$^2$FA), which fuses these embeddings with global statistical descriptors (mean, maximum, variance) of deep feature maps to refine EF regression. A multi-component loss function, inspired by Simpson's biplane method, jointly supervises EF, volumes, and landmark geometry, thereby aligning optimization with the clinical definition of LVEF and promoting robust spatio-temporal representation learning. Evaluated on the EchoNet-Dynamic dataset, Echo-E$^3$Net achieves an RMSE of 5.20 and an $R^2$ score of 0.82 while using only 1.54M parameters and 8.05 GFLOPs. The model operates without external pre-training, heavy data augmentation, or test-time ensembling, making it well suited to real-time point-of-care ultrasound (POCUS) applications. Code is available at https://github.com/UltrAi-lab/Echo-E3Net.
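The global statistical descriptors used by E$^2$FA can be illustrated with a small sketch. The per-channel pooling, the flattened list layout, the population variance, and the fusion by concatenation are all assumptions made here for illustration; the abstract only states that mean, maximum, and variance descriptors are fused with the landmark embeddings, and the released code should be consulted for the actual operation.

```python
from statistics import fmean, pvariance


def global_descriptors(feature_map):
    """Pool each channel to (mean, max, variance).

    feature_map: list of channels, each a flat list of activations
    (a stand-in for a deep feature map flattened over time and space).
    """
    descriptors = []
    for channel in feature_map:
        descriptors.extend([fmean(channel), max(channel), pvariance(channel)])
    return descriptors


def fuse(landmark_embedding, feature_map):
    """Hypothetical fusion: concatenate embedding and pooled descriptors."""
    return list(landmark_embedding) + global_descriptors(feature_map)


# Two toy channels and a two-dimensional landmark embedding.
toy_features = [[1.0, 2.0, 3.0], [0.0, 0.0, 4.0]]
fused = fuse([0.5, -0.5], toy_features)
```

Pooling of this kind adds no learnable parameters, which is consistent with the small parameter budget the abstract reports.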
