Multivariate Gaussian Representation Learning for Medical Action Evaluation
Luming Yang, Haoxian Liu, Siqing Li, Alper Yilmaz
TL;DR
This work addresses fine-grained, real-time medical action evaluation for CPR by introducing the CPREval-6k benchmark and GaussMedAct, a dual-stream framework that combines Multivariate Gaussian Representation (MGR) with Hybrid Spatial Encoding (HSE) to produce compact, interpretable action tokens. GaussMedAct delivers state-of-the-art Top-1 accuracy (92.1%) with real-time inference at a fraction of the computational cost of RGB baselines (10% of the baseline FLOPs), and it remains robust across datasets and under realistic perturbations. Combining probabilistic temporal modeling with separate joint and bone representations enables precise, robust medical motion understanding and supports downstream tasks such as report generation. Together, these contributions lay a foundation for real-time CPR evaluation and can be extended to other medical action assessment tasks.
Abstract
Fine-grained action evaluation in medical vision faces unique challenges: comprehensive datasets are lacking, precision requirements are stringent, and the spatiotemporal dynamics of very rapid actions are insufficiently modeled. To support development and evaluation, we introduce CPREval-6k, a multi-view, multi-label medical action benchmark containing 6,372 expert-annotated videos with 22 clinical labels. Using this dataset, we present GaussMedAct, a multivariate Gaussian encoding framework that advances medical motion analysis through adaptive spatiotemporal representation learning. Multivariate Gaussian Representation projects joint motions into a temporally scaled multi-dimensional space and decomposes actions into adaptive 3D Gaussians that serve as tokens. These tokens preserve motion semantics through anisotropic covariance modeling while remaining robust to spatiotemporal noise. Hybrid Spatial Encoding uses a Cartesian and Vector dual-stream strategy to exploit skeletal information in the form of joint and bone features. The proposed method achieves 92.1% Top-1 accuracy with real-time inference on the benchmark, outperforming the baseline by 5.9% in accuracy with only 10% of the FLOPs. Cross-dataset experiments confirm the method's superior robustness.
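To make the two ideas in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single 2D joint track, lifts each frame to a point (x, y, scaled time), and summarizes fixed-length windows by a mean and a full 3x3 covariance, which is what makes each token anisotropic. The fixed windowing stands in for the adaptive decomposition described above, and the function names, the `time_scale` parameter, and the `window` parameter are hypothetical. The second helper illustrates the joint/bone distinction: bone features are taken here simply as child-minus-parent joint offsets.

```python
from statistics import fmean


def gaussian_token(points):
    """Return (mean, covariance) of a list of 3D points.

    The covariance is the full 3x3 population covariance, so the
    Gaussian can be elongated along any direction (anisotropic).
    """
    n = len(points)
    mean = [fmean(p[d] for p in points) for d in range(3)]
    cov = [
        [sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
         for j in range(3)]
        for i in range(3)
    ]
    return mean, cov


def gaussian_tokens(track, time_scale=0.1, window=4):
    """Summarize one joint's (x, y) track as a list of 3D Gaussian tokens.

    Assumption: fixed-size windows replace the adaptive decomposition.
    """
    # Lift each frame to (x, y, scaled time).
    points = [(x, y, time_scale * t) for t, (x, y) in enumerate(track)]
    return [
        gaussian_token(points[start:start + window])
        for start in range(0, len(points), window)
    ]


def bone_vectors(joints, bones):
    """Return child-minus-parent offsets for each (parent, child) index pair."""
    return [
        (joints[c][0] - joints[p][0], joints[c][1] - joints[p][1])
        for p, c in bones
    ]


if __name__ == "__main__":
    # A joint moving straight along y for 8 frames (toy data).
    track = [(0.0, float(t)) for t in range(8)]
    for mean, cov in gaussian_tokens(track, time_scale=0.5, window=4):
        print(mean, cov)
    # One bone connecting joint 0 to joint 1.
    print(bone_vectors([(0.0, 0.0), (1.0, 2.0)], [(0, 1)]))
```

In this toy track the motion is purely along y, so each token has zero variance in x and a nonzero y-time covariance term. A direction-dependent spread of this kind is what a full covariance can capture and an isotropic one cannot.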
