Multivariate Gaussian Representation Learning for Medical Action Evaluation

Luming Yang, Haoxian Liu, Siqing Li, Alper Yilmaz

TL;DR

This work tackles the need for fine-grained, real-time medical action evaluation in CPR by introducing the CPREval-6k benchmark and GaussMedAct, a dual-stream framework that combines Multivariate Gaussian Representation (MGR) with Hybrid Spatial Encoding (HSE) to produce compact, interpretable action tokens. GaussMedAct delivers state-of-the-art Top-1 accuracy (around 92%) with real-time inference at a fraction of the computational cost of RGB baselines, and demonstrates robustness across datasets and realistic perturbations. The combination of probabilistic temporal modeling and separable joint/bone representations enables precise, robust medical motion understanding and supports downstream tasks like report generation. Collectively, these contributions establish a foundation for real-time CPR evaluation and extendable approaches for other medical action assessment tasks.

Abstract

Fine-grained action evaluation in medical vision faces unique challenges: comprehensive datasets are unavailable, precision requirements are stringent, and existing methods insufficiently model the spatiotemporal dynamics of very rapid actions. To support development and evaluation, we introduce CPREval-6k, a multi-view, multi-label medical action benchmark containing 6,372 expert-annotated videos with 22 clinical labels. Using this dataset, we present GaussMedAct, a multivariate Gaussian encoding framework that advances medical motion analysis through adaptive spatiotemporal representation learning. Multivariate Gaussian Representation projects joint motions into a temporally scaled multi-dimensional space and decomposes actions into adaptive 3D Gaussians that serve as tokens. These tokens preserve motion semantics through anisotropic covariance modeling while remaining robust to spatiotemporal noise. Hybrid Spatial Encoding, which employs a Cartesian and Vector dual-stream strategy, exploits skeletal information in the form of joint and bone features. The proposed method achieves 92.1% Top-1 accuracy with real-time inference on the benchmark, outperforming the baseline by +5.9% accuracy with only 10% of the FLOPs. Cross-dataset experiments confirm the superior robustness of our method.
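To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes 2D joint coordinates, a point layout of (scaled time, x, y), and fixed-length windows in place of the adaptive decomposition the paper describes. The function names `bone_vectors` and `gaussian_tokens`, the `parents` list, and the `time_scale` and `window` parameters are all hypothetical.

```python
import numpy as np


def bone_vectors(joints, parents):
    """Vector stream (assumed form): each bone is child joint minus parent joint.

    joints:  array of shape (T, J, D), joint positions over T frames.
    parents: length-J sequence, parent index per joint (-1 for the root).
    Returns an array of shape (T, J, D); the root's bone is left at zero.
    """
    joints = np.asarray(joints, dtype=float)
    bones = np.zeros_like(joints)
    for child, parent in enumerate(parents):
        if parent >= 0:
            bones[:, child] = joints[:, child] - joints[:, parent]
    return bones


def gaussian_tokens(track, window=8, time_scale=1.0):
    """Summarize one joint (or bone) track as a sequence of 3D Gaussians.

    track: array of shape (T, 2), one 2D signal over T frames.
    Each frame becomes a point (time_scale * t, x, y). Every non-overlapping
    window of `window` frames (window >= 2) is summarized by its mean and
    full covariance. The off-diagonal covariance terms are what make the
    Gaussian anisotropic. Fixed windows are a simplification of the
    adaptive decomposition described in the abstract.
    """
    track = np.asarray(track, dtype=float)
    t = time_scale * np.arange(len(track))
    points = np.column_stack([t, track])  # shape (T, 3)
    tokens = []
    for start in range(0, len(points) - window + 1, window):
        chunk = points[start:start + window]
        mu = chunk.mean(axis=0)              # shape (3,)
        sigma = np.cov(chunk, rowvar=False)  # shape (3, 3)
        tokens.append((mu, sigma))
    return tokens
```

Under these assumptions, a 32-frame track with `window=8` yields four `(mu, sigma)` pairs. Applying `gaussian_tokens` to both the raw joint tracks and the output of `bone_vectors` gives one token sequence per stream, which a downstream model could then fuse.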

Paper Structure

This paper contains 12 sections, 10 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Efficiency-Accuracy Trade-off Comparison. Each point represents a model, with coordinates indicating computational complexity (GFLOPs) in log scale and top-1 accuracy. The proposed model is Pareto optimal.
  • Figure 2: Dataset Overview. A multi-view CPR dataset with hierarchical error annotation, comprising 6 primary error classes and 21 fine-grained sub-classes. Each instance includes one primary error label and multiple secondary labels for compound error analysis.
  • Figure 3: Schematic of the GaussMedAct Pipeline. Input data undergoes Cartesian- and vector-based dual-stream encoding and passes through MGR to generate Gaussians. Through feature fusion, action tokens are generated for downstream tasks.
  • Figure 4: Feature Discriminability of Hybrid Spatial Encoding. The figure illustrates three information modes: Cartesian-based, polar-based, and vector-based. Using chest compression and limb tilt as prototypes, the analysis reveals distinct signal-formative capabilities: sensitive modes generate structured point clusters that fit kinematic functions, while insensitive modes exhibit noise distributions. The hybrid architecture orchestrates dual-stream processing to adaptively harness these geometric discriminators.
  • Figure 5: Dynamics Visualization of MGR. Each ellipsoid represents a Gaussian distribution (mean $\mu$, covariance $\Sigma$) of the dynamics of a single joint or bone over time. In terms of stream-specific strengths, the joint Cartesian stream excels at capturing global trajectory consistency (e.g., palm trajectory), while the bone vector stream better encodes kinematic transitions (e.g., arm bend).
  • ...and 1 more figure
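The ellipsoids described in the Figure 5 caption can be related to the Gaussian parameters with a standard eigendecomposition. The sketch below is a generic illustration and is not taken from the paper: the function name `ellipsoid_axes` and the `n_std` scaling are assumptions. It shows how a covariance $\Sigma$ determines the orientation and radii of such an ellipsoid.

```python
import numpy as np


def ellipsoid_axes(sigma, n_std=1.0):
    """Principal axes of the ellipsoid implied by a covariance matrix.

    sigma: symmetric positive semi-definite array of shape (3, 3).
    n_std: number of standard deviations the ellipsoid should span
           along each principal direction (an illustrative choice).
    Returns (radii, directions): radii has shape (3,), sorted from longest
    to shortest; directions has shape (3, 3), with column i the unit vector
    of the axis whose radius is radii[i].
    """
    sigma = np.asarray(sigma, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(sigma)  # eigenvalues in ascending order
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative values
    order = np.argsort(eigvals)[::-1]         # reorder to longest axis first
    radii = n_std * np.sqrt(eigvals[order])
    directions = eigvecs[:, order]
    return radii, directions
```

An elongated ellipsoid (one radius much larger than the others) indicates motion concentrated along a single direction in the (time, space) coordinates, while a near-spherical one indicates no dominant direction.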