EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding

Yuan-Ming Li, Wei-Jin Huang, An-Lan Wang, Ling-An Zeng, Jing-Ke Meng, Wei-Shi Zheng

TL;DR

The paper addresses the lack of datasets for egocentric full-body action understanding by introducing EgoExo-Fitness, a synchronized ego-exo video dataset of fitness activities. It provides two-level temporal boundaries and interpretable action-judgement annotations to support tasks across the "what", "when", and "how well" dimensions. Benchmarks are constructed for action classification, action localization, cross-view sequence verification, cross-view skill determination, and a newly proposed task, Guidance-based Execution Verification, with baselines that expose the challenges and opportunities of cross-view modeling. The dataset is intended to support cross-view modeling, action-guided feedback, and practical AI fitness coaching.

Abstract

We present EgoExo-Fitness, a new full-body action understanding dataset, featuring fitness sequence videos recorded from synchronized egocentric and fixed exocentric (third-person) cameras. Compared with existing full-body action understanding datasets, EgoExo-Fitness not only contains videos from first-person perspectives, but also provides rich annotations. Specifically, two-level temporal boundaries are provided to localize single action videos along with sub-steps of each action. More importantly, EgoExo-Fitness introduces innovative annotations for interpretable action judgement--including technical keypoint verification, natural language comments on action execution, and action quality scores. Combining all of these, EgoExo-Fitness provides new resources to study egocentric and exocentric full-body action understanding across dimensions of "what", "when", and "how well". To facilitate research on egocentric and exocentric full-body action understanding, we construct benchmarks on a suite of tasks (i.e., action classification, action localization, cross-view sequence verification, cross-view skill determination, and a newly proposed task of guidance-based execution verification), together with detailed analysis. Code and data will be available at https://github.com/iSEE-Laboratory/EgoExo-Fitness/tree/main.
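The two-level temporal boundaries mentioned above can be illustrated with a minimal sketch. It is not the released annotation format: the class name, field names, and the example action label are hypothetical. It assumes that all four timestamps are expressed in seconds on the timeline of the action sequence video, and that the two 2nd-level boundaries mark the start and end of the executing sub-step, so that together with the 1st-level boundaries they yield the three sub-steps named in Figure 3 (getting ready, executing, relaxing).

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ActionBoundaries:
    """Hypothetical container for one single action in a sequence video.

    Field names are illustrative only; they are not the dataset's schema.
    All times are assumed to be seconds on the sequence-video timeline.
    """
    action: str
    t_st: float       # 1st-level boundary: start of the single action
    t_ed: float       # 1st-level boundary: end of the single action
    t_sub_st: float   # 2nd-level boundary: assumed start of "executing"
    t_sub_ed: float   # 2nd-level boundary: assumed end of "executing"

    def __post_init__(self) -> None:
        # The 2nd-level boundaries must lie inside the 1st-level ones.
        if not (self.t_st <= self.t_sub_st <= self.t_sub_ed <= self.t_ed):
            raise ValueError("expected t_st <= t_sub_st <= t_sub_ed <= t_ed")

    def sub_steps(self) -> Dict[str, Tuple[float, float]]:
        """Split the single action into its three sub-step intervals."""
        return {
            "getting ready": (self.t_st, self.t_sub_st),
            "executing": (self.t_sub_st, self.t_sub_ed),
            "relaxing": (self.t_sub_ed, self.t_ed),
        }


# Usage with made-up numbers (not taken from the dataset).
example = ActionBoundaries("demo_action", t_st=10.0, t_ed=30.0,
                           t_sub_st=12.5, t_sub_ed=27.0)
for name, (start, end) in example.sub_steps().items():
    print(f"{name}: {start:.1f}s to {end:.1f}s")
```

Keeping both levels in one record makes it straightforward to crop either a single action video (1st level) or an individual sub-step (2nd level) from the same source video.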

Paper Structure (27 sections, 3 equations, 9 figures, 17 tables)

Figures (9)

  • Figure 1: An Overview of our work. (a) We introduce a new video dataset, namely EgoExo-Fitness, which features synchronized egocentric and exocentric videos of fitness activities to support future work on egocentric full-body action understanding. (b) EgoExo-Fitness provides abundant annotations, including two-level temporal boundaries and interpretable action judgement. (c) We benchmark EgoExo-Fitness on five relevant tasks. Zoom in for the best view.
  • Figure 2: The setup of our recording system. We capture forward and downward egocentric videos by developing a headset containing three action cameras. To record exocentric videos, three cameras are located at the front, left-front and right-front sides of the actor. Zoom in for the best view.
  • Figure 3: Overview of the annotation setup. (a) Two-level temporal boundaries are provided. Specifically, $1^{st}$-level boundaries ($t_{st}$ and $t_{ed}$) localize the single actions in the action sequence video (yielding single action videos). After that, $2^{nd}$-level boundaries ($t'_{st}$ and $t'_{ed}$) separate every single action into three sub-steps (i.e., getting ready, executing, and relaxing). (b) EgoExo-Fitness contains three types of annotations on action judgement: keypoint verification (KP: keypoint; Ver: verification result), natural language comments, and action quality scores. Zoom in for the best view.
  • Figure 4: Statistics of the proposed EgoExo-Fitness dataset.
  • Figure 5: Overview of GEVFormer. (a) GEVFormer takes an action video and technical keypoints as input, and outputs the verification result for each keypoint. (b) During training, a synchronized video alignment loss is adopted to force the model to learn consistent representations across synchronized videos from different views.
  • ...and 4 more figures