FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment

Hao Yin, Lijun Gu, Paritosh Parmar, Lin Xu, Tianxiao Guo, Weiwei Fu, Yang Zhang, Tianyou Zheng

TL;DR

The paper introduces FLEX, a large-scale, multimodal dataset for fitness Action Quality Assessment (AQA) that pairs five-view RGB video, 3D pose, surface EMG, and physiological signals across 20 weight-loaded actions performed by 38 subjects. It builds a Fitness Knowledge Graph to ground action steps, errors, and corrective feedback, and couples this with a compositional scoring function and a FLEX-VideoQA benchmark to enable structured, multimodal reasoning. Baseline experiments demonstrate that multimodal inputs and multiview video significantly enhance AQA performance, while the VideoQA and Video2EMG tasks showcase cross-modal reasoning and muscle-activity estimation from video. The dataset supports biomechanically grounded representation learning and interpretable quality assessment, with potential for AI-powered fitness assessment and coaching tools; data and code are publicly available via the provided links.

Abstract

Action Quality Assessment (AQA) -- the task of quantifying how well an action is performed -- has great potential for detecting errors in gym weight training, where accurate feedback is critical to prevent injuries and maximize gains. Existing AQA datasets, however, are limited to single-view competitive sports and RGB video, lacking multimodal signals and professional assessment of fitness actions. We introduce FLEX, the first large-scale, multimodal, multiview dataset for fitness AQA that incorporates surface electromyography (sEMG). FLEX contains over 7,500 multiview recordings of 20 weight-loaded exercises performed by 38 subjects of diverse skill levels, with synchronized RGB video, 3D pose, sEMG, and physiological signals. Expert annotations are organized into a Fitness Knowledge Graph (FKG) linking actions, key steps, error types, and feedback, supporting a compositional scoring function for interpretable quality assessment. FLEX enables multimodal fusion, cross-modal prediction -- including the novel Video$\rightarrow$EMG task -- and biomechanically oriented representation learning. Building on the FKG, we further introduce FLEX-VideoQA, a structured question-answering benchmark with hierarchical queries that drive cross-modal reasoning in vision-language models. Baseline experiments demonstrate that multimodal inputs, multiview video, and fine-grained annotations significantly enhance AQA performance. FLEX thus advances AQA toward richer multimodal settings and provides a foundation for AI-powered fitness assessment and coaching. Dataset and code are available at https://github.com/HaoYin116/FLEX. Project page: https://haoyin116.github.io/FLEX_Dataset.

Paper Structure

This paper contains 44 sections, 3 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: An overview of the FLEX dataset. The FLEX dataset consists of a core group of 38 subjects, each performing 20 different fitness actions and repeating each action 10 times. Each repetition was recorded from 5 viewpoints, and sEMG signals and physiological parameters (heart rate, breath rate) were collected simultaneously with the videos. The data annotations contain rich text information such as action keysteps (AK), error types (ET), and action feedback. (Zoom in for the best view.)
  • Figure 2: Comparison between FLEX and existing SOTA fitness datasets. EWL: equipment (barbell, dumbbell, etc.)-based weight loading; RI: level of risk of injury.
  • Figure 3: Data collection environment. Four cinema cameras and one smartphone were fixed at the four corners of the collection area. Video, sEMG, heart rate, and breath rate were recorded synchronously during collection.
  • Figure 4: Annotation Process. Annotators were trained on the provided guidelines and received centralized instruction to ensure full understanding of the rules. The video data was segmented following predetermined criteria, and a two-stage annotation process was implemented to reduce annotation errors and mitigate subjective bias.
  • Figure 5: Performance of AQA models on the FLEX dataset. Metric: $R\text{-}\ell_2\ (\times 100)$.
  • ...and 9 more figures
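
The Figure 5 caption reports a relative L2 metric scaled by 100, but this summary does not define it. The sketch below assumes the definition commonly used in the AQA literature: the mean squared error between predicted and ground-truth scores after normalizing by the score range. The function name `relative_l2` and its parameters are hypothetical and not taken from the FLEX code; check the paper for the exact formulation used.

```python
from typing import Sequence


def relative_l2(
    pred: Sequence[float],
    target: Sequence[float],
    score_min: float,
    score_max: float,
) -> float:
    """Relative L2 distance (x100) between predicted and true scores.

    Assumed definition (not confirmed by this summary):
        100 * mean(((pred_i - target_i) / (score_max - score_min)) ** 2)
    Lower is better.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")
    span = score_max - score_min
    if span <= 0:
        raise ValueError("score_max must be greater than score_min")
    # Normalize each error by the score range before squaring.
    squared = [((p - t) / span) ** 2 for p, t in zip(pred, target)]
    return 100.0 * sum(squared) / len(squared)


# Illustrative values only; these are not FLEX results.
example = relative_l2([5.0, 7.0], [5.0, 9.0], score_min=0.0, score_max=10.0)
```

Normalizing by the score range makes the metric comparable across actions whose scores span different intervals, which is why it is usually reported alongside rank correlation.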