FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment

Hao Yin; Lijun Gu; Paritosh Parmar; Lin Xu; Tianxiao Guo; Weiwei Fu; Yang Zhang; Tianyou Zheng

TL;DR

The paper introduces FLEX, a large-scale, multimodal dataset for fitness Action Quality Assessment (AQA) that pairs five-view RGB video, 3D pose, surface EMG, and physiological signals across 20 weight-loaded actions performed by 38 subjects. It builds a Fitness Knowledge Graph to ground action steps, errors, and corrective feedback, and couples this with a compositional scoring function and a FLEX-VideoQA benchmark to enable structured, multimodal reasoning. Baseline experiments demonstrate that multimodal inputs and multiview video significantly enhance AQA performance, while the VideoQA and Video2EMG tasks showcase cross-modal reasoning and muscle-activity estimation from video. The dataset supports biomechanically grounded representation learning and interpretable quality assessment, with potential for AI-powered fitness coaching tools; data and code are publicly available via the provided links.

Abstract

Action Quality Assessment (AQA) -- the task of quantifying how well an action is performed -- has great potential for detecting errors in gym weight training, where accurate feedback is critical to prevent injuries and maximize gains. Existing AQA datasets, however, are limited to single-view competitive sports and RGB video, lacking multimodal signals and professional assessment of fitness actions. We introduce FLEX, the first large-scale, multimodal, multiview dataset for fitness AQA that incorporates surface electromyography (sEMG). FLEX contains over 7,500 multiview recordings of 20 weight-loaded exercises performed by 38 subjects of diverse skill levels, with synchronized RGB video, 3D pose, sEMG, and physiological signals. Expert annotations are organized into a Fitness Knowledge Graph (FKG) linking actions, key steps, error types, and feedback, supporting a compositional scoring function for interpretable quality assessment. FLEX enables multimodal fusion, cross-modal prediction -- including the novel Video$\rightarrow$EMG task -- and biomechanically oriented representation learning. Building on the FKG, we further introduce FLEX-VideoQA, a structured question-answering benchmark with hierarchical queries that drive cross-modal reasoning in vision-language models. Baseline experiments demonstrate that multimodal inputs, multiview video, and fine-grained annotations significantly enhance AQA performance. FLEX thus advances AQA toward richer multimodal settings and provides a foundation for AI-powered fitness assessment and coaching. Dataset and code are available at https://github.com/HaoYin116/FLEX. Project page: https://haoyin116.github.io/FLEX_Dataset.
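
To make the idea of a knowledge graph that links actions, key steps, error types, and feedback more concrete, the following is a minimal illustrative sketch in Python. It is not the FLEX implementation: the class names, the `penalty` weights, the subtractive scoring rule in `score_action`, and the example exercise with its steps and errors are all assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class ErrorType:
    name: str
    feedback: str   # corrective feedback attached to this error
    penalty: float  # hypothetical weight, not taken from the paper


@dataclass
class KeyStep:
    name: str
    errors: List[ErrorType] = field(default_factory=list)


@dataclass
class Action:
    name: str
    steps: List[KeyStep] = field(default_factory=list)


def score_action(action: Action, observed: Set[str],
                 max_score: float = 100.0) -> Tuple[float, List[str]]:
    """Assumed scoring rule: start at max_score and subtract the penalty
    of every observed error, collecting the linked feedback per step."""
    score = max_score
    feedback: List[str] = []
    for step in action.steps:
        for err in step.errors:
            if err.name in observed:
                score -= err.penalty
                feedback.append(f"{step.name}: {err.feedback}")
    return max(score, 0.0), feedback


# Hypothetical example action; names and weights are made up.
curl = Action("barbell curl", [
    KeyStep("lift", [ErrorType("elbow_drift",
                               "Keep elbows close to the torso.", 10.0)]),
    KeyStep("lower", [ErrorType("dropping_weight",
                                "Lower the bar under control.", 15.0)]),
])

score, feedback = score_action(curl, {"elbow_drift"})
print(score, feedback)
```

Because the score is composed from per-step, per-error terms, each deduction can be traced back to a specific step and paired with its feedback, which is the kind of interpretability the abstract attributes to a compositional scoring function.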
