Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning

Mengshi Qi, Yeteng Wu, Xianlin Zhang, Huadong Ma

TL;DR

This work defines Action Form Assessment (AFA) and introduces the CoT-AFA dataset together with the Explainable Fitness Assessor (EFA). EFA uses dual-branch multimodal fusion and a dynamic gating layer to jointly classify action form, assess quality, and generate Chain-of-Thought explanations grounded in predefined standard steps. The dataset provides rich hierarchical annotations and extensive CoT explanations, enabling interpretable feedback for fitness and martial arts actions. Empirical results show substantial gains in text-based explainability (CIDEr), action classification, and action quality assessment, highlighting the practical potential for explainable, actionable video analysis in real-world coaching and training contexts.

Abstract

Evaluating whether a human action is performed in a standard way and providing reasonable feedback to improve it is crucial but challenging in real-world scenarios. However, current video understanding methods mainly address what the action is and where it occurs, which does not meet these requirements. Meanwhile, most existing datasets lack labels indicating the degree of action standardization, and action quality assessment datasets lack explainability and detailed feedback. Therefore, we define a new Human Action Form Assessment (AFA) task and introduce a new, diverse dataset, CoT-AFA, which contains a large collection of fitness and martial arts videos with multi-level annotations for comprehensive video analysis. We enrich the CoT-AFA dataset with a novel Chain-of-Thought explanation paradigm. Instead of offering isolated feedback, our explanations provide a complete reasoning process, from identifying an action step to analyzing its outcome and proposing a concrete solution. Furthermore, we propose a framework named Explainable Fitness Assessor, which not only judges an action but also explains why and provides a solution. This framework employs two parallel processing streams and a dynamic gating mechanism to fuse visual and semantic information, thereby boosting its analytical capabilities. The experimental results demonstrate that our method achieves improvements in explanation generation (e.g., +16.0% in CIDEr), action classification (+2.7% in accuracy), and quality assessment (+2.1% in accuracy), revealing the great potential of CoT-AFA for future studies. Our dataset and source code are available at https://github.com/MICLAB-BUPT/EFA.
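
The abstract mentions a dynamic gating mechanism that fuses visual and semantic information but does not give its formula here. As a rough illustration only, the sketch below shows one common form of gated fusion: a sigmoid gate computed from both feature vectors decides, per dimension, how much of each branch to keep. The function name `gated_fusion`, its parameters, and the gating formula are assumptions made for this example and are not taken from the paper or its code release.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    visual: Sequence[float],
    text: Sequence[float],
    gate_weights: Sequence[Sequence[float]],
    gate_bias: Sequence[float],
) -> Vector:
    """Fuse two same-sized feature vectors with a per-dimension gate.

    Hypothetical formulation (an assumption, not the paper's equation):
        g     = sigmoid(W @ concat(visual, text) + b)
        fused = g * visual + (1 - g) * text
    """
    if len(visual) != len(text):
        raise ValueError("visual and text features must have the same size")
    joint = list(visual) + list(text)
    fused = []
    for i, (v, t) in enumerate(zip(visual, text)):
        # Row i of the gate weights scores the joint feature for dimension i.
        logit = sum(w * x for w, x in zip(gate_weights[i], joint)) + gate_bias[i]
        g = sigmoid(logit)
        fused.append(g * v + (1.0 - g) * t)
    return fused


# Toy inputs: with all-zero gate parameters every gate equals sigmoid(0) = 0.5,
# so the fused vector is the plain average of the two branches.
visual_feat = [1.0, 3.0]
text_feat = [3.0, 1.0]
zero_weights = [[0.0] * 4 for _ in range(2)]
zero_bias = [0.0, 0.0]
fused_feat = gated_fusion(visual_feat, text_feat, zero_weights, zero_bias)
print(fused_feat)  # [2.0, 2.0]
```

In a trained model the gate parameters would be learned, so the mix between the two branches could differ per sample and per feature dimension instead of staying at a fixed average.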

Paper Structure

This paper contains 20 sections, 11 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustrations of standard and non-standard action examples in our proposed CoT-AFA dataset. The green and red lines indicate the correct and wrong actions, respectively.
  • Figure 2: Three-level lexicon annotation structure of apparatus (left) and manual (right). The first colored layer outside the center of the circle represents martial arts and fitness. The next layer represents the workout type. The outermost layer refers to the action category.
  • Figure 3: Illustration of multi-element annotations of CoT-AFA dataset, which shows two categories of actions from different views, including front view, side view, and back view, with both standard and non-standard samples for each action.
  • Figure 4: The workflow of the textual explanation generation process for CoT-AFA: an LLM generates the Standard Technical Steps for each action, a VLM generates the Chain-of-Thought Text Explanations, and a VLM-based and human expert review process follows to ensure quality.
  • Figure 5: The architecture of our proposed Explainable Fitness Assessor (EFA). EFA receives video frames and text as input. Visual and text features are extracted by the backbone. The Multimodal Fusion Module is used to fuse visual and text information. Finally, the framework outputs the action quality score, action class and Chain-of-Thought text explanations.
  • ...and 2 more figures