CoFInAl: Enhancing Action Quality Assessment with Coarse-to-Fine Instruction Alignment

Kanglei Zhou, Junlin Li, Ruizhi Cai, Liyuan Wang, Xingxing Zhang, Xiaohui Liang

TL;DR

Inspired by recent advances in large language model tuning, CoFInAl aligns AQA with broader pre-trained tasks by reformulating it as a coarse-to-fine classification task. This hierarchy mirrors the judging process and improves interpretability within the AQA framework.

Abstract

Action Quality Assessment (AQA) is pivotal for quantifying actions across domains like sports and medical care. Existing methods often rely on pre-trained backbones from large-scale action recognition datasets to boost performance on smaller AQA datasets. However, this common strategy yields suboptimal results due to the inherent struggle of these backbones to capture the subtle cues essential for AQA. Moreover, fine-tuning on smaller datasets risks overfitting. To address these issues, we propose Coarse-to-Fine Instruction Alignment (CoFInAl). Inspired by recent advances in large language model tuning, CoFInAl aligns AQA with broader pre-trained tasks by reformulating it as a coarse-to-fine classification task. Initially, it learns grade prototypes for coarse assessment and then utilizes fixed sub-grade prototypes for fine-grained assessment. This hierarchical approach mirrors the judging process, enhancing interpretability within the AQA framework. Experimental results on two long-term AQA datasets demonstrate CoFInAl achieves state-of-the-art performance with significant correlation gains of 5.49% and 3.55% on Rhythmic Gymnastics and Fis-V, respectively. Our code is available at https://github.com/ZhouKanglei/CoFInAl_AQA.

Paper Structure

This paper contains 13 sections, 16 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Motivation: (a) Previous methods often fine-tune large-scale pre-trained action recognition backbones, yielding suboptimal performance due to domain shift and overfitting. (b) Our method aligns AQA with broader tasks via coarse-to-fine instruction alignment, employing grade prototype learning and fine-grained sub-grade classification with a simplex Equiangular Tight Frame (ETF).
  • Figure 2: CoFInAl Framework: The input video is segmented into clips, and a shared backbone extracts features from each clip. The Temporal Fusion Module (TFM) enhances the clip features. The Grade Parsing Module (GPM) then separates the features into coarse-grained and fine-grained components. The coarse-grained score is predicted from these features by an MLP and the fine-grained score by the Fine-Grained Scoring (FGS) module. Finally, the predicted coarse-grained and fine-grained scores are coupled to form the final score. During training, the ground truth score is decoupled to supervise coarse-to-fine learning.
  • Figure 3: Illustration of Grade Parsing Module (GPM).
  • Figure 4: SRCC bars of the number of (a) grades and (b) sub-grades.
  • Figure 5: T-SNE feature distribution plots (a, b, d) and correlation comparison plots (c, e) contrasting GDLT with our CoFInAl method.
  • ...and 8 more figures
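
The Figure 2 caption says the ground truth score is decoupled for training and the two predictions are coupled at inference, but this summary does not give the exact rule. The sketch below shows one plausible reading, assuming uniform-width bins over a known score range. The function names, the `lo`/`hi` range and the bin-centre reconstruction are all illustrative assumptions, not the authors' implementation.

```python
def decouple_score(score, lo, hi, num_grades, num_subgrades):
    """Split a scalar score into (grade, sub-grade) indices.

    Assumption: the range [lo, hi] is cut into num_grades * num_subgrades
    equal-width bins; the paper may use a different partition.
    """
    total_bins = num_grades * num_subgrades
    width = (hi - lo) / total_bins
    # Clamp so that score == hi falls into the last bin instead of overflowing.
    idx = min(max(int((score - lo) / width), 0), total_bins - 1)
    # Quotient is the coarse grade, remainder is the sub-grade within it.
    return divmod(idx, num_subgrades)


def couple_score(grade, subgrade, lo, hi, num_grades, num_subgrades):
    """Map (grade, sub-grade) indices back to a score (centre of the bin)."""
    width = (hi - lo) / (num_grades * num_subgrades)
    return lo + (grade * num_subgrades + subgrade + 0.5) * width


if __name__ == "__main__":
    grade, subgrade = decouple_score(7.3, 0.0, 10.0, num_grades=5, num_subgrades=4)
    print(grade, subgrade)
    print(couple_score(grade, subgrade, 0.0, 10.0, 5, 4))
```

With this binning the reconstruction error is at most half a bin width, so the number of grades and sub-grades trades label granularity against the number of classes to learn.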

Theorems & Definitions (1)

  • Definition 1: Simplex Equiangular Tight Frame
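
The text of Definition 1 is not reproduced in this summary. As a reference point, the standard simplex ETF from the neural collapse literature is the set of columns of M = sqrt(K/(K-1)) U (I_K - (1/K) 1 1^T), where U has orthonormal columns. Those K columns are unit vectors with pairwise inner product -1/(K-1). The sketch below builds such fixed prototypes in plain Python, assuming U is the identity zero-padded to the feature dimension. That choice of U and the helper names are illustrative assumptions, not taken from the paper.

```python
import math


def simplex_etf(num_classes, dim):
    """Return num_classes fixed prototype vectors of length dim.

    Assumption: U is the first num_classes columns of the dim x dim identity,
    so this sketch requires dim >= num_classes.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if dim < num_classes:
        raise ValueError("this sketch requires dim >= num_classes")
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    prototypes = []
    for i in range(k):
        # Column i of scale * (I_K - (1/K) * ones), zero-padded to length dim.
        vec = [scale * ((1.0 if j == i else 0.0) - 1.0 / k) for j in range(k)]
        prototypes.append(vec + [0.0] * (dim - k))
    return prototypes


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nearest_prototype(feature, prototypes):
    """Classify a feature by its largest inner product with a prototype."""
    return max(range(len(prototypes)), key=lambda i: dot(feature, prototypes[i]))


if __name__ == "__main__":
    protos = simplex_etf(num_classes=4, dim=6)
    print(dot(protos[0], protos[0]))  # squared norm, close to 1
    print(dot(protos[0], protos[1]))  # pairwise inner product, close to -1/3
```

Because the prototypes are fixed and maximally separated, a classifier built on them has no prototype parameters to learn, which matches the abstract's description of "fixed sub-grade prototypes".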