PHI: Bridging Domain Shift in Long-Term Action Quality Assessment via Progressive Hierarchical Instruction

Kanglei Zhou, Hubert P. H. Shum, Frederick W. B. Li, Xingxing Zhang, Xiaohui Liang

TL;DR

The paper tackles the domain-shift problem in long-term Action Quality Assessment by separating task-level and feature-level shifts and introducing Progressive Hierarchical Instruction (PHI). PHI combines Gap Minimization Flow (GMF) for shallow-to-deep feature adaptation with Temporally-Enhanced Self-Attention (TESA) and a List-wise Contrastive Regularization (LCR) module to enforce coarse-to-fine alignment, without requiring explicit target distributions. Empirical results on RG, Fis-V, and LOGO show state-of-the-art performance, with substantial gains over shift-unaware and task-level methods, and ablations validate the contributions of GMF, LCR, and KL-based alignment. The framework demonstrates strong robustness and efficiency, suggesting practical impact for long-form video scoring and potential extensions to multi-modal or short-term AQA tasks.

Abstract

Long-term Action Quality Assessment (AQA) aims to evaluate the quantitative performance of actions in long videos. However, existing methods face challenges due to domain shifts between the pre-trained large-scale action recognition backbones and the specific AQA task, thereby hindering their performance. This arises since fine-tuning resource-intensive backbones on small AQA datasets is impractical. We address this by identifying two levels of domain shift: task-level, regarding differences in task objectives, and feature-level, regarding differences in important features. For feature-level shifts, which are more detrimental, we propose Progressive Hierarchical Instruction (PHI) with two strategies. First, Gap Minimization Flow (GMF) leverages flow matching to progressively learn a fast flow path that reduces the domain gap between initial and desired features across shallow to deep layers. Additionally, a temporally-enhanced attention module captures long-range dependencies essential for AQA. Second, List-wise Contrastive Regularization (LCR) facilitates coarse-to-fine alignment by comprehensively comparing batch pairs to learn fine-grained cues while mitigating domain shift. Integrating these modules, PHI offers an effective solution. Experiments demonstrate that PHI achieves state-of-the-art performance on three representative long-term AQA datasets, proving its superiority in addressing the domain shift for long-term AQA.
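
The flow-matching idea behind GMF can be illustrated with a minimal sketch. It assumes the simplest common variant (a straight-line path between an initial feature and a desired feature, with a constant target velocity), which may differ from the paper's exact formulation. All names here (`interpolate`, `flow_matching_loss`, `euler_transport`, the toy vectors) are hypothetical, and the "oracle" velocity stands in for a learned network. Reading each Euler step as one shallow-to-deep stage is likewise an interpretation, not something the abstract states.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def euler_transport(velocity_fn, x0, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 using `steps` Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical toy features: x0 plays the backbone feature, x1 the desired one.
x0 = [0.0, 1.0, -2.0]
x1 = [1.0, 3.0, 0.5]

def oracle_velocity(x, t):
    """Stand-in for a trained velocity network (assumption for illustration)."""
    return target_velocity(x0, x1)

loss = flow_matching_loss(oracle_velocity, x0, x1, t=random.random())
moved = euler_transport(oracle_velocity, x0, steps=4)
```

With a perfect velocity field the loss is zero and a handful of Euler steps carries `x0` onto `x1`, which is the sense in which a straight path is "fast": few integration steps are needed when the learned velocity is close to constant along the path.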

Paper Structure

This paper contains 46 sections, 14 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Illustrations of our main idea: (a) The pre-trained I3D backbone emphasizes coarse features like guardrails (highlighted in yellow boxes), which may be irrelevant to scoring for AQA, even though it accurately recognizes cartwheeling in the action recognition domain. This discrepancy arises primarily from the pre-trained task's broader focus on coarse-level features, which leaves the fine-grained features essential for AQA under-exploited. (b) We identify two distinct types of domain shift: task-level discrepancies and feature-level discrepancies. (c) Based on two hypotheses, our approach introduces a shallow-to-deep adaptation using Gap Minimization Flow (GMF), enabling a fast and controllable path that minimizes the domain gap. Additionally, we introduce a coarse-to-fine alignment mechanism using List-wise Contrastive Regularization (LCR), which lets the model focus on the fine-grained features essential for AQA while mitigating domain shift by refining coarse features from the broader pre-trained task.
  • Figure 2: Framework of PHI: Our PHI framework addresses the domain shift issue through two crucial processes. Firstly, Gap Minimization Flow (GMF) progressively transforms the initial feature into the desired AQA-specific features, minimizing the domain gap. Secondly, List-wise Contrastive Regularization (LCR) guides the model towards subtle variations in actions, facilitating the transition from coarse to fine-grained features crucial for AQA. Finally, the refined feature is used to predict the quality score through an MLP.
  • Figure 3: Illustration of Temporally-Enhanced Transformer Encoder (TETE): Within Temporally-Enhanced Self-Attention (TESA), a low-rank matrix reduces the attention complexity from $\mathcal{O}(M^2D)$ to $\mathcal{O}(Md_{t}D)$, ensuring efficient modeling of long-term dependencies.
  • Figure 4: Illustration of List-wise Contrastive Regularization (LCR): LCR aligns the distance distribution in the feature space with that of the quality score space in a list-wise manner. This ensures comprehensive comparison and alignment across the entire batch of data, leading to robust performance.
  • Figure 5: Results of (a) SRCC and (b) R-$\ell_2$ on the impact of different steps. The symbol "$\uparrow$" indicates higher is better, while the symbol "$\downarrow$" indicates lower is better.
  • ...and 5 more figures
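
The list-wise alignment described in the Figure 4 caption can be sketched as follows. This is an illustrative reading rather than the paper's implementation: for each anchor sample, distances to the other samples in the batch are turned into a probability distribution, once in feature space and once in score space, and a KL divergence measures how far the two distributions are apart. The Euclidean distance, the temperature `tau`, the direction of the KL term, and the function names are all assumptions made for this sketch.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def listwise_kl(features, scores, tau=1.0):
    """Mean over anchors of KL(score-space list || feature-space list)."""
    n = len(features)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Negated distances: closer samples receive higher probability.
        feat_logits = [-euclidean(features[i], features[j]) / tau for j in others]
        score_logits = [-abs(scores[i] - scores[j]) / tau for j in others]
        p = softmax(score_logits)  # target list from quality scores
        q = softmax(feat_logits)   # list induced by the feature space
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return total / n

# Hypothetical toy batch: 1-D features and their quality scores.
batch_scores = [0.0, 1.0, 3.0]
aligned = listwise_kl([[0.0], [1.0], [3.0]], batch_scores)
misaligned = listwise_kl([[3.0], [1.0], [0.0]], batch_scores)
```

When feature distances mirror score differences (`aligned`), the regularizer vanishes; when they do not (`misaligned`), it is positive. Because every anchor is compared against the whole batch rather than a single pair, each sample contributes a full ranking signal per step.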