PHI: Bridging Domain Shift in Long-Term Action Quality Assessment via Progressive Hierarchical Instruction
Kanglei Zhou, Hubert P. H. Shum, Frederick W. B. Li, Xingxing Zhang, Xiaohui Liang
TL;DR
The paper tackles the domain-shift problem in long-term Action Quality Assessment (AQA) by separating task-level shifts from feature-level shifts and introducing Progressive Hierarchical Instruction (PHI) to address the latter. PHI has two parts. Gap Minimization Flow (GMF) adapts features progressively from shallow to deep layers and is paired with Temporally-Enhanced Self-Attention (TESA) to capture long-range dependencies. List-wise Contrastive Regularization (LCR) enforces coarse-to-fine alignment without requiring explicit target distributions. On RG, Fis-V, and LOGO, PHI reports state-of-the-art performance, with gains over both shift-unaware and task-level methods, and ablations support the contributions of GMF, LCR, and KL-based alignment. The reported robustness and efficiency suggest practical value for long-form video scoring, with possible extensions to multi-modal or short-term AQA tasks.
Abstract
Long-term Action Quality Assessment (AQA) aims to quantitatively evaluate how well actions are performed in long videos. However, existing methods suffer from domain shifts between large-scale pre-trained action recognition backbones and the specific AQA task, which hinders their performance. These shifts persist because fine-tuning resource-intensive backbones on small AQA datasets is impractical. We address this by identifying two levels of domain shift: task-level, concerning differences in task objectives, and feature-level, concerning differences in which features are important. For feature-level shifts, which are more detrimental, we propose Progressive Hierarchical Instruction (PHI) with two strategies. First, Gap Minimization Flow (GMF) leverages flow matching to progressively learn a fast flow path that reduces the domain gap between initial and desired features across shallow to deep layers. Within GMF, a temporally-enhanced attention module captures the long-range dependencies essential for AQA. Second, List-wise Contrastive Regularization (LCR) facilitates coarse-to-fine alignment by comprehensively comparing batch pairs to learn fine-grained cues while mitigating domain shift. Integrating these modules, PHI offers an effective solution. Experiments show that PHI achieves state-of-the-art performance on three representative long-term AQA datasets, demonstrating its effectiveness in addressing domain shift for long-term AQA.
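To make the flow-matching idea behind GMF more concrete, the following is a minimal sketch in Python with NumPy. It is a generic straight-path flow-matching toy, not the authors' implementation: the function names, the linear interpolation path, the Euler integrator, and the toy "initial" and "desired" features are all assumptions introduced here for illustration.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return x1 - x0

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    return float(np.mean((pred - target_velocity(x0, x1)) ** 2))

def euler_transport(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

# Toy data (hypothetical): "initial" features and "desired" features
# that differ by a constant shift of 2.0 in every dimension.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))
x1 = x0 + 2.0

def oracle(x, t):
    """Hand-written velocity field that is exact for this toy shift."""
    return np.full_like(x, 2.0)

loss = flow_matching_loss(oracle, x0, x1, t=0.3)
x_hat = euler_transport(oracle, x0, num_steps=4)
```

In this toy setting the exact velocity field is constant, so a few Euler steps already carry the initial features onto the desired ones. In a learned setting, `oracle` would be replaced by a trainable network fitted by minimizing `flow_matching_loss` over sampled values of `t`.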
