Interpretable Long-term Action Quality Assessment

Xu Dong; Xinran Liu; Wanqing Li; Anthony Adeyemi-Ejeye; Andrew Gilbert

TL;DR

This paper tackles the interpretability gap in long-term Action Quality Assessment by identifying Temporal Skipping in transformer decoders as a core issue. It introduces a DETR-inspired architecture with learnable, temporally encoded queries, an Attention Loss to preserve cross- and self-attention correlations, and a query initialization strategy to maintain temporal structure. A Weight-Score Regression Head decouples clip-level weight (difficulty) from score (execution quality), enabling finer, human-aligned interpretability and a robust final score computed across clips. The approach achieves state-of-the-art results on three long-term AQA benchmarks (RG, Fis-V, LOGO) and provides clearer, clip-level semantic explanations, with code available for reproducibility.
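The decoupling of clip-level weight and score can be illustrated with a small sketch. It assumes that each clip yields one raw weight value and one score in [0, 1], that weights are normalized with a softmax (an assumption made here for illustration, not stated in the source), and that the final score is the sum of the per-clip weight-score products. The function name `aggregate_weight_score` and its inputs are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def aggregate_weight_score(weight_logits, clip_scores):
    """Combine per-clip weights and scores into one video-level score.

    weight_logits: raw per-clip weight values (hypothetical head output).
    clip_scores:   per-clip execution-quality scores, assumed to lie in [0, 1].
    Softmax normalization of the weights is an assumption of this sketch.
    """
    weights = softmax(np.asarray(weight_logits, dtype=float))
    scores = np.asarray(clip_scores, dtype=float)
    contributions = weights * scores  # per-clip share of the final score
    return float(contributions.sum()), weights, contributions


# Four clips with equal raw weights: the final score is the mean clip score.
final, weights, contributions = aggregate_weight_score(
    [0.0, 0.0, 0.0, 0.0], [0.2, 0.4, 0.6, 0.8]
)
```

Because each clip's contribution is kept separately, the per-clip products can be plotted alongside the weights and scores, which is what makes the clip-level output inspectable.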

Abstract

Long-term Action Quality Assessment (AQA) evaluates the execution of activities in videos. However, the length presents challenges in fine-grained interpretability, with current AQA methods typically producing a single score by averaging clip features, lacking detailed semantic meanings of individual clips. Long-term videos pose additional difficulty due to the complexity and diversity of actions, exacerbating interpretability challenges. While query-based transformer networks offer promising long-term modeling capabilities, their interpretability in AQA remains unsatisfactory due to a phenomenon we term Temporal Skipping, where the model skips self-attention layers to prevent output degradation. To address this, we propose an attention loss function and a query initialization method to enhance performance and interpretability. Additionally, we introduce a weight-score regression module designed to approximate the scoring patterns observed in human judgments and replace conventional single-score regression, improving the rationality of interpretability. Our approach achieves state-of-the-art results on three real-world, long-term AQA benchmarks. Our code is available at: https://github.com/dx199771/Interpretability-AQA

Paper Structure (13 sections, 8 equations, 4 figures, 5 tables)

Introduction
Related Work
Action Quality Assessment
Method
Experiment
Datasets and Metrics
Implementation Details
Results and Analysis
Ablation Study
Effect of position encoding
Effect of variance in query initialization module
Sequence Interpretability
Conclusion

Figures (4)

Figure 1: Visualization of the clip-level weight-score regression method, illustrating that our network follows the same evaluative logic as human judges in real-world scenarios. The green curve (weight) indicates the significance of each action clip, the orange curve (score) quantifies the execution quality of the action, and the blue curve shows the overall score. All scores are normalized to the range 0 to 1 for easier comparison.
Figure 2: The Temporal Skipping problem in self-attention. The figure shows self-attention maps (subfigures fig:3 and fig:1, the latter ours) and visualizations of the segmented score of each clip (subfigures fig:4 and fig:2, the latter ours). Subfigures fig:3 and fig:4 represent the same action sequence, as do fig:1 and fig:2. In fig:3, the self-attention map suffers severely from the Temporal Skipping problem, whereas fig:1 shows high correlations between queries.
Figure 3: Overview of the architecture of our query-based transformer decoder. The input video is divided into clips and fed into a backbone network. A temporal decoder models the clip-level features into temporal representations via learnable, positionally encoded queries. The interpretable weight-score regression head regresses the final score by multiplying the weight and score of each clip. By minimizing the discrepancy between the self-attention map and the cross-attention map, together with query initialization, the problem of temporal collapse common in longer-term video sequences disappears, which improves human interpretability.
Figure 4: Visualization of our clip-level weight-score regression method on RG dataset.
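
The Figure 3 caption relates the decoder's self-attention map to its cross-attention map. The source does not give the loss formula, so the following is only a rough sketch under stated assumptions: the self-attention map has shape (queries, queries), the cross-attention map has shape (queries, clips), the cross-attention map is turned into a query-by-query correlation via cosine similarity between its rows, and the loss is one minus the cosine similarity of the two flattened matrices. The name `attention_alignment_loss` and all of these choices are hypothetical, not the authors' implementation.

```python
import numpy as np


def row_normalize(matrix, eps=1e-12):
    """Scale each row to unit L2 norm."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / (norms + eps)


def attention_alignment_loss(self_attn, cross_attn, eps=1e-12):
    """Penalize disagreement between self- and cross-attention structure.

    self_attn:  (Q, Q) query-to-query attention map.
    cross_attn: (Q, T) query-to-clip attention map.
    Assumption of this sketch: queries that attend to similar clips
    should also attend to each other.
    """
    normalized = row_normalize(np.asarray(cross_attn, dtype=float), eps)
    cross_corr = normalized @ normalized.T  # (Q, Q) cosine similarity of rows
    self_map = np.asarray(self_attn, dtype=float)
    dot = float((self_map * cross_corr).sum())
    denom = np.linalg.norm(self_map) * np.linalg.norm(cross_corr) + eps
    return 1.0 - dot / denom


# Each query attends to its own clip, so the cross-attention correlation is diagonal.
cross = np.eye(3)
aligned = attention_alignment_loss(np.eye(3), cross)
# A self-attention map that only attends to other queries disagrees completely.
skipped = attention_alignment_loss(np.roll(np.eye(3), 1, axis=1), cross)
```

In this toy case the aligned maps give a loss near zero and the mismatched maps give a loss near one, which is the behavior such a term would need in order to discourage self-attention from ignoring the temporal structure carried by cross-attention.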
