Prediction-Feedback DETR for Temporal Action Detection

Jihwan Kim; Miso Lee; Cheol-Ho Cho; Jihyun Lee; Jae-Pil Heo

TL;DR

Pred-DETR tackles attention collapse in DETR-based temporal action detection (TAD) with prediction-guided feedback: cross- and self-attention are aligned with IoU-based similarity maps computed from the model's own predictions, which stay diverse even when the attention has collapsed. The method reformulates cross-attention as a relation between decoder queries and uses the IoU relation of decoder predictions, $P^d_{QQ}$, and of encoder predictions, $P^e_{QQ}$, to regularize attention through KL-divergence losses; the encoder guidance is used only during training. The full objective combines the standard DETR loss with three prediction-feedback terms. This improves attention diversity and localization on THUMOS14, ActivityNet-v1.3, HACS, and FineAction, giving state-of-the-art results among DETR-based TAD methods. The work shows that aligning attention with the structure of the predictions can remedy collapse in end-to-end transformer-based temporal localization, and suggests that prediction-guided attention may apply more broadly in video understanding.
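The feedback idea summarized above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes predictions are 1D segments `(start, end)`, that the IoU map is normalized by row sums, and that the loss is KL(target || attention) averaged over queries; the function names, the normalization, and the KL direction are all assumptions made for illustration.

```python
import math


def temporal_iou(a, b):
    """IoU of two 1D segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def iou_similarity_map(segments):
    """Pairwise IoU between predictions, row-normalized (assumed scheme)."""
    n = len(segments)
    rows = []
    for i in range(n):
        row = [temporal_iou(segments[i], segments[j]) for j in range(n)]
        total = sum(row)
        if total == 0:
            # Degenerate segment: fall back to a uniform row.
            rows.append([1.0 / n] * n)
        else:
            rows.append([v / total for v in row])
    return rows


def kl_divergence(target, attn, eps=1e-8):
    """KL(target || attn) for two discrete distributions of equal length."""
    return sum(
        t * math.log((t + eps) / (a + eps))
        for t, a in zip(target, attn)
        if t > 0
    )


def feedback_loss(segments, attention):
    """Mean KL between the IoU target map and a query-to-query attention map."""
    target = iou_similarity_map(segments)
    per_query = [kl_divergence(t, a) for t, a in zip(target, attention)]
    return sum(per_query) / len(per_query)


# Three hypothetical predictions: two overlapping, one isolated.
segments = [(0.0, 2.0), (1.0, 3.0), (5.0, 6.0)]
target = iou_similarity_map(segments)

# A "collapsed" map: every query attends mostly to the first query.
collapsed = [[0.8, 0.1, 0.1] for _ in segments]

loss_aligned = feedback_loss(segments, target)
loss_collapsed = feedback_loss(segments, collapsed)
```

In this toy setup the loss is zero when the attention rows equal the IoU rows and grows when every query attends to the same position, which is the kind of signal a KL-based feedback term would give a collapsed attention map.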

Abstract

Temporal Action Detection (TAD) is fundamental yet challenging for real-world video applications. Leveraging the unique benefits of transformers, various DETR-based approaches have been adopted in TAD. However, attention collapse in self-attention was recently identified as a cause of performance degradation of DETR for TAD. Building upon previous research, this paper addresses the attention collapse problem in cross-attention within DETR-based TAD methods. Moreover, our findings reveal that cross-attention exhibits patterns distinct from the predictions, indicating a shortcut phenomenon. To resolve this, we propose a new framework, Prediction-Feedback DETR (Pred-DETR), which uses predictions to recover from the collapse and align the cross- and self-attention with the predictions. Specifically, we devise novel prediction-feedback objectives using guidance from the relations of the predictions. As a result, Pred-DETR significantly alleviates the collapse and achieves state-of-the-art performance among DETR-based methods on various challenging benchmarks including THUMOS14, ActivityNet-v1.3, HACS, and FineAction.

Paper Structure (16 sections, 12 equations, 7 figures, 7 tables)

Introduction
Related Work
Temporal Action Detection
DETR
Our Approach
Preliminary
Prediction-Feedback
Objectives
Experiments
Datasets
Implementation Details
Main Results
Analysis
Conclusion
Additional Details
...and 1 more section

Figures (7)

Figure 1: Attention collapse problem. The figure depicts the cross- ((a), (c)) and self-attention maps ((e), (g)) of the decoder as well as the predictions ((b), (d)) and their normalized IoU similarity map ((f), (h)). DETR for TAD with standard attention severely suffers from the attention collapse in its cross-attention and self-attention ((a), (e)). The collapsed attention focuses on a few encoder features (a) or decoder queries (e) regardless of the DETR predictions ((b), (f)).
Figure 2: Overall architecture of the proposed framework, Pred-DETR. The figure illustrates the entire framework of our model, Pred-DETR. Pred-DETR consists of the two main parts: DETR architecture and prediction-feedback. The encoder and decoder predictions are converted to the relation of Intersection-over-Union (IoU). Then these IoU maps are utilized for prediction-feedback for the collapsed self- and cross-attention. Note that the encoder predictions are deployed only for training.
Figure 3: Prediction-Feedback. This illustrates the detailed mechanism of prediction-feedback for the cross-attention. The DETR predictions are diverse thanks to the bipartite matching. By aligning attention with the IoU relation from the predictions, the query relation is recovered, alleviating the attention collapse.
Figure 4: Diversity of attention maps. Diversity for cross- and self-attention for test samples of ActivityNet-v1.3.
Figure 5: Attention maps. The figure shows self- and cross-attention maps from a test sample in ActivityNet-v1.3.
...and 2 more figures
