Prediction-Feedback DETR for Temporal Action Detection
Jihwan Kim, Miso Lee, Cheol-Ho Cho, Jihyun Lee, Jae-Pil Heo
TL;DR
Pred-DETR tackles attention collapse in DETR-based temporal action detection (TAD) with prediction-guided feedback: it aligns cross- and self-attention with IoU-based similarity maps derived from the model's own predictions, so that the collapsed attention is no longer the only signal the model relies on. The method reformulates cross-attention as a relation between decoder queries and regularizes attention through KL-divergence losses against the predicted IoU relations $P^d_{QQ}$ and their encoder-based counterpart $P^e_{QQ}$, with additional encoder guidance applied at training time. The full objective combines the standard DETR loss with three prediction-feedback terms. This improves attention diversity and localization on THUMOS14, ActivityNet-v1.3, HACS, and FineAction, achieving state-of-the-art results among DETR-based TAD methods. The work shows that aligning attention with the structure of the predictions can remedy collapse in end-to-end transformer-based temporal localization, and suggests that prediction-guided attention may apply more broadly in video understanding.
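To make the idea of IoU-based similarity maps and KL-divergence alignment concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes predictions are `(start, end)` segments, that the IoU map is turned into a per-query distribution by row normalization, and that the loss is KL(target || attention) averaged over queries; the summary above does not specify the normalization, the KL direction, or how the three feedback terms are weighted. All function names are hypothetical.

```python
import math


def pairwise_iou(segments):
    """Temporal IoU between every pair of (start, end) segments."""
    n = len(segments)
    iou = [[0.0] * n for _ in range(n)]
    for i, (s1, e1) in enumerate(segments):
        for j, (s2, e2) in enumerate(segments):
            inter = max(0.0, min(e1, e2) - max(s1, s2))
            union = (e1 - s1) + (e2 - s2) - inter
            iou[i][j] = inter / union if union > 0 else 0.0
    return iou


def row_normalize(matrix, eps=1e-8):
    """Turn each row into a probability distribution (assumed normalization)."""
    out = []
    for row in matrix:
        denom = sum(row) + eps * len(row)
        out.append([(v + eps) / denom for v in row])
    return out


def kl_divergence(target, attn, eps=1e-8):
    """Mean over rows of KL(target_row || attn_row) (assumed direction)."""
    total = 0.0
    for t_row, a_row in zip(target, attn):
        total += sum(
            t * math.log((t + eps) / (a + eps)) for t, a in zip(t_row, a_row)
        )
    return total / len(target)


# Three predicted segments: the first two overlap, the third is isolated.
segments = [(0.0, 2.0), (1.0, 3.0), (5.0, 6.0)]
iou_map = pairwise_iou(segments)
target = row_normalize(iou_map)

# A "collapsed" attention map: every query attends to the same key.
collapsed = row_normalize([[1.0, 0.0, 0.0] for _ in segments])

loss_collapsed = kl_divergence(target, collapsed)
loss_aligned = kl_divergence(target, target)
```

In this toy setting the collapsed map incurs a larger loss than a map that already matches the IoU structure, which is the kind of signal a prediction-feedback term would use to push attention away from collapse.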
Abstract
Temporal Action Detection (TAD) is a fundamental yet challenging task for real-world video applications. Leveraging the unique benefits of transformers, various DETR-based approaches have been adopted for TAD. However, it has recently been shown that attention collapse in self-attention degrades the performance of DETR for TAD. Building on this line of research, this paper addresses the attention collapse problem in cross-attention within DETR-based TAD methods. Moreover, our findings reveal that cross-attention exhibits patterns distinct from the predictions, indicating a shortcut phenomenon. To resolve this, we propose a new framework, Prediction-Feedback DETR (Pred-DETR), which uses predictions to recover from the collapse and to align the cross- and self-attention with the predictions. Specifically, we devise novel prediction-feedback objectives that use guidance from the relations among the predictions. As a result, Pred-DETR significantly alleviates the collapse and achieves state-of-the-art performance among DETR-based methods on various challenging benchmarks, including THUMOS14, ActivityNet-v1.3, HACS, and FineAction.
