Classification Matters: Improving Video Action Detection with Class-Specific Attention
Jinsung Lee, Taeoh Kim, Inwoong Lee, Minho Shim, Dongyoon Wee, Minsu Cho, Suha Kwak
TL;DR
This work reframes video action detection as a classification-centric problem by introducing class-specific attention, in which class queries guide where the model looks for action cues. A 3D Deformable Transformer Encoder processes multi-scale spatio-temporal features, and a dual-decoder design with a Localizing Decoder Layer and a Classifying Decoder Layer localizes actors and classifies their actions simultaneously, with class queries drawing on action-specific context beyond the actor regions. Training uses Hungarian matching and a composite loss to align boxes, classes, and confidences, yielding state-of-the-art accuracy with greater efficiency on AVA, JHMDB-21, and UCF101-24. The approach provides interpretable class-specific attention maps, supports robust classification, and is computationally efficient for long video tubes. Limitations include the lack of explicit temporal information exchange in the decoder, which points to future work on memory-efficient temporal modeling.
Abstract
Video action detection (VAD) aims to detect actors and classify their actions in a video. We find that VAD suffers more from classification than from localization of actors. Hence, we analyze how prevailing methods form features for classification and find that they prioritize actor regions, yet often overlook the contextual information necessary for accurate classification. Accordingly, we propose to reduce the bias toward the actor and to encourage attention to the context that is relevant to each action class. By assigning a class-dedicated query to each action class, our model can dynamically determine where to focus for effective classification. The proposed model demonstrates superior performance on three challenging benchmarks with significantly fewer parameters and less computation.
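The idea of a class-dedicated query can be illustrated with a minimal sketch. It is not the authors' architecture: it uses plain single-head dot-product attention over a flat list of context tokens, with random vectors standing in for learned parameters. The names `class_specific_attention`, `classify`, `class_queries`, `context_tokens`, and `class_heads` are hypothetical, and the per-class sigmoid scoring is an assumption made for illustration.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def class_specific_attention(class_queries, context_tokens):
    """Let each class query attend over all context tokens.

    Returns one attention map and one pooled feature per class, so each
    class can look at a different part of the scene.
    """
    dim = len(context_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    attention_maps, class_features = [], []
    for query in class_queries:
        weights = softmax([dot(query, tok) * scale for tok in context_tokens])
        feature = [
            sum(w * tok[d] for w, tok in zip(weights, context_tokens))
            for d in range(dim)
        ]
        attention_maps.append(weights)
        class_features.append(feature)
    return attention_maps, class_features


def classify(class_features, class_heads):
    """Score each class from its own feature (assumed sigmoid per class)."""
    return [
        1.0 / (1.0 + math.exp(-dot(feat, head)))
        for feat, head in zip(class_features, class_heads)
    ]


# Toy setup: random vectors stand in for learned queries and video features.
random.seed(0)
num_classes, num_tokens, dim = 3, 6, 4
rand_vec = lambda: [random.gauss(0.0, 1.0) for _ in range(dim)]
class_queries = [rand_vec() for _ in range(num_classes)]
context_tokens = [rand_vec() for _ in range(num_tokens)]
class_heads = [rand_vec() for _ in range(num_classes)]

attention_maps, class_features = class_specific_attention(class_queries, context_tokens)
scores = classify(class_features, class_heads)
```

Each row of `attention_maps` is a distribution over context tokens for one class, which is the kind of per-class map the summary describes as interpretable. In this sketch the queries see the whole token set rather than only actor regions, which is the point the abstract makes about context.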
