Classification Matters: Improving Video Action Detection with Class-Specific Attention

Jinsung Lee, Taeoh Kim, Inwoong Lee, Minho Shim, Dongyoon Wee, Minsu Cho, Suha Kwak

TL;DR

This work reframes video action detection as a classification-centric problem by introducing class-specific attention through class queries that guide where the model should look for action cues. A 3D Deformable Transformer Encoder processes multi-scale spatio-temporal features, while a dual-decoder design with a Localizing Decoder Layer and a Classifying Decoder Layer performs actor localization and class-specific classification simultaneously, with class queries ensuring that action-specific context beyond the actor regions is used. Training relies on Hungarian matching and a composite loss to align boxes, classes, and confidences, yielding state-of-the-art accuracy with greater efficiency on AVA, JHMDB-21, and UCF101-24. The approach provides interpretable, class-specific attention maps and remains computationally efficient for long video tubes. Limitations include the lack of explicit temporal information exchange in the decoder, pointing to future work on memory-efficient temporal modeling.
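
The matching step mentioned above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' formulation: the cost combines an L1 box distance with the negative mean predicted probability of the ground-truth classes, the weights `w_box` and `w_cls` are hypothetical, and an exhaustive search over assignments stands in for the Hungarian algorithm (it finds the same minimum-cost assignment, but is only practical for a handful of actors).

```python
from itertools import permutations

def match_cost(pred, gt, w_box=1.0, w_cls=1.0):
    """Composite cost of assigning one prediction to one ground-truth actor.

    Assumed terms (illustrative only): L1 distance between boxes plus the
    negative mean predicted probability of the actor's labeled classes.
    """
    box_cost = sum(abs(p - g) for p, g in zip(pred["box"], gt["box"]))
    cls_cost = -sum(pred["probs"][c] for c in gt["classes"]) / len(gt["classes"])
    return w_box * box_cost + w_cls * cls_cost

def match(preds, gts):
    """Return (pred_index, gt_index) pairs with minimum total cost.

    Exhaustive search replaces the Hungarian algorithm here; it assumes
    len(preds) >= len(gts) and a small number of actors.
    """
    best, best_cost = (), float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return [(p, g) for g, p in enumerate(best)]

# Toy data: three predictions, two ground-truth actors, three action classes.
preds = [
    {"box": (0.1, 0.1, 0.4, 0.9), "probs": [0.9, 0.1, 0.2]},
    {"box": (0.5, 0.2, 0.9, 0.8), "probs": [0.1, 0.8, 0.7]},
    {"box": (0.0, 0.0, 1.0, 1.0), "probs": [0.3, 0.3, 0.3]},
]
gts = [
    {"box": (0.5, 0.2, 0.9, 0.8), "classes": [1, 2]},  # multi-label actor
    {"box": (0.1, 0.1, 0.4, 0.9), "classes": [0]},
]
pairs = match(preds, gts)
```

Matched pairs would then receive the box and classification losses, while unmatched predictions (here, the third one) would be trained only toward low confidence. For realistic numbers of predictions, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` is the usual choice.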

Abstract

Video action detection (VAD) aims to detect actors and classify their actions in a video. We observe that VAD suffers more from classification than from localization of actors. Hence, we analyze how prevailing methods form features for classification and find that they prioritize actor regions, yet often overlook the contextual information necessary for accurate classification. Accordingly, we propose to reduce the bias toward the actor and encourage attention to the context that is relevant to each action class. By assigning a class-dedicated query to each action class, our model can dynamically determine where to focus for effective classification. The proposed model demonstrates superior performance on three challenging benchmarks with significantly fewer parameters and less computation.

Paper Structure

This paper contains 29 sections, 15 equations, 17 figures, and 13 tables.

Figures (17)

  • Figure 1: Detection performance changes of the state-of-the-art methods (i.e., TubeR [zhao2022tuber], EVAD [chen2023efficient], STMixer [wu2023stmixer]) on AVA [gu2018ava] when ground-truth boxes or class labels are given.
  • Figure 1: Ablation experiments on each module of the model.
  • Figure 2: Sample detection results and classification attention maps of the previous transformer-based models, TubeR [zhao2022tuber] and EVAD [chen2023efficient], and of our model. Each attention map shows the regions the model attends to when classifying the action of the actor marked by the bounding box of the same color. Since our model creates an attention map for each class, we mark the corresponding label under the map. Best viewed in color.
  • Figure 3: Overview of the proposed model
  • Figure 4: Structure of the transformer decoder layers of our model. We use $\texttt{c}$, $\odot$, and $\oplus$ to indicate concatenation, multiplication, and summation, respectively. In \ref{fig:CDL_simple}, we denote variables with the actor index $i$ to describe the process more simply.
  • ...and 12 more figures