
ContextDet: Temporal Action Detection with Adaptive Context Aggregation

Ning Wang, Yun Xiao, Xiaopeng Peng, Xiaojun Chang, Xuanhong Wang, Dingyi Fang

TL;DR

This model features a pyramid adaptive context aggregation (ACA) architecture, capturing long context and improving action discriminability, and introduces a single-stage ContextDet framework, which makes use of large-kernel convolutions in TAD for the first time.

Abstract

Temporal action detection (TAD), which locates and recognizes action segments, remains a challenging task in video understanding due to variable segment lengths and ambiguous boundaries. Existing methods treat neighboring contexts of an action segment indiscriminately, leading to imprecise boundary predictions. We introduce a single-stage ContextDet framework, which makes use of large-kernel convolutions in TAD for the first time. Our model features a pyramid adaptive context aggregation (ACA) architecture, capturing long context and improving action discriminability. Each ACA level consists of two novel modules. The context attention module (CAM) identifies salient contextual information, encourages context diversity, and preserves context integrity through a context gating block (CGB). The long context module (LCM) makes use of a mixture of large- and small-kernel convolutions to adaptively gather long-range context and fine-grained local features. Additionally, by varying the length of these large kernels across the ACA pyramid, our model provides lightweight yet effective context aggregation and action discrimination. We conducted extensive experiments and compared our model with a number of advanced TAD methods on six challenging TAD benchmarks: MultiThumos, Charades, FineAction, EPIC-Kitchens 100, Thumos14, and HACS, demonstrating superior accuracy with reduced inference time.
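To make the pyramid and the mixture of large- and small-kernel convolutions more concrete, the following is a minimal single-channel sketch in plain Python, not the authors' implementation. Several details are assumptions made only for illustration: the function names (`conv1d_same`, `long_context_module`, `aca_pyramid`), the use of box (averaging) kernels in place of learned weights, fusion of the branches by a plain mean, stride-2 subsampling as the downsampling step, and the kernel-length schedule. The CAM and CGB are omitted.

```python
def conv1d_same(x, kernel):
    """Zero-padded 1D cross-correlation that preserves len(x); odd kernel lengths assumed."""
    pad = len(kernel) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - pad
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def long_context_module(x, large_k, small_ks):
    """One large-kernel branch plus len(small_ks) small-kernel branches.

    Box (averaging) kernels stand in for learned weights, and the branches
    are fused by a plain mean; both choices are assumptions.
    """
    kernel_sizes = [large_k] + list(small_ks)
    branches = [conv1d_same(x, [1.0 / k] * k) for k in kernel_sizes]
    return [sum(vals) / len(branches) for vals in zip(*branches)]


def aca_pyramid(x, large_kernels, small_ks=(3, 5)):
    """Run one simplified level per entry in large_kernels."""
    outputs = []
    for large_k in large_kernels:
        x = x[::2]  # halve the temporal length (stride-2 subsampling, assumed)
        x = long_context_module(x, large_k, small_ks)
        outputs.append(x)
    return outputs


if __name__ == "__main__":
    feature = [float(t % 8) for t in range(64)]  # toy single-channel feature
    # Hypothetical large-kernel schedule; the paper only states that the length varies.
    levels = aca_pyramid(feature, large_kernels=[31, 15, 7, 3, 3])
    print([len(level) for level in levels])
```

With a toy input of length 64, the five levels have temporal lengths 32, 16, 8, 4, and 2. Because each level already halves the sequence, a shorter large kernel at a deeper level can still cover a long span of the original video, which is one plausible reading of why varying the kernel length keeps the model lightweight.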

Paper Structure

This paper contains 16 sections, 14 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Accurate detection of action segments from a video sequence relies on discriminating salient information from long-term context. In addition to identifying the salient context within a long context, preserving context integrity and diversity, as well as fine-grained local features, is equally important. For example, distinguishing actions such as high jump and long jump may benefit from recognizing the most salient and relevant contexts. On the other hand, ensuring the completeness of the long-range context, which includes information of diverse relevance, may also provide significant cues for detecting actions such as getting a haircut.
  • Figure 2: Illustration of the proposed single-stage ContextDet model for temporal action detection (TAD). (a) The architecture of the ContextDet model comprises a pre-trained video backbone (e.g., I3D or VideoMAEv2) as the feature extractor $FE$, a convolutional projection layer $FP$, five adaptive context aggregation (ACA) levels $i = 1,...,5$, and a pre-trained action detection head. The TAD pipeline starts by extracting video features from an input video $V = \{v_t \mid t = 0,...,T\}$ with a total of $T$ frames. These video features are projected by the convolution layer $FP$, producing the projected video feature $F_0\in\mathbb{R}^{T_{0}\times D}$. This feature passes through the ACA levels, and the output of each ACA level is used to predict actions $\hat{\phi_t} = (\hat{s}_t, \hat{c}_t, \hat{e}_t)$. (b) Each ACA level starts with a downsampling layer, which halves the dimension of the input feature $F_{i-1}$. The downsampled feature $F_{i-1}\downarrow$ then passes through a layer normalization (LN) layer. The following context attention module (CAM) consists of a Q-branch and a K-branch, which are split off by the linear layers $L_Q$ and $L_K$ respectively. The output of the Q-branch is sent into a context gating block (CGB), producing a gated feature $G_i$. The output of the CAM, $A_i$, is the K-branch output $K_i$ modulated by the gated feature $G_i$. The long context module (LCM) makes use of a large-kernel convolution and $N$ small-kernel convolutions to capture the long-range context and fine-grained local features respectively. This context information is fused to produce the output $D_i$. The input video feature, the salient context, and the long context information are then fused by a final LN and an MLP layer, producing the output video feature $F_i$.
(c) Illustration of the CGB, where the features $Z_i = \{z_{i,m}\}$ are extracted by a set of $M$ depth-wise convolution kernels $\mathrm{DWConv}_m$ and modulated by the corresponding weights $W_i = \{w_{i,m}\}$ for $m = 1,...,M$. These convolution kernels vary in scale. The weights $W_i$ are calculated from the features $Z_i$ by fusing the max-pooling and average-pooling features via an additional Conv-Sigmoid layer. These weights modulate the CGB features to capture the context saliency that is most relevant to the action while preserving contextual integrity and diversity.
  • Figure 3: Qualitative evaluations of our ContextDet model and the TriDet [shi2023tridet] model on two video clips from the THUMOS14 dataset, showing the actions playing billiards and long jump respectively. In each case, the yellow bar represents the ground truth, and the green and pink bars indicate the detection results of our model and the TriDet model respectively. Our model predicts the starting point, the ending point, and the duration of the actions more accurately in both cases.
  • Figure 4: Qualitative results of our ContextDet model with VideoMAEv2 [wang2023videomaev2] features on four video clips (a)-(d) from the Thumos14 test set. The red bars above the line represent the ground truth, and the blue bars below show the predicted action segments with the top 20 accuracies. The darkness of the color indicates the degree of overlap between the results and the ground truth.
  • Figure 5: Sensitivity analysis of (a) the baseline model [Lin_2021_CVPR1] and (b) our ContextDet model with respect to action characteristics. Left: each bar measures the average-$\mathrm{mAP_N}$ value at tIoU=0.5 on a subset of the Thumos14 dataset that features a particular action characteristic. The dotted lines indicate the mean average-$\mathrm{mAP_N}$. Right: a summary of the left panel, where the sensitivity is given by the difference between the maximum and minimum average-$\mathrm{mAP_N}$ values.
  • ...and 2 more figures
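The two quantities named in the Figure 5 caption can be sketched in a few lines of Python. The snippet shows the standard temporal IoU (tIoU) between a predicted and a ground-truth segment, and the sensitivity as the caption defines it (maximum minus minimum average-$\mathrm{mAP_N}$ over subsets). The function names, the bucket labels, and the numbers are made up for illustration and are not results from the paper; computing the normalized mAP itself is not shown.

```python
def temporal_iou(pred, gt):
    """tIoU between two (start, end) segments given in the same time unit."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def sensitivity(scores_by_bucket):
    """Difference between the max and min score across characteristic buckets."""
    values = list(scores_by_bucket.values())
    return max(values) - min(values)


if __name__ == "__main__":
    # A prediction counts as a match at tIoU=0.5 only if the overlap ratio reaches 0.5.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)) >= 0.5)

    # Hypothetical per-bucket scores, purely illustrative.
    toy_scores = {"short": 0.25, "medium": 0.75, "long": 0.5}
    print(sensitivity(toy_scores))
```

Under this definition, a lower sensitivity means the detector's accuracy varies less across subsets with different action characteristics.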