An Effective-Efficient Approach for Dense Multi-Label Action Detection

Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton

TL;DR

The paper tackles dense multi-label action detection, where actions overlap in time. It introduces a two-branch transformer framework (Assistant and Core) to separately learn co-occurrence relations and dense temporal features, with a non-hierarchical Core that preserves temporal positional information via relative positional encoding. A novel learning paradigm transfers co-occurrence knowledge from the Assistant branch to the Core during training, enabling explicit modeling of action dependencies without incurring extra inference cost. On Charades and MultiTHUMOS, the approach achieves state-of-the-art per-frame mAP and gains on action-conditional metrics, supporting the effectiveness of position-aware, non-hierarchical temporal modeling combined with training-time co-occurrence supervision. Limitations include reliance on a pre-trained video encoder (Vid-Enc), suggesting future work on end-to-end spatio-temporal learning and multimodal cues such as audio.

Abstract

Unlike the sparse-label action detection task, where a single action occurs at each timestamp of a video, in a dense multi-label scenario, actions can overlap. To address this challenging task, it is necessary to simultaneously learn (i) temporal dependencies and (ii) co-occurrence action relationships. Recent approaches model temporal information by extracting multi-scale features through hierarchical transformer-based networks. However, the self-attention mechanism in transformers inherently loses temporal positional information. We argue that combining this with multiple sub-sampling processes in hierarchical designs can lead to further loss of positional information. Preserving this information is essential for accurate action detection. In this paper, we address this issue by proposing a novel transformer-based network that (a) employs a non-hierarchical structure when modelling different ranges of temporal dependencies and (b) embeds relative positional encoding in its transformer layers. Furthermore, to model co-occurrence action relationships, current methods explicitly embed class relations into the transformer network. However, these approaches are not computationally efficient, as the network needs to compute all possible pairwise action-class relations. We also overcome this challenge by introducing a novel learning paradigm that allows the network to benefit from explicitly modelling temporal co-occurrence action dependencies without imposing their additional computational costs during inference. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets and show that our method improves the current state-of-the-art results.
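
The claim that self-attention loses temporal positional information can be illustrated with a small, self-contained sketch. It is not the paper's RPT block: the identity query/key/value projections, the single head, and the scalar per-offset bias `bias` are all simplifying assumptions chosen for illustration. The sketch checks that, without any positional term, permuting the input time steps merely permutes the output (so temporal order is invisible to the layer), and that adding a bias indexed by the relative offset `j - i` breaks this symmetry.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, rel_bias=None):
    """Single-head self-attention with identity Q/K/V projections (a simplification).

    x: list of T feature vectors, each a list of d floats.
    rel_bias: optional dict mapping the offset (j - i) to a scalar that is
        added to the attention logit between query step i and key step j.
    """
    T, d = len(x), len(x[0])
    out = []
    for i in range(T):
        logits = []
        for j in range(T):
            score = sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            if rel_bias is not None:
                score += rel_bias[j - i]
            logits.append(score)
        weights = softmax(logits)
        out.append([sum(weights[j] * x[j][k] for j in range(T)) for k in range(d)])
    return out


def permute(seq, perm):
    """Reorder the time steps of seq according to perm."""
    return [seq[p] for p in perm]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two T x d lists."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


rng = random.Random(0)
T, d = 5, 8
x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
perm = [4, 1, 2, 3, 0]  # swap the first and last time steps

# No positional term: shuffling the input only shuffles the output.
plain_gap = max_abs_diff(self_attention(permute(x, perm)),
                         permute(self_attention(x), perm))

# Hypothetical relative bias that penalises distant time steps.
bias = {o: -float(abs(o)) for o in range(-(T - 1), T)}
biased_gap = max_abs_diff(self_attention(permute(x, perm), bias),
                          permute(self_attention(x, bias), perm))

print(f"gap without bias: {plain_gap:.2e}")
print(f"gap with relative bias: {biased_gap:.2e}")
```

The first gap should be at floating-point noise level, while the second should be clearly non-zero, which is the sense in which a relative positional term lets attention distinguish temporal orderings.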

Paper Structure

This paper contains 14 sections, 19 equations, 6 figures, 10 tables, and 1 algorithm.

Figures (6)

  • Figure 1: A sample video and its corresponding action annotations from the Charades dataset (charades). The video includes several action types with different time spans, from short to long, and multiple actions can occur at the same time step.
  • Figure 2: The overall schema of our proposed network, which includes two branches: Assistant and Core. The Assistant branch comprises the multi-label relationship (ML-Rel) and multi-label classification (ML-CLAS) modules. The Core branch consists of a video encoder (Vid-Enc) and three main modules: fine detection (Fine-Det), coarse detection (Coarse-Det), and video classification (Vid-CLAS). During training, both the Assistant and Core branches are employed, while at inference, only the Core branch is deployed.
  • Figure 3: Architecture of our proposed relative positional transformer (RPT) block. An RPT block consists of a multi-head self-attention layer with relative positional embedding, followed by a local relational (LR) component. For brevity, the computation of the individual heads is not shown separately.
  • Figure 4: The hierarchical structure proposed in (dai2022ms; zhang2022actionformer) vs. our proposed non-hierarchical design in the Fine-Det and Coarse-Det modules to extract multi-scale features for action detection.
  • Figure 5: Architecture of our proposed Coarse-Det module that includes $F$ granularity branches to exploit different scales of temporal dependencies from the fine-grained features.
  • ...and 1 more figures