Reframing Dense Action Detection (RefDense): A Paradigm Shift in Problem Solving & a Novel Optimization Strategy

Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton

TL;DR

This work tackles dense action detection by decoupling it into two unambiguous sub-problems, detecting entity concepts and detecting motion concepts, each solved by a dedicated sub-network (Action-Entity and Action-Motion). It introduces a language-guided contrastive co-occurrence language-video loss, applied to the two sub-networks as $\mathcal{L}^{ent}_{CoLV}$ and $\mathcal{L}^{mot}_{CoLV}$, which explicitly supervises co-occurring concepts by aligning video embeddings with text descriptions of the co-occurring classes. The RefDense architecture decomposes action labels via prompts and LLMs, applies cross-attention between the sub-networks, and fuses their outputs for the final predictions. It reports state-of-the-art results on Charades and MultiTHUMOS, with average gains of 3.8% and 1.7% across all metrics, covering both standard and action-conditional metrics. The work also shows that the proposed loss improves existing models, which suggests it generalizes across dense action detection methods and could extend to multi-modal settings.

Abstract

Dense action detection involves detecting multiple co-occurring actions, while action classes are often ambiguous and represent overlapping concepts. We argue that handling the dual challenge of temporal and class overlaps is too complex to be tackled effectively by a single network. To address this, we propose decomposing the task of detecting dense ambiguous actions into detecting the dense, unambiguous sub-concepts that form the action classes (i.e., action entities and action motions), and assigning these sub-tasks to distinct sub-networks. By isolating these unambiguous concepts, the sub-networks can focus exclusively on resolving a single challenge: dense temporal overlaps. Furthermore, simultaneous actions in a video often exhibit interrelationships, and exploiting these relationships can improve detection performance. However, current dense action detection networks fail to learn these relationships effectively because they rely on binary cross-entropy optimization, which treats each class independently. To address this limitation, we propose providing explicit supervision on co-occurring concepts during network optimization through a novel language-guided contrastive learning loss. Our extensive experiments demonstrate the superiority of our approach over state-of-the-art methods, achieving substantial improvements of 3.8% and 1.7% on average across all metrics on the challenging benchmark datasets, Charades and MultiTHUMOS.
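To make the idea of language-guided contrastive supervision more concrete, the sketch below implements a generic symmetric InfoNCE objective that pulls each video embedding toward the text embedding of its own co-occurrence description and pushes it away from the others. The abstract does not give the loss formula, so this is not the paper's exact loss: the function name `info_nce`, the temperature value, and the assumption that embeddings are already produced by some video and text encoders are all illustrative choices.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over matched (video_i, text_i) pairs.

    Assumption: text_embs[i] encodes a description of the concepts that
    co-occur in the clip behind video_embs[i]. Encoders are out of scope.
    """
    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in text_embs]
    n = len(v)
    # logits[i][j]: temperature-scaled cosine similarity of video i and text j
    logits = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Softmax cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    text_to_video = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(text_to_video))


# Toy usage: matched pairs should score a lower loss than mismatched ones.
video = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = info_nce(video, [[1.0, 0.0], [0.0, 1.0]])
loss_mismatched = info_nce(video, [[0.0, 1.0], [1.0, 0.0]])
```

Unlike binary cross-entropy, which scores each class on its own, an objective of this kind compares one clip against descriptions of whole sets of concepts, which is one way co-occurrence information can enter the training signal.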

Paper Structure

This paper contains 9 sections, 11 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Comparison of current approaches and our proposed approach, RefDense, for tackling the dense action detection task. (a) Current approaches address the entire problem directly (i.e., detecting dense, ambiguous actions) using a single network optimized solely with Binary Cross-Entropy (BCE) loss. In contrast, (b) RefDense decomposes the task into two sub-tasks (i.e., detecting the dense, unambiguous entity and motion sub-concepts underlying the action classes) and assigns them to distinct sub-networks. Furthermore, our approach is optimized using both BCE loss and our proposed contrastive co-occurrence language-video loss.
  • Figure 2: The overall scheme of RefDense. Our proposed network consists of two sub-networks: Action-Entity and Action-Motion. Action-Entity learns dense entity concepts associated with the action classes, while Action-Motion focuses on learning dense motion concepts related to the action classes. The entire network is optimized using the dense action labels and the BCE loss ($\mathcal{L}^{Action}_{BCE}$). Additionally, the sub-networks are optimized using dense action-entity and action-motion labels, which are derived from action labels, along with the BCE loss ($\mathcal{L}^{ent}_{BCE}$ and $\mathcal{L}^{mot}_{BCE}$) and our proposed contrastive co-occurrence language-video loss ($\mathcal{L}^{ent}_{CoLV}$ and $\mathcal{L}^{mot}_{CoLV}$).
  • Figure 3: Alignment of temporal video features with textual features of co-occurring class concepts in our contrastive co-occurrence language-video loss (i.e., $\mathcal{L}^{ent}_{{CoLV}}$ and $\mathcal{L}^{mot}_{{CoLV}}$).
  • Figure 4: Qualitative comparison with previous approaches (PAT [sardari2023pat] and MS-TCT [dai2022ms]) on a test video sample of Charades.
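The Figure 2 caption states that the dense action-entity and action-motion labels are derived from the action labels. A minimal sketch of one way such a derivation could work is shown below. The `DECOMPOSITION` table is hand-written and purely hypothetical (the source says the decomposition is obtained via prompts and LLMs), the class names are illustrative rather than taken from Charades or MultiTHUMOS, and the rule "a concept is active whenever any action mapped to it is active" is an assumption.

```python
# Hypothetical action -> (entity, motion) table. In the source this mapping is
# said to come from prompting an LLM; these entries are hand-written examples.
DECOMPOSITION = {
    "holding a cup": ("cup", "holding"),
    "drinking from a cup": ("cup", "drinking"),
    "sitting on a sofa": ("sofa", "sitting"),
}


def build_vocab(index):
    """Sorted unique entity (index=0) or motion (index=1) concepts."""
    return sorted({pair[index] for pair in DECOMPOSITION.values()})


def derive_dense_labels(action_names, dense_action_labels, index):
    """Map per-time-step multi-hot action labels to multi-hot concept labels.

    Assumed rule: a concept is active at a time step if at least one active
    action at that step maps to it.
    """
    vocab = build_vocab(index)
    position = {concept: i for i, concept in enumerate(vocab)}
    dense = []
    for step in dense_action_labels:
        row = [0] * len(vocab)
        for name, active in zip(action_names, step):
            if active:
                row[position[DECOMPOSITION[name][index]]] = 1
        dense.append(row)
    return vocab, dense


actions = list(DECOMPOSITION)
# Step 0: "holding a cup" and "sitting on a sofa"; step 1: "drinking from a cup".
labels = [[1, 0, 1], [0, 1, 0]]
entity_vocab, entity_labels = derive_dense_labels(actions, labels, index=0)
motion_vocab, motion_labels = derive_dense_labels(actions, labels, index=1)
```

In this toy example two distinct action classes share the entity "cup", so the entity label space is smaller than the action label space, which illustrates how decomposition can remove overlap between classes.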