Reframing Dense Action Detection (RefDense): A Paradigm Shift in Problem Solving & a Novel Optimization Strategy

Faegheh Sardari; Armin Mustafa; Philip J. B. Jackson; Adrian Hilton

TL;DR

This work tackles dense action detection by decomposing it into two unambiguous sub-problems, entity concepts and motion concepts, each solved by a dedicated sub-network (Action-Entity and Action-Motion). It introduces a language-guided contrastive learning loss (LRD_CoLV) that explicitly supervises co-occurring concepts by aligning video embeddings with text descriptions of the co-occurring classes. The RefDense architecture decomposes labels using prompts and LLMs, applies cross-attention between the sub-networks, and fuses their outputs for the final predictions. It achieves state-of-the-art results on Charades and MultiTHUMOS, with gains in both standard and action-conditional metrics. The proposed loss also improves existing models, which suggests it generalizes across dense action detection methods and could extend to multi-modal settings.
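
The label decomposition described above can be illustrated with a minimal sketch. The mapping below is hand-written and hypothetical (the class names, `DECOMPOSITION`, and `decompose_labels` are illustrative assumptions); the paper obtains its decomposition via prompts and LLMs, which this sketch does not reproduce. It only shows how a set of co-occurring action labels at one time step could be turned into separate entity and motion targets for the two sub-networks.

```python
# Hypothetical, hand-written decomposition of action classes into
# (entity, motion) pairs. The paper derives this with prompts and LLMs;
# the class names here are illustrative only.
DECOMPOSITION = {
    "holding a cup": ("cup", "holding"),
    "drinking from a cup": ("cup", "drinking"),
    "holding a book": ("book", "holding"),
}


def decompose_labels(active_actions):
    """Split the actions active at one time step into entity and motion labels."""
    entities = sorted({DECOMPOSITION[a][0] for a in active_actions})
    motions = sorted({DECOMPOSITION[a][1] for a in active_actions})
    return entities, motions


# Two co-occurring actions that share an entity but differ in motion.
frame_actions = ["holding a cup", "drinking from a cup"]
entities, motions = decompose_labels(frame_actions)
print(entities, motions)
```

In this toy case the two overlapping action classes collapse to a single entity label and two distinct motion labels, so each sub-network would see targets without class-level overlap.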

Abstract

Dense action detection involves detecting multiple co-occurring actions, while action classes are often ambiguous and represent overlapping concepts. We argue that the dual challenge of temporal and class overlaps is too complex to be tackled effectively by a single network. To address this, we propose decomposing the task of detecting dense, ambiguous actions into detecting the dense, unambiguous sub-concepts that form the action classes (i.e., action entities and action motions), and assigning these sub-tasks to distinct sub-networks. By isolating these unambiguous concepts, each sub-network can focus exclusively on a single challenge: dense temporal overlaps. Furthermore, simultaneous actions in a video are often interrelated, and exploiting these relationships can improve detection performance. However, current dense action detection networks fail to learn these relationships effectively because they rely on binary cross-entropy optimization, which treats each class independently. To address this limitation, we propose providing explicit supervision on co-occurring concepts during network optimization through a novel language-guided contrastive learning loss. Our extensive experiments demonstrate the superiority of our approach over state-of-the-art methods, with average improvements across all metrics of 3.8% and 1.7% on the challenging benchmark datasets Charades and MultiTHUMOS, respectively.
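
To make the idea of contrastive supervision on co-occurring concepts concrete, here is a minimal sketch of a generic InfoNCE-style loss that pulls a video embedding toward the text embedding describing its co-occurring classes and pushes it away from other descriptions. This is an assumption for illustration, not the paper's exact loss: the function names, the temperature value, and the toy 2-D embeddings are all hypothetical, and real embeddings would come from video and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(video_emb, text_embs, positive_idx, temperature=0.1):
    """Generic InfoNCE-style loss (an assumed stand-in for the paper's loss).

    text_embs holds embeddings of candidate descriptions of co-occurring
    classes; positive_idx marks the description that matches the video.
    """
    logits = [cosine(video_emb, t) / temperature for t in text_embs]
    # Numerically stable log-sum-exp over all candidate descriptions.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_idx]


# Toy embeddings: the video aligns with the first description only.
video = [1.0, 0.0]
texts = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = contrastive_loss(video, texts, positive_idx=0)
loss_mismatched = contrastive_loss(video, texts, positive_idx=1)
print(loss_matched, loss_mismatched)
```

Unlike binary cross-entropy, which scores each class on its own, the target here is a description of the whole set of co-occurring classes, so the loss is low only when the video embedding matches that joint description.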
