JARViS: Detecting Actions in Video Using Unified Actor-Scene Context Relation Modeling

Seok Hwan Lee, Taein Son, Soo Won Seo, Jisong Kim, Jun Won Choi

TL;DR

A new two-stage VAD framework called Joint Actor-scene context Relation modeling based on Visual Semantics (JARViS) is proposed, which effectively consolidates cross-modal action semantics distributed globally across spatial and temporal dimensions using Transformer attention.

Abstract

Video action detection (VAD) is a formidable vision task that involves the localization and classification of actions within the spatial and temporal dimensions of a video clip. Among the myriad VAD architectures, two-stage VAD methods utilize a pre-trained person detector to extract the region of interest features, subsequently employing these features for action detection. However, the performance of two-stage VAD methods has been limited as they depend solely on localized actor features to infer action semantics. In this study, we propose a new two-stage VAD framework called Joint Actor-scene context Relation modeling based on Visual Semantics (JARViS), which effectively consolidates cross-modal action semantics distributed globally across spatial and temporal dimensions using Transformer attention. JARViS employs a person detector to produce densely sampled actor features from a keyframe. Concurrently, it uses a video backbone to create spatio-temporal scene features from a video clip. Finally, the fine-grained interactions between actors and scenes are modeled through a Unified Action-Scene Context Transformer to directly output the final set of actions in parallel. Our experimental results demonstrate that JARViS outperforms existing methods by significant margins and achieves state-of-the-art performance on three popular VAD datasets, including AVA, UCF101-24, and JHMDB51-21.
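
The core idea in the abstract, letting actor features and spatio-temporal scene features interact through Transformer attention, can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes both feature types have already been mapped to embedding vectors of the same size, and it uses a single attention head with identity query/key/value projections (no learned weights). The function names `self_attention` and `joint_actor_scene_attention` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections, so each output token is a
    convex combination of the input tokens.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def joint_actor_scene_attention(actor_tokens, scene_tokens):
    """Attend over the concatenated actor and scene tokens.

    Returns only the updated actor tokens, which now mix in scene context.
    """
    tokens = actor_tokens + scene_tokens
    attended = self_attention(tokens)
    return attended[: len(actor_tokens)]


# Two actor embeddings and three scene embeddings, all of size 4 (toy values).
actors = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
scene = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
updated = joint_actor_scene_attention(actors, scene)
```

In this toy setting every actor token attends to every scene token (and to the other actors) in one pass, which is the sense in which the relation modeling is "unified"; a real model would add learned projections, multiple heads, stacked layers, and a classification head.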

Paper Structure

This paper contains 39 sections, 10 equations, 10 figures, 14 tables.

Figures (10)

  • Figure 1: Comparison of JARViS with other VAD architectures. JARViS employs a two-stage pipeline that initially generates densely sampled actor semantics using a pre-trained person detector, and then leverages their relationship with spatio-temporal scene context features to produce the final set of actions.
  • Figure 2: Overall architecture of the JARViS model. JARViS produces actor proposal features and scene context features by applying separate backbone networks to the keyframe image and the video clip, respectively. These features are linearly mapped to embedding vectors of the same size. The transformer encoder then transforms the embedding vectors into the final action proposal features. Finally, the action classification results are obtained through the MLP head.
  • Figure 3: JARViS with a long-term video clip. JARViS can detect actions from a long-term video clip. Given a fixed keyframe, JARViS produces action scores from a short-term video sequence inside a sliding window. The action scores obtained at each window position are aggregated with trainable weights. Note that these weights vary depending on the action class and the window position relative to the keyframe.
  • Figure 4: Different relation modeling architectures. Note that $L$ denotes the number of layers and the blue, red, and green tokens represent the scene context embedding, actor embedding, and their joint representations, respectively.
  • Figure 5: Performance of JARViS versus ACAR for each action class on the AVA v2.2.
  • ...and 5 more figures
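
The long-term aggregation described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the caption only says that per-window action scores are combined with trainable weights that depend on the action class and the window position. Here the weights are fixed logits rather than trained parameters, and normalizing them with a softmax over window positions is an assumption. The name `aggregate_window_scores` is hypothetical.

```python
import math


def aggregate_window_scores(window_scores, weight_logits):
    """Combine per-window action scores into one score per class.

    window_scores[w][c]: score for class c from window position w.
    weight_logits[w][c]: weight logit for window w and class c
        (assumption: normalized with a softmax over windows, per class).
    """
    num_windows = len(window_scores)
    num_classes = len(window_scores[0])
    aggregated = []
    for c in range(num_classes):
        logits = [weight_logits[w][c] for w in range(num_windows)]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        aggregated.append(
            sum(weights[w] * window_scores[w][c] for w in range(num_windows))
        )
    return aggregated


# Three window positions around a fixed keyframe, two action classes (toy values).
scores = [[0.2, 0.9], [0.8, 0.1], [0.4, 0.5]]

# Equal logits reduce to a plain average over windows.
uniform = aggregate_window_scores(scores, [[0.0, 0.0]] * 3)

# A large logit for class 0 at the middle window makes that window dominate
# class 0, while class 1 is still averaged.
peaked = aggregate_window_scores(scores, [[0.0, 0.0], [50.0, 0.0], [0.0, 0.0]])
```

Because the logits are indexed by both window and class, one class can rely mostly on frames near the keyframe while another draws on more distant windows, which matches the dependence on class and window position that the caption describes.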