JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts

Taein Son, Soo Won Seo, Jisong Kim, Seok Hwan Lee, Jun Won Choi

TL;DR

JoVALE tackles video action detection by fusing audio, visual, and scene-descriptive contexts in an actor-centric Transformer framework. It introduces the Actor-centric Multi-modal Fusion Network (AMFN) with two key modules, Multi-modal Feature Encoding (MFE) and Multi-modal Feature Aggregation (MFA), leveraging Temporal Bottleneck Features for efficiency and an adaptive gated fusion to weigh modalities per actor. With BLIP-based scene context and off-the-shelf person detectors, JoVALE achieves state-of-the-art frame-level mAP on AVA ($\text{mAP} = 40.1$) and strong results on UCF101-24 and JHMDB51-21, while remaining effective even when audio is sparse. The approach demonstrates that selective, per-actor multi-modal fusion substantially improves localization and action classification, offering practical benefits for robust VAD in diverse real-world scenes, and points to future work with Vision-Language Foundation Models.

Abstract

Video Action Detection (VAD) entails localizing and categorizing action instances within videos, which inherently consist of diverse information sources such as audio, visual cues, and surrounding scene contexts. Leveraging this multi-modal information effectively for VAD poses a significant challenge, as the model must identify action-relevant cues with precision. In this study, we introduce a novel multi-modal VAD architecture, referred to as the Joint Actor-centric Visual, Audio, Language Encoder (JoVALE). JoVALE is the first VAD method to integrate audio and visual features with scene-descriptive context sourced from large-capacity image captioning models. At the heart of JoVALE is the actor-centric aggregation of audio, visual, and scene-descriptive information, enabling adaptive integration of crucial features for recognizing each actor's actions. We have developed a Transformer-based architecture, the Actor-centric Multi-modal Fusion Network, specifically designed to capture the dynamic interactions among actors and their multi-modal contexts. Our evaluation on three prominent VAD benchmarks (AVA, UCF101-24, and JHMDB51-21) demonstrates that incorporating multi-modal information significantly enhances performance and sets a new state of the art on these benchmarks.
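To make the idea of actor-centric, adaptive aggregation more concrete, the following is a minimal sketch of per-actor gated fusion over three modality embeddings. It is not the authors' implementation: the softmax gate, the function name `gated_fusion`, and the `gate_weights` parameters are all assumptions chosen for illustration, and the actual gating in AMFN may differ (for example, it could use sigmoid gates or attention).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gated_fusion(actor_embeddings, gate_weights):
    """Fuse one actor's per-modality embeddings with softmax gates.

    actor_embeddings: dict mapping modality name -> embedding (list of floats,
        all with the same dimension).
    gate_weights: dict mapping modality name -> weight vector of the same
        dimension. These are hypothetical learned parameters, not taken
        from the paper.
    Returns (fused_embedding, gates), where gates maps modality -> weight.
    """
    modalities = sorted(actor_embeddings)
    # One scalar relevance score per modality for this actor.
    scores = [dot(gate_weights[m], actor_embeddings[m]) for m in modalities]
    gates = softmax(scores)
    dim = len(actor_embeddings[modalities[0]])
    # Gate-weighted sum of the modality embeddings, element by element.
    fused = [
        sum(g * actor_embeddings[m][i] for g, m in zip(gates, modalities))
        for i in range(dim)
    ]
    return fused, dict(zip(modalities, gates))


# Toy example: an actor whose audio embedding carries no signal.
actor = {"visual": [1.0, 0.5], "audio": [0.0, 0.0], "scene": [0.2, 0.8]}
weights = {"visual": [1.0, 1.0], "audio": [1.0, 1.0], "scene": [1.0, 1.0]}
fused, gates = gated_fusion(actor, weights)
```

In this toy setup the uninformative audio embedding receives the smallest gate, which mirrors the intuition that fusion should down-weight a modality when it contributes little for a given actor. Because the gates are computed per actor, two people in the same frame can weight the modalities differently.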

Paper Structure

This paper contains 31 sections, 7 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Overview of JoVALE: (top-left) The proposed JoVALE integrates audio, visual, and scene-descriptive features using an AMFN. (bottom-left) JoVALE leverages a VLP model fine-tuned on an image captioning task to generate scene-descriptive features. (right) AMFN encodes high-level interactions between multi-modal features through MFE and MFA.
  • Figure 2: Structure of AMFN: Three independent MFEs encode the context features within each modality. Then, MFA combines Action Embeddings derived from each modality.
  • Figure 3: Structure of Multi-modal Feature Encoding. This illustration depicts the process for the visual modality. Identical structures are applied individually to other modalities.
  • Figure 4: Structure of Multi-modal Feature Aggregation.
  • Figure 5: Different multi-modal fusion strategies: The symbol ⓒ denotes the channel-wise concatenation.
  • ...and 1 more figures