Actor-agnostic Multi-label Action Recognition with Multi-modal Query

Anindya Mondal, Sauradip Nag, Joaquin M Prada, Xiatian Zhu, Anjan Dutta

TL;DR

The paper tackles actor-agnostic, multi-label action recognition for both humans and animals by removing the dependence on actor-specific pose estimation and by using textual as well as visual information. It introduces MSQNet, a DETR-like transformer architecture with three components: a spatio-temporal video encoder, a multi-modal (vision-language) query encoder, and a multi-modal decoder that uses the vision-language queries for multi-label action classification. Class queries are initialized from text embeddings and combined with CLIP-based image features. MSQNet achieves superior performance on five benchmarks in both fully supervised and zero-shot settings, with notable gains on the diverse Animal Kingdom dataset. The work demonstrates that fusing vision-language cues in a transformer-based detection framework yields better action representations, reduces reliance on actor-specific cues, and enables zero-shot generalization, while a single actor-agnostic model avoids the maintenance cost of separate actor-specific designs. Code and models are provided to facilitate replication and broader adoption.
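
To make the three-component design concrete, the following is a minimal, dependency-free Python sketch of the data flow only; it is not the authors' implementation. It assumes that each class query is formed by concatenating a class-name text embedding with a pooled video embedding and projecting the result, that the decoder is a single cross-attention step without learned key/value projections, and that one shared linear head with a sigmoid scores every class. All names (`build_queries`, `cross_attend`, `msq_forward`, `W_fuse`) and the random toy inputs are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def build_queries(text_emb, video_emb, W_fuse):
    """Assumed fusion: one query per class from [text ; video] -> d dims."""
    return [matvec(W_fuse, t + video_emb) for t in text_emb]


def cross_attend(query, tokens):
    """Single-head scaled dot-product attention of one query over video tokens."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    weights = softmax(scores)
    return [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)]


def msq_forward(tokens, text_emb, video_emb, W_fuse, w_out, b_out):
    """Return one independent probability per action class."""
    probs = []
    for query in build_queries(text_emb, video_emb, W_fuse):
        h = cross_attend(query, tokens)
        logit = sum(w * x for w, x in zip(w_out, h)) + b_out
        probs.append(sigmoid(logit))
    return probs


def rand_vec(n):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]


# Toy stand-ins for encoder outputs (random numbers, not real features).
random.seed(0)
d, n_tokens, n_classes = 4, 6, 3
tokens = [rand_vec(d) for _ in range(n_tokens)]      # spatio-temporal video tokens
text_emb = [rand_vec(d) for _ in range(n_classes)]   # class-name text embeddings
video_emb = rand_vec(d)                              # pooled image-encoder embedding
W_fuse = [rand_vec(2 * d) for _ in range(d)]         # projection of [text ; video]
w_out, b_out = rand_vec(d), 0.0                      # shared classification head

probs = msq_forward(tokens, text_emb, video_emb, W_fuse, w_out, b_out)
print(probs)
```

Because each class has its own query and its own sigmoid output, adding or swapping class names only changes `text_emb`, which is the property that makes text-initialized queries convenient for zero-shot evaluation.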

Abstract

Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called 'actor-agnostic multi-modal multi-label action recognition,' which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by leveraging visual and textual modalities to represent the action classes better. The elimination of actor-specific model designs is a key advantage, as it removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that our MSQNet consistently outperforms the prior arts of actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code is made available at https://github.com/mondalanindya/MSQNet.
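
The contrast the abstract draws between single-label classification and concurrent actions can be illustrated with a small sketch. It assumes the usual formulation in which single-label models pick the arg-max of a softmax, while multi-label models threshold an independent sigmoid per class; the class names, logits, and the 0.5 threshold are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_single_label(logits, class_names):
    """Softmax forces the classes to compete; exactly one label is returned."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best]


def predict_multi_label(logits, class_names, threshold=0.5):
    """Each class is scored independently, so several labels can fire at once."""
    return [name for name, z in zip(class_names, logits) if sigmoid(z) >= threshold]


class_names = ["walking", "eating", "swimming"]  # hypothetical action classes
logits = [2.0, 1.5, -3.0]                        # hypothetical model outputs

single = predict_single_label(logits, class_names)
multi = predict_multi_label(logits, class_names)
print(single)  # walking
print(multi)   # ['walking', 'eating']
```

With the same logits, the single-label rule discards "eating" even though its score is high, whereas the multi-label rule reports both concurrent actions.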

Paper Structure (13 sections, 6 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Illustration of large action variation across different actors (e.g., animals and humans). Such differences often motivate the development of actor-specific action recognition models, such as those using actor-specific pose estimation (Ng et al., 2022).
  • Figure 2: Overview of our MSQNet for multi-modal multi-label action recognition. It has three components: a spatio-temporal video encoder, a vision-language query encoder and a multi-modal decoder. The video encoder extracts the spatio-temporal features from an input video, the query encoder merges the visual and textual information, and the multi-modal decoder transforms the video encoding to make multi-label classification with a feed-forward network (FFN).
  • Figure 3: Attention rollout on sample videos from Animal Kingdom (Ng et al., 2022) showing raw frames, heatmap with only bare backbone, with uni-modal prompt, and MSQNet.
  • Figure 4: Video embeddings without and with the proposed multi-modal query learning on Animal Kingdom and Charades. Arrow shows the transition.
  • Figure 5: Confidence scores of top-5 classes predicted by our MSQNet. Correctly classified action classes are marked with a ✓.
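
Figure 3 relies on attention rollout. As a rough illustration of that general technique, the sketch below follows the commonly used formulation: each layer's head-averaged attention matrix is mixed with the identity to account for residual connections, and the results are multiplied across layers. The toy matrices are hypothetical, and the authors' exact visualization settings are not stated here.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)] for row in A]


def attention_rollout(layers):
    """Combine per-layer attention into input-to-output token relevance.

    layers: list of n x n row-stochastic matrices, ordered from the first
    layer to the last, each already averaged over attention heads.
    """
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for A in layers:
        # 0.5 * A + 0.5 * I models the residual path and keeps rows summing to 1.
        A_res = [[0.5 * A[i][j] + (0.5 if i == j else 0.0) for j in range(n)] for i in range(n)]
        rollout = matmul(A_res, rollout)
    return rollout


# Hypothetical attention for 3 tokens over 2 layers.
layers = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]],
    [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.25, 0.25, 0.5]],
]
rollout = attention_rollout(layers)
for row in rollout:
    print([round(v, 3) for v in row])
```

Row `i` of the result estimates how much each input token contributes to output token `i`; for a heatmap, the row of a summary token (or an average over rows) is typically reshaped onto the frame's patch grid.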