Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context

Manuel Benavent-Lledo, David Mulero-Pérez, David Ortiz-Perez, Jose Garcia-Rodriguez, Antonis Argyros

TL;DR

The paper tackles action recognition by leveraging hierarchical action structures and contextual textual information to capture longer temporal dependencies. It introduces a vision-language Transformer that fuses RGB features, optical-flow features, and contextual text prompts (including past actions and location), and it trains with a joint coarse- and fine-grained loss $L$. The authors propose the Hierarchical TSU dataset, which extends TSU with a two-level hierarchy and contextual labels, and validate the approach on Hierarchical TSU, IkeaASM, and Assembly101, showing substantial gains over RGB-only baselines and competitive or superior performance versus state-of-the-art methods. The findings indicate that long-range textual context and hierarchical cues significantly improve action understanding in activities-of-daily-living (ADL) settings, with practical implications for assistive technologies and industrial safety, while also highlighting the need for well-aligned hierarchies to avoid noise. The work provides strong baselines and ablations that quantify the impact of context, hierarchy, and fusion design on action recognition performance, and it releases code and annotations to support future research.

Abstract

We propose a novel approach to improve action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and previous actions, to reflect the action's temporal context. To achieve this, we introduce a transformer architecture tailored for action recognition that employs both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function to simultaneously train the model for both coarse- and fine-grained action recognition, effectively exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset by incorporating action hierarchies, resulting in the Hierarchical TSU dataset, a hierarchical dataset designed for monitoring activities of the elderly in home environments. An ablation study assesses the performance impact of different strategies for integrating contextual and hierarchical data. Experimental results demonstrate that the proposed method consistently outperforms SOTA methods on the Hierarchical TSU dataset, Assembly101 and IkeaASM, achieving over a 17% improvement in top-1 accuracy.
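The joint objective mentioned in the abstract can be illustrated with a minimal sketch. The exact form of the loss is not given in this summary, so the code below assumes a weighted sum of two cross-entropy terms, one for the coarse-grained label and one for the fine-grained label. The function names and the `weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample, computed from raw logits.

    Uses the log-sum-exp trick (subtracting the max logit) for numerical stability.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def joint_loss(coarse_logits, coarse_target, fine_logits, fine_target, weight=0.5):
    """Assumed joint loss: weight * L_coarse + (1 - weight) * L_fine.

    `weight` is a hypothetical balancing factor; the paper may combine
    the two terms differently.
    """
    l_coarse = cross_entropy(coarse_logits, coarse_target)
    l_fine = cross_entropy(fine_logits, fine_target)
    return weight * l_coarse + (1.0 - weight) * l_fine


# Usage: 2 coarse classes and 4 fine classes for a single sample.
loss = joint_loss([2.0, 0.1], 0, [0.3, 1.5, -0.2, 0.0], 1)
```

Training both heads through one scalar lets gradients from the coarse task shape the representation used for the fine task, which is the sense in which the hierarchy is exploited.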

Paper Structure

This paper contains 24 sections, 6 equations, 2 figures, 20 tables.

Figures (2)

  • Figure 1: Hierarchical annotation example in the TSU dataset. Video frames are annotated with fine-grained action labels, including composite activities (e.g., Make coffee). Each color corresponds to one of the coarse-grained action categories.
  • Figure 2: Overview of the proposed action recognition architecture. From a video input, frozen feature extractors obtain spatial features from RGB frames and temporal motion features from optical flow. These features, along with a class token, are processed through individual video transformer encoders for each visual modality, capturing long-range temporal dependencies. Dashed lines indicate the use of either RGB or flow features and their respective embeddings, though only a single video encoder is depicted for simplicity. Additionally, DistilBERT extracts textual features that represent the current location, previous actions and dataset context. Coarse-grained actions are recognized based on contextual information, specifically utilizing the textual embeddings while the fine-grained classifier leverages the fused features from the fusion transformer.
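The Figure 2 caption states that the text branch encodes the current location, previous actions, and dataset context. As a rough illustration of how such a contextual prompt might be assembled before it is passed to a text encoder such as DistilBERT, consider the sketch below. The prompt template, the function name `build_context_prompt`, and the `max_history` cutoff are all assumptions made for illustration; the summary does not specify the wording the authors use.

```python
def build_context_prompt(location, previous_actions, dataset_context, max_history=3):
    """Compose a contextual text prompt from location and action history.

    The template is hypothetical. `max_history` (also hypothetical) limits
    how many of the most recent previous actions are included.
    """
    recent = previous_actions[-max_history:] if max_history > 0 else []
    history = ", ".join(recent) if recent else "none"
    return f"{dataset_context} Location: {location}. Previous actions: {history}."


# Usage: the resulting string would be tokenized and embedded by the text encoder.
prompt = build_context_prompt(
    "kitchen",
    ["Enter", "Make coffee"],
    "Daily living activities at home.",
)
```

Keeping the history as plain text means longer temporal context can be supplied without processing additional video frames, which matches the role the caption assigns to the textual embeddings.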