Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context
Manuel Benavent-Lledo, David Mulero-Pérez, David Ortiz-Perez, Jose Garcia-Rodriguez, Antonis Argyros
TL;DR
The paper improves action recognition by exploiting the hierarchical structure of actions and contextual textual information, which lets the model capture longer temporal dependencies. It introduces a vision-language Transformer that fuses RGB, optical-flow, and contextual text prompts (past actions and location) and is trained with a joint coarse- and fine-grained loss $L$. The authors also propose the Hierarchical TSU dataset, which extends TSU with a two-level action hierarchy and contextual labels. The approach is validated on Hierarchical TSU, IkeaASM, and Assembly101, where it shows substantial gains over RGB-only baselines and competitive or superior performance relative to state-of-the-art methods. The findings indicate that long-range textual context and hierarchical cues significantly improve action understanding in activities of daily living (ADL) settings, with practical implications for assistive technologies and industrial safety; they also highlight that hierarchies must be well aligned to avoid introducing noise. The work provides strong baselines and ablations that quantify the impact of context, hierarchy, and fusion design on recognition performance, and it releases code and annotations to support future research.
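To make the idea of a contextual text prompt concrete, the following is a minimal sketch of how past actions and location might be combined into a single string before text encoding. The function name `build_context_prompt`, the template wording, and the `max_history` parameter are illustrative assumptions and are not taken from the paper.

```python
def build_context_prompt(past_actions, location, max_history=3):
    """Compose a text prompt from recent actions and the current location.

    The template and `max_history` are hypothetical choices for illustration;
    the paper's actual prompt format may differ.
    """
    # Keep only the most recent actions (guard against max_history <= 0,
    # since a slice of [-0:] would return the whole list).
    recent = past_actions[-max_history:] if max_history > 0 else []
    history = ", ".join(recent) if recent else "none"
    return f"Location: {location}. Previous actions: {history}."


if __name__ == "__main__":
    print(build_context_prompt(["enter", "take cup", "pour water"], "kitchen"))
```

A prompt of this kind would then be passed through a text encoder, and the resulting embedding fused with the visual features.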
Abstract
We propose a novel approach to improve action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and previous actions, to reflect the action's temporal context. To achieve this, we introduce a transformer architecture tailored for action recognition that employs both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function to simultaneously train the model for both coarse- and fine-grained action recognition, effectively exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset with action hierarchies, resulting in Hierarchical TSU, a dataset designed for monitoring the activities of elderly people in home environments. An ablation study assesses the performance impact of different strategies for integrating contextual and hierarchical data. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art (SOTA) methods on the Hierarchical TSU, Assembly101, and IkeaASM datasets, achieving an improvement of over 17% in top-1 accuracy.
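As an illustration of joint coarse- and fine-grained training, the sketch below combines two cross-entropy terms into one scalar loss. The abstract does not state the exact form of the joint loss, so the weighted sum and the balancing factor `weight` are assumptions made here for illustration, as are the function names.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of class `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def joint_loss(coarse_logits, coarse_target, fine_logits, fine_target, weight=0.5):
    """Weighted sum of coarse- and fine-grained cross-entropy terms.

    `weight` is a hypothetical balancing factor, not a value from the paper.
    """
    l_coarse = cross_entropy(coarse_logits, coarse_target)
    l_fine = cross_entropy(fine_logits, fine_target)
    return weight * l_coarse + (1.0 - weight) * l_fine


if __name__ == "__main__":
    # Two coarse classes, four fine classes; all logits are made-up numbers.
    print(joint_loss([2.0, 0.1], 0, [1.5, 0.2, 0.1, 0.0], 0))
```

Under this formulation, a single backward pass updates the shared backbone with gradients from both levels of the hierarchy, so errors at the coarse level also shape the fine-grained representation.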
