LS-HAR: Language Supervised Human Action Recognition with Salient Fusion, Construction Sites as a Use-Case

Mohammad Mahdavian, Mohammad Loni, Ted Samuelsson, Mo Chen

TL;DR

LS-HAR targets robust human action recognition (HAR) by fusing skeleton and video data under language supervision. It introduces two components: a skeleton encoder guided by a language model through learnable prompts conditioned on skeleton features, and a salient fusion module that emphasizes informative frames and joints. The model is trained with a joint objective that combines a contrastive term and a classification term, $L_{total} = \lambda L_{Tcont} + L_{cls}$, where $\lambda$ weights the contrastive term. A new dataset, VolvoConstAct, is presented for evaluating construction-site robotic instruction in real-world scenarios. Results on NTU-RGB+D, NTU-RGB+D 120, and NW-UCLA show competitive performance, and results on VolvoConstAct indicate that the approach is practical for autonomous construction machines. The prompts are trainable end to end, and the transformer-based fusion strategy is reported to run efficiently on edge devices.
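To make the joint objective concrete, here is a minimal NumPy sketch of $L_{total} = \lambda L_{Tcont} + L_{cls}$. The summary does not give the exact form of $L_{Tcont}$, so the contrastive term below is an assumption: a softmax over cosine similarities between each skeleton feature and one text feature per class. The function names, the `temperature` value, and the default `lam` are illustrative and not taken from the paper.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def classification_loss(logits, labels):
    """Mean cross-entropy over the batch (plays the role of L_cls)."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(labels)), labels].mean()


def contrastive_loss(skel_feat, text_feat, labels, temperature=0.1):
    """Assumed form of L_Tcont, not the paper's definition.

    skel_feat: (batch, dim) skeleton features.
    text_feat: (num_classes, dim) one text feature per action class.
    Each skeleton feature is pulled toward the text feature of its own
    class and pushed away from the text features of the other classes.
    """
    s = skel_feat / np.linalg.norm(skel_feat, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat, axis=1, keepdims=True)
    sim = s @ t.T / temperature  # (batch, num_classes) scaled cosine similarity
    return classification_loss(sim, labels)


def total_loss(logits, skel_feat, text_feat, labels, lam=0.5):
    """L_total = lam * L_Tcont + L_cls."""
    l_cont = contrastive_loss(skel_feat, text_feat, labels)
    l_cls = classification_loss(logits, labels)
    return lam * l_cont + l_cls
```

Setting `lam=0` reduces the objective to plain classification, which is a convenient sanity check when experimenting with the weighting.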

Abstract

Detecting human actions is a crucial task for autonomous robots and vehicles, often requiring the integration of various data modalities for improved accuracy. In this study, we introduce a novel approach to Human Action Recognition (HAR) using language supervision named LS-HAR based on skeleton and visual cues. Our method leverages a language model to guide the feature extraction process in the skeleton encoder. Specifically, we employ learnable prompts for the language model conditioned on the skeleton modality to optimize feature representation. Furthermore, we propose a fusion mechanism that combines dual-modality features using a salient fusion module, incorporating attention and transformer mechanisms to address the modalities' high dimensionality. This fusion process prioritizes informative video frames and body joints, enhancing the recognition accuracy of human actions. Additionally, we introduce a new dataset tailored for real-world robotic applications in construction sites, featuring visual, skeleton, and depth data modalities, named VolvoConstAct. This dataset serves to facilitate the training and evaluation of machine learning models to instruct autonomous construction machines for performing necessary tasks in real-world construction sites. To evaluate our approach, we conduct experiments on our dataset as well as three widely used public datasets: NTU-RGB+D, NTU-RGB+D 120, and NW-UCLA. Results reveal that our proposed method achieves promising performance across all datasets, demonstrating its robustness and potential for various applications. The code, dataset, and demonstration of real-machine experiments are available at: https://mmahdavian.github.io/ls_har/
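The idea of learnable prompts conditioned on the skeleton modality can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: a small two-layer meta-net (hypothetical sizes, no biases) maps a skeleton feature to a 512-D shift that is added to every learnable context vector, and a class-token embedding is then inserted among the context vectors, following the $V_1 V_2 \dots [\text{CLASS}] \dots V_M$ layout described in the Figure 2 caption. The class name `SkeletonConditionedPrompt` and its parameters are made up for this example, and the sketch builds a single prompt rather than one per limb.

```python
import numpy as np

EMBED_DIM = 512  # each prompt vector is 512-D, per the Figure 2 caption


class SkeletonConditionedPrompt:
    """Hypothetical prompt builder: context vectors shifted by a meta-net."""

    def __init__(self, num_ctx=4, skel_dim=256, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        # Learnable context vectors V_1..V_M (randomly initialised here).
        self.ctx = rng.normal(0.0, 0.02, size=(num_ctx, EMBED_DIM))
        # Meta-net weights (assumed two-layer MLP without biases).
        self.w1 = rng.normal(0.0, 0.02, size=(skel_dim, hidden))
        self.w2 = rng.normal(0.0, 0.02, size=(hidden, EMBED_DIM))

    def meta_net(self, skel_feat):
        """Map skeleton features (batch, skel_dim) to shifts (batch, 512)."""
        return np.maximum(skel_feat @ self.w1, 0.0) @ self.w2

    def build(self, skel_feat, class_embed, class_pos):
        """Return prompts of shape (batch, num_ctx + 1, 512).

        class_embed: (512,) embedding of the class token.
        class_pos: index at which the class token is inserted.
        """
        shift = self.meta_net(skel_feat)            # (batch, 512)
        ctx = self.ctx[None] + shift[:, None, :]    # (batch, num_ctx, 512)
        batch = ctx.shape[0]
        cls = np.broadcast_to(class_embed, (batch, 1, EMBED_DIM))
        return np.concatenate(
            [ctx[:, :class_pos], cls, ctx[:, class_pos:]], axis=1
        )
```

In a real training setup the context vectors and meta-net weights would be parameters updated by backpropagation, and the resulting prompt would be passed through the text encoder; NumPy is used here only to show the shapes and the conditioning step.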

Paper Structure (17 sections, 5 equations, 3 figures, 4 tables)


Figures (3)

  • Figure 1: A construction worker providing instructions to the Volvo autonomous wheel loader, demonstrating the integration of manual guidance and automated processes for higher operational efficiency and safety.
  • Figure 2: (a) Our model consists of text, skeleton, and image encoders, alongside a temporal visual module and a fusion module to integrate skeleton features (SF) and visual features (VF). During training, the text encoder uses learnable prompts (LP) to guide the skeleton encoder. A meta-net encodes skeleton features and adds them to the LPs (formatted as $V_1 V_2 \dots [\text{CLASS}] \dots V_M$, where each $V_i$ is a 512D vector) for different limbs (global, head, hands, hip, legs) to support text feature (TF) extraction. During testing, the text encoder is removed, and the model uses only RGB images and skeleton data. (b) Our fusion module combines skeleton and visual features by emphasizing the most relevant aspects of each modality. The fine-grained down-sampling module is applied similarly to both, but skeleton features require two layers, while visual features need only one.
  • Figure 3: A sample of the image frames and skeleton joints picked by the fine-grained down-sampling module showing the most informative data for detecting the human action.
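
The selection of informative frames and joints described in the captions for Figures 2(b) and 3 can be illustrated with a simple top-k sketch. This is an assumption-laden simplification, not the paper's module: the actual fusion uses attention and transformer mechanisms, whereas the code below scores each item with a single linear weight vector and keeps the k highest-scoring items in their original order. Reading the "two layers" for skeleton features as one selection over frames followed by one over joints is also an assumption, and the function names are hypothetical.

```python
import numpy as np


def select_salient(features, score_weights, k):
    """Keep the k highest-scoring rows of `features`.

    features: (N, D) per-frame or per-joint features.
    score_weights: (D,) scoring vector (learned in practice, assumed here).
    Returns the kept indices in original order and the kept features.
    """
    scores = features @ score_weights    # (N,) one saliency score per item
    top = np.argsort(scores)[-k:]        # indices of the k largest scores
    keep = np.sort(top)                  # restore temporal / joint order
    return keep, features[keep]


def downsample_skeleton(skel, w_frame, w_joint, k_frames, k_joints):
    """Two-stage selection on skeleton features of shape (T, J, D).

    Stage 1 keeps the most salient frames; stage 2 keeps the most salient
    joints, scored on the frames that survived stage 1.
    """
    frame_feat = skel.mean(axis=1)               # (T, D) pooled over joints
    f_idx, _ = select_salient(frame_feat, w_frame, k_frames)
    joint_feat = skel[f_idx].mean(axis=0)        # (J, D) pooled over kept frames
    j_idx, _ = select_salient(joint_feat, w_joint, k_joints)
    return f_idx, j_idx, skel[f_idx][:, j_idx]
```

For visual features, a single call to `select_salient` over per-frame features would correspond to the one-layer case mentioned in the Figure 2 caption.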