Limitations in Employing Natural Language Supervision for Sensor-Based Human Activity Recognition -- And Ways to Overcome Them

Harish Haresamudram; Apoorva Beedu; Mashfiqui Rabbi; Sankalita Saha; Irfan Essa; Thomas Ploetz

TL;DR

The paper examines natural language supervision (NLS) for wearable sensor-based HAR and finds that zero-shot NLS underperforms supervised and self-supervised baselines because of sensor heterogeneity and a lack of rich activity descriptions. It proposes two practical mitigation strategies: adapting projection layers on target data to bridge distribution gaps, and enriching textual activity descriptions with diverse templates and external knowledge from large language models. Through extensive experiments on Capture-24 and six target HAR datasets, the authors show that targeted adaptation yields 30–50% gains and that text diversification can provide additional improvements, including better zero-shot recognition and cross-modal video retrieval. The work highlights the potential of NLS to extend HAR beyond traditional classification, toward adaptable, cross-modal wearable sensing models and practical search over videos.

Abstract

Cross-modal contrastive pre-training between natural language and other modalities, e.g., vision and audio, has demonstrated astonishing performance and effectiveness across a wide variety of tasks and domains. In this paper, we investigate whether such natural language supervision can be used for wearable sensor-based Human Activity Recognition (HAR), and discover that, surprisingly, it performs substantially worse than standard end-to-end training and self-supervision. We identify the primary causes as sensor heterogeneity and the lack of rich, diverse text descriptions of activities. To mitigate their impact, we develop strategies and assess their effectiveness through an extensive experimental evaluation. These strategies lead to significant increases in activity recognition performance, bringing it closer to supervised and self-supervised training, while also enabling the recognition of unseen activities and cross-modal retrieval of videos. Overall, our work paves the way for better sensor-language learning, ultimately leading to the development of foundational models for HAR using wearables.
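To make the idea of cross-modal contrastive pre-training concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss over paired sensor and text embeddings. The exact loss used in the paper is not given in this excerpt, so the symmetric cross-entropy form, the function names, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (assumes a non-zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct "class" for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    # sensor_emb[i] and text_emb[i] are assumed to form a matching pair.
    s = [l2_normalize(v) for v in sensor_emb]
    t = [l2_normalize(v) for v in text_emb]
    # Scaled cosine similarity between every sensor window and every sentence.
    logits = [
        [sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
        for si in s
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the sensor-to-text and text-to-sensor directions.
    return 0.5 * (
        cross_entropy_diagonal(logits)
        + cross_entropy_diagonal(logits_transposed)
    )


if __name__ == "__main__":
    sensors = [[1.0, 0.0], [0.0, 1.0]]  # toy embeddings, not real data
    texts = [[1.0, 0.0], [0.0, 1.0]]
    print(symmetric_contrastive_loss(sensors, texts))
```

In this toy setting the loss is close to zero when each sensor embedding points in the same direction as its paired text embedding, and it grows when the pairs are swapped. In a real model the embeddings would come from trained sensor and text encoders.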

Paper Structure (40 sections, 2 equations, 12 figures, 15 tables)


Introduction
Related Work
Natural Language Supervision for HAR
Cross-Modal Contrastive Pre-Training
HAR Through Text-Based Classification
Experimental Settings
Standard HAR setup
Datasets
Sampling Rate and Segmentation
Plug-and-Play NLS for HAR
Human Activity Recognition Experiments
Baselines
Results
Challenges
Tackling NLS for HAR Challenges
...and 25 more sections

Figures (12)

Figure 1: Difference in performance between supervised and self-supervised training and natural language supervision.
Figure 2: Natural language supervision for sensor-based HAR: the network is pre-trained by learning to accurately match windows of sensor data to the corresponding ground truth activities in the form of textual descriptions. HAR is then performed by computing cosine similarity scores between windows of test sensor data and all activity sentences. The sentence with the highest similarity score determines the final activity output (lower right part in phase 2). This figure is inspired by Radford et al. (2021).
Figure 3: Adapting projection layers increases HAR performance of sensor-based NLS by 20-40%.
Figure 4: Adaptation on target data: access to even small quantities of target data (<2 min) substantially improves performance. Full figure in the Appendix.
Figure 5: Evaluating cross-modal retrieval capabilities: for four of the six activities from Motionsense, correct videos are retrieved among the top-5 matches. 'GT' is the ground truth label from RealWorld, whereas 'XCLIP' comprises predictions from the pre-trained X-CLIP model. Full figure in the Appendix.
...and 7 more figures
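
The text-based classification step described in the Figure 2 caption (cosine similarity between a test window and every activity sentence, then picking the best match) can be sketched as follows. The embeddings here are hand-written toy vectors and the function names are hypothetical. In practice the vectors would come from the pre-trained sensor and text encoders, which are not shown.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_window(window_emb, sentence_embs):
    # sentence_embs maps an activity label to the embedding of its sentence.
    scores = {
        label: cosine_similarity(window_emb, emb)
        for label, emb in sentence_embs.items()
    }
    # The activity whose sentence is most similar to the window wins.
    best_label = max(scores, key=scores.get)
    return best_label, scores


if __name__ == "__main__":
    # Toy embeddings for illustration only, not taken from the paper.
    sentences = {
        "walking": [1.0, 0.0, 0.0],
        "sitting": [0.0, 1.0, 0.0],
        "running": [0.7, 0.0, 0.7],
    }
    label, scores = classify_window([0.9, 0.1, 0.0], sentences)
    print(label, scores)
```

Because the label set is just a dictionary of sentence embeddings, adding an unseen activity only requires embedding a new sentence, which is what makes zero-shot recognition possible in this setup.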
