Limitations in Employing Natural Language Supervision for Sensor-Based Human Activity Recognition -- And Ways to Overcome Them
Harish Haresamudram, Apoorva Beedu, Mashfiqui Rabbi, Sankalita Saha, Irfan Essa, Thomas Ploetz
TL;DR
The paper examines natural language supervision (NLS) for wearable sensor-based HAR and finds that zero-shot NLS underperforms supervised and self-supervised baselines because of sensor heterogeneity and a lack of rich, diverse activity descriptions. It proposes two practical mitigation strategies: adapting projection layers on target data to bridge distribution gaps, and enriching textual activity descriptions with diverse templates and external knowledge from large language models. In extensive experiments on Capture-24 and six target HAR datasets, the authors show that targeted adaptation yields 30–50% gains and that text diversification can provide additional improvements, including better zero-shot recognition and cross-modal video retrieval. The work highlights the potential of NLS to extend HAR beyond traditional classification, toward adaptable, cross-modal wearable sensing models and practical search over videos.
Abstract
Cross-modal contrastive pre-training between natural language and other modalities, e.g., vision and audio, has demonstrated astonishing performance and effectiveness across a wide variety of tasks and domains. In this paper, we investigate whether such natural language supervision can be used for wearable sensor-based Human Activity Recognition (HAR), and discover that, surprisingly, it performs substantially worse than standard end-to-end training and self-supervision. We identify two primary causes: sensor heterogeneity and the lack of rich, diverse text descriptions of activities. To mitigate their impact, we develop strategies and assess their effectiveness through an extensive experimental evaluation. These strategies lead to significant increases in activity recognition performance, bringing it closer to supervised and self-supervised training, while also enabling the recognition of unseen activities and cross-modal retrieval of videos. Overall, our work paves the way for better sensor-language learning, ultimately leading to the development of foundational models for HAR using wearables.
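To make the idea of zero-shot recognition with diversified text descriptions concrete, the following is a minimal sketch and not the authors' implementation. It assumes a sensor encoder and a text encoder that map into a shared embedding space; here both are replaced by small hand-written toy vectors. The template strings, function names, and embedding values are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical prompt templates; the paper's actual templates are not shown here.
TEMPLATES = [
    "a person is {}",
    "wrist accelerometer data of someone {}",
    "this is a recording of {}",
]

def build_prompts(activity):
    """Expand one activity label into several text descriptions."""
    return [t.format(activity) for t in TEMPLATES]

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_prototype(text_embeddings):
    """Average the unit-normalized embeddings of all prompts for one activity."""
    unit = [normalize(e) for e in text_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)

def zero_shot_predict(sensor_embedding, prototypes):
    """Return the label whose text prototype has the highest cosine similarity."""
    s = normalize(sensor_embedding)
    scores = {
        label: sum(a * b for a, b in zip(s, p))
        for label, p in prototypes.items()
    }
    return max(scores, key=scores.get), scores

# Toy stand-ins for text-encoder outputs (one vector per prompt); assumed values.
toy_text_embeddings = {
    "walking": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [1.0, 0.0, 0.1]],
    "sleeping": [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9], [0.0, 0.2, 1.0]],
}
prototypes = {k: class_prototype(v) for k, v in toy_text_embeddings.items()}

# Toy stand-in for the sensor-encoder output of one window; assumed values.
label, scores = zero_shot_predict([0.8, 0.3, 0.1], prototypes)
```

Averaging over several prompts per activity is one common way to use diverse descriptions at inference time. Because classes are defined only by their text prototypes, an unseen activity can be added by supplying new prompts, without retraining the sensor encoder.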
