Large Language Model-Guided Semantic Alignment for Human Activity Recognition
Hua Yan, Heng Tan, Yi Ding, Pengfei Zhou, Vinod Namboodiri, Yu Yang
TL;DR
Cross-dataset HAR suffers from distribution gaps and the emergence of unseen activities. The authors introduce LanHAR, which uses Large Language Models to generate semantic interpretations of both sensor readings and activity labels, then aligns them through a two-stage training pipeline that yields a lightweight sensor encoder for on-device HAR. A text encoder trained with contrastive and reconstruction objectives and a Transformer-based sensor encoder together map IMU data into a language-space representation, enabling cross-dataset generalization and zero-shot recognition of new activities. Across five public HAR datasets, LanHAR consistently outperforms state-of-the-art methods in cross-dataset and new-activity settings, achieving up to a 7.35 percentage-point gain in accuracy and a 13.16-point gain in F1, with notable improvements in new-activity accuracy (43.67%). The approach also supports privacy-preserving mobile deployment and offers a flexible framework for incorporating stronger LLMs and physics-informed semantics in future work.
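To make the contrastive alignment idea concrete, here is a minimal sketch of a symmetric InfoNCE-style loss that pulls each sensor embedding toward its paired text embedding and pushes it away from the others in the batch. This is an illustration only: the function names, the temperature value, and the toy 2-D embeddings are assumptions, and the summary does not state the exact loss formulation LanHAR uses.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, targets, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to targets[i] among all targets."""
    anchors = [l2_normalize(a) for a in anchors]
    targets = [l2_normalize(t) for t in targets]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


def symmetric_contrastive_loss(sensor_emb, text_emb, temperature=0.1):
    """Average of the sensor-to-text and text-to-sensor InfoNCE terms."""
    return 0.5 * (
        info_nce(sensor_emb, text_emb, temperature)
        + info_nce(text_emb, sensor_emb, temperature)
    )


# Toy batch (hypothetical values): row i of each list describes the same sample.
text = [[0.9, 0.1], [0.1, 0.9]]
aligned_sensor = [[1.0, 0.0], [0.0, 1.0]]
shuffled_sensor = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(aligned_sensor, text)
loss_shuffled = symmetric_contrastive_loss(shuffled_sensor, text)
print(loss_aligned < loss_shuffled)
```

In this toy batch the correctly paired embeddings produce a much lower loss than the shuffled ones, which is the signal a sensor encoder would be trained to minimize.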
Abstract
Human Activity Recognition (HAR) using Inertial Measurement Unit (IMU) sensors is critical for applications in healthcare, safety, and industrial production. However, variations in activity patterns, device types, and sensor placements create distribution gaps across datasets, reducing the performance of HAR models. To address this, we propose LanHAR, a novel system that leverages Large Language Models (LLMs) to generate semantic interpretations of sensor readings and activity labels for cross-dataset HAR. This approach not only mitigates cross-dataset heterogeneity but also enhances the recognition of new activities. LanHAR employs an iterative re-generation method to produce high-quality semantic interpretations with LLMs and a two-stage training framework that bridges the semantic interpretations of sensor readings and activity labels. This ultimately leads to a lightweight sensor encoder suitable for mobile deployment, enabling any sensor reading to be mapped into the semantic interpretation space. Experiments on five public datasets demonstrate that our approach significantly outperforms state-of-the-art methods in both cross-dataset HAR and new activity recognition. The source code is publicly available at https://github.com/DASHLab/LanHAR.
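Once sensor readings and activity labels live in the same semantic space, recognizing a new activity can be framed as a nearest-label lookup. The sketch below assumes cosine similarity as the matching rule and uses made-up 3-D embeddings; the names `zero_shot_predict` and `label_embeddings` are hypothetical, and the abstract does not specify the matching procedure LanHAR actually uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def zero_shot_predict(sensor_embedding, label_embeddings):
    """Return the label whose embedding is most similar to the sensor embedding."""
    scores = {
        label: cosine(sensor_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical label embeddings; "cycling" stands in for an activity that was
# never seen with sensor data during training, only described in text.
label_embeddings = {
    "walking": [1.0, 0.0, 0.0],
    "sitting": [0.0, 1.0, 0.0],
    "cycling": [0.0, 0.0, 1.0],
}

# Hypothetical output of the sensor encoder for one IMU window.
sensor_embedding = [0.1, 0.2, 0.9]

prediction, scores = zero_shot_predict(sensor_embedding, label_embeddings)
print(prediction)
```

Because the lookup depends only on label embeddings, adding a new activity in this sketch means adding one more entry to `label_embeddings`, with no retraining of the sensor encoder.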
