Large Language Model-Guided Semantic Alignment for Human Activity Recognition

Hua Yan; Heng Tan; Yi Ding; Pengfei Zhou; Vinod Namboodiri; Yu Yang

TL;DR

Cross-dataset HAR suffers from distribution gaps and the emergence of unseen activities. The authors introduce LanHAR, which leverages Large Language Models to generate semantic interpretations of both sensor readings and activity labels, then aligns these through a two-stage training pipeline with a lightweight sensor encoder for on-device HAR. A text encoder with contrastive and reconstruction objectives, plus a Transformer-based sensor encoder, maps IMU data into a language-space representation, enabling cross-dataset generalization and zero-shot recognition of new activities. Across five public HAR datasets, LanHAR consistently outperforms state-of-the-art methods in cross-dataset and new-activity settings, achieving up to a 7.35 percentage-point gain in accuracy and a 13.16-point gain in F1, with notable improvements in new-activity accuracy (43.67%). The approach also supports privacy-preserving mobile deployment and offers a flexible framework to incorporate stronger LLMs and physics-informed semantics in future work.

Abstract

Human Activity Recognition (HAR) using Inertial Measurement Unit (IMU) sensors is critical for applications in healthcare, safety, and industrial production. However, variations in activity patterns, device types, and sensor placements create distribution gaps across datasets, reducing the performance of HAR models. To address this, we propose LanHAR, a novel system that leverages Large Language Models (LLMs) to generate semantic interpretations of sensor readings and activity labels for cross-dataset HAR. This approach not only mitigates cross-dataset heterogeneity but also enhances the recognition of new activities. LanHAR employs an iterative re-generation method to produce high-quality semantic interpretations with LLMs and a two-stage training framework that bridges the semantic interpretations of sensor readings and activity labels. This ultimately leads to a lightweight sensor encoder suitable for mobile deployment, enabling any sensor reading to be mapped into the semantic interpretation space. Experiments on five public datasets demonstrate that our approach significantly outperforms state-of-the-art methods in both cross-dataset HAR and new activity recognition. The source code is publicly available at https://github.com/DASHLab/LanHAR.
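The idea of mapping any sensor reading into the semantic interpretation space suggests a simple inference step: embed the sensor window, then pick the activity label whose stored embedding is most similar. The following is a minimal sketch of that step only, not the authors' implementation. The function names, the use of cosine similarity, and the three-dimensional toy embeddings are all assumptions made for illustration; in LanHAR the embeddings would come from the trained sensor encoder and text encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognize_activity(sensor_embedding, label_embeddings):
    """Return the best-matching label and the similarity score for every label.

    sensor_embedding: vector produced by a (hypothetical) sensor encoder.
    label_embeddings: dict mapping label name -> pre-stored label embedding.
    """
    scores = {
        label: cosine_similarity(sensor_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy, hand-written embeddings (assumed values, not taken from the paper).
label_embeddings = {
    "walking": [0.9, 0.1, 0.0],
    "sitting": [0.0, 0.2, 0.9],
    "cycling": [0.5, 0.8, 0.1],
}
sensor_embedding = [0.8, 0.2, 0.1]

predicted, scores = recognize_activity(sensor_embedding, label_embeddings)

# A new activity only needs a new label embedding; the matching code is unchanged.
label_embeddings["climbing stairs"] = [0.7, 0.1, 0.7]
predicted_with_new_label, _ = recognize_activity([0.6, 0.1, 0.7], label_embeddings)
```

Because classification here is a nearest-neighbor lookup over label embeddings, supporting a previously unseen activity amounts to adding one more entry to the label table, which is the property the abstract relies on for new activity recognition.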

Paper Structure (49 sections, 8 equations, 18 figures, 5 tables)


Introduction
Related work
Human activity recognition
LLM for human activity recognition
Motivation
LLMs possess the ability to perceive the physical world
Semantic interpretations of sensor readings for cross-dataset HAR
New activity recognition
System design
Problem formulation
Design overview
LLMs for semantic interpretations
Obtaining semantic interpretations of sensor readings
Obtaining semantic interpretations of activity labels
An iterative re-generation method for ensuring the quality of LLM responses
...and 34 more sections

Figures (18)

Figure 1: Semantic interpretations for HAR
Figure 2: Example of semantic interpretations of sensor readings and activity labels
Figure 3: Data distribution under three settings across two datasets
Figure 4: Semantic interpretations for new activities
Figure 5: Overview of LanHAR. (1) Use LLMs to generate semantic interpretations of sensor readings and activity labels. (2) Train a text encoder to encode the two types of semantic interpretations and align them. $H_{i}$ and $Z_{i}$ denote the embeddings of the semantic interpretations of activity labels and sensor readings, respectively. (3) Train a sensor encoder to align sensor readings with their semantic interpretations. $E_{i}$ denotes the embedding of a sensor reading. (4) At inference, use only the sensor encoder to generate the embedding $E_{i}$ of a sensor reading, then compute its similarity with the pre-stored activity-label embeddings $H_{i}$ to obtain the human activity recognition result.
...and 13 more figures
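Step (3) of Figure 5 aligns sensor embeddings $E_{i}$ with the embeddings $Z_{i}$ of their semantic interpretations. The exact objective is not given on this page, so the sketch below swaps in a generic InfoNCE-style contrastive loss as a stand-in: each sensor embedding is pulled toward its own interpretation and pushed away from the others in the batch. The function names, the temperature value, and the toy vectors are assumptions for illustration, not the authors' formulation.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0.0 else list(vector)


def info_nce_loss(sensor_embeddings, text_embeddings, temperature=0.1):
    """Generic InfoNCE-style loss (an assumed stand-in, not the paper's objective).

    Row i of sensor_embeddings is treated as the positive match for row i of
    text_embeddings; every other row in the batch acts as a negative.
    """
    e = [l2_normalize(v) for v in sensor_embeddings]
    z = [l2_normalize(v) for v in text_embeddings]
    total = 0.0
    for i, e_i in enumerate(e):
        # Temperature-scaled cosine similarities between E_i and every Z_j.
        logits = [sum(a * b for a, b in zip(e_i, z_j)) / temperature for z_j in z]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching index i as the target class.
        total += log_sum_exp - logits[i]
    return total / len(e)


# Toy batch (assumed values): two sensor embeddings and their interpretations.
E = [[1.0, 0.0], [0.0, 1.0]]
Z = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce_loss(E, Z)          # correct pairing
shuffled_loss = info_nce_loss(E, Z[::-1])   # deliberately mismatched pairing
```

With correctly paired rows the loss is close to zero, while the mismatched pairing yields a much larger value, which is the signal a sensor encoder would be trained to minimize under this assumed objective.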
