Semantically Encoding Activity Labels for Context-Aware Human Activity Recognition

Wen Ge, Guanyi Mou, Emmanuel O. Agu, Kyumin Lee

TL;DR

Context-Aware Human Activity Recognition (CA-HAR) is traditionally treated as multi-label classification with binary labels, which discards the semantic relations among activities and contexts. SEAL introduces a language-model-based CA-HAR label encoder and a cross-modal alignment framework that maps time-series sensor data and textual labels into a shared embedding space, enabling similarity-based label inference. The approach preserves semantic relationships among activities and contexts, yielding systematic improvements over state-of-the-art baselines across three real-world CA-HAR datasets, including notable gains on rare and short-term actions. This language-model-driven semantic encoding opens avenues for stronger, more interpretable CA-HAR models and potential multi-modal extensions.

Abstract

Prior work has primarily formulated CA-HAR as a multi-label classification problem, where model inputs are time-series sensor data and target labels are binary encodings representing whether a given activity or context occurs. These CA-HAR methods either predicted each label independently or manually imposed relationships using graphs. However, both strategies often neglect an essential aspect: activity labels have rich semantic relationships. For instance, walking, jogging, and running share similar movement patterns but differ in pace and intensity, indicating that they are semantically related. Consequently, prior CA-HAR methods often struggled to capture these inherent and nuanced relationships, particularly on the noisily labeled datasets typically used for CA-HAR or in situations where the ideal sensor type is unavailable (e.g., recognizing speech without audio sensors). To address this limitation, we propose SEAL, which leverages language models (LMs) to encode CA-HAR activity labels and capture their semantic relationships. LMs generate vector embeddings that preserve rich semantic information from natural language. Our SEAL approach encodes input time-series sensor data from smart devices and their associated activity and context labels (text) as vector embeddings. During training, SEAL aligns the sensor data representations with their corresponding activity/context label embeddings in a shared embedding space. At inference time, SEAL performs a similarity search, returning the CA-HAR label whose embedding is closest to that of the input data. Although LMs have been widely explored in other domains, their potential in CA-HAR has been surprisingly underexplored, making our approach a novel contribution to the field. Our research opens up new possibilities for integrating more advanced LMs into CA-HAR tasks.
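
The inference step described in the abstract (returning the label whose embedding is closest to the input) can be illustrated with a small sketch. This is not the authors' implementation: the three-dimensional vectors are made-up toy values standing in for real encoder outputs, cosine similarity is assumed as the similarity measure, and the helper names `cosine_similarity` and `nearest_label` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_label(sensor_embedding, label_embeddings):
    """Return (label, score) for the label embedding most similar to the input.

    Hypothetical helper: assumes both modalities already live in the same
    embedding space, as the abstract describes.
    """
    scored = (
        (label, cosine_similarity(sensor_embedding, emb))
        for label, emb in label_embeddings.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Toy label embeddings (assumed values, not produced by any language model).
# "walking" and "running" are deliberately placed near each other to mimic
# semantically related labels.
label_embeddings = {
    "walking": [0.9, 0.1, 0.0],
    "running": [0.7, 0.7, 0.0],
    "sleeping": [0.0, 0.1, 0.9],
}

# Toy embedding of one window of sensor data (assumed value).
sensor_embedding = [0.8, 0.2, 0.1]

label, score = nearest_label(sensor_embedding, label_embeddings)
print(label, round(score, 3))
```

In this toy setup the related label "running" also scores well above "sleeping", which is the kind of structure a binary label encoding cannot express. A multi-label variant could threshold the per-label scores instead of taking only the maximum; that choice is likewise an assumption here.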

Paper Structure

This paper contains 25 sections, 10 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison of the traditional machine learning approach vs. our multi-modality alignment approach. Traditional approaches map labels directly to binary values. In contrast, our approach leverages a language model to encode the semantic relationships between contexts and activities in high-dimensional vector representations, which leads to better performance.
  • Figure 2: Comparison of other HAR models that use language models vs. our design. While other approaches mainly use LMs as auxiliary components that provide guidance without directly participating in the decision-making process, our approach integrates the LM as a primary contributor, allowing it to take an active part in the activity recognition task.
  • Figure 3: The SEAL framework consists of three main components: a Sensor Data Encoder, a CA-HAR Label Encoder, and a Modal Alignment component. The Sensor Data Encoder transforms input sensor data into vector embeddings, while the Label Encoder generates semantic label embeddings from tokenized label sentences. Finally, the Modal Alignment component aligns the sensor data and CA-HAR label representations by maximizing their similarity, enabling SEAL to make accurate predictions. "Trm" denotes the transformer modules within language models.
  • Figure 4: Results of SEAL using different backbones. SEAL improves performance with all backbones across all datasets.
  • Figure 5: UMAP visualization of the text embeddings learned by SEAL across all datasets, with clusters generated by KMeans. The clusters indicate SEAL's ability to capture the semantic relationships between activities and contexts.
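
The Modal Alignment component in Figure 3 is described only as maximizing the similarity between sensor and label representations. One plausible way to express such an objective is sketched below; it is an assumption, not the loss used in the paper. The sketch applies a sigmoid to the temperature-scaled cosine similarity and scores it with binary cross-entropy against multi-hot targets. The function name `alignment_loss`, the `temperature` value, and the toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def alignment_loss(sensor_batch, label_embs, targets, temperature=0.1):
    """Mean binary cross-entropy over all (sample, label) pairs.

    Assumed objective: a pair with target 1 is pushed toward high cosine
    similarity, a pair with target 0 toward low similarity.
    """
    total, count = 0.0, 0
    for sensor, target_row in zip(sensor_batch, targets):
        for label_emb, y in zip(label_embs, target_row):
            logit = cosine(sensor, label_emb) / temperature
            p = 1.0 / (1.0 + math.exp(-logit))
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count


# Toy data (assumed): two samples, two labels, 2-D embeddings.
sensor_batch = [[1.0, 0.0], [0.0, 1.0]]
label_embs = [[1.0, 0.0], [0.0, 1.0]]

aligned_targets = [[1, 0], [0, 1]]   # each sample matches its nearby label
swapped_targets = [[0, 1], [1, 0]]   # each sample matches the distant label

loss_aligned = alignment_loss(sensor_batch, label_embs, aligned_targets)
loss_swapped = alignment_loss(sensor_batch, label_embs, swapped_targets)
print(loss_aligned < loss_swapped)
```

The loss is lower when each sensor embedding already lies near its own label's embedding, so minimizing it during training would pull matching pairs together in the shared space.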