Semantically Encoding Activity Labels for Context-Aware Human Activity Recognition
Wen Ge, Guanyi Mou, Emmanuel O. Agu, Kyumin Lee
TL;DR
Context-aware human activity recognition (CA-HAR) is traditionally treated as multi-label classification with binary labels, which discards the semantic relations among activities and contexts. SEAL introduces a language-model-based CA-HAR label encoder and a cross-modal alignment framework that maps time-series sensor data and textual labels into a shared embedding space, enabling similarity-based label inference. The approach preserves semantic relationships among activities and contexts, yielding systematic improvements over state-of-the-art baselines across three real-world CA-HAR datasets, including notable gains on rare and short-term actions. This LM-driven semantic encoding opens avenues for stronger, more interpretable CA-HAR models and potential multi-modal extensions.
Abstract
Prior work has primarily formulated context-aware human activity recognition (CA-HAR) as a multi-label classification problem, where model inputs are time-series sensor data and target labels are binary encodings representing whether a given activity or context occurs. These CA-HAR methods either predicted each label independently or manually imposed relationships using graphs. However, both strategies often neglect an essential aspect: activity labels have rich semantic relationships. For instance, walking, jogging, and running share similar movement patterns but differ in pace and intensity, indicating that they are semantically related. Consequently, prior CA-HAR methods often struggled to capture these inherent and nuanced relationships, particularly on the noisy-label datasets typically used for CA-HAR or in situations where the ideal sensor type is unavailable (e.g., recognizing speech without audio sensors). To address this limitation, we propose SEAL, which leverages language models (LMs) to encode CA-HAR activity labels and capture their semantic relationships. LMs generate vector embeddings that preserve rich semantic information from natural language. Our SEAL approach encodes input time-series sensor data from smart devices and their associated activity and context labels (text) as vector embeddings. During training, SEAL aligns the sensor data representations with their corresponding activity/context label embeddings in a shared embedding space. At inference time, SEAL performs a similarity search, returning the CA-HAR label whose embedding is closest to that of the input data. Although LMs have been widely explored in other domains, their potential in CA-HAR remains underexplored, making our approach a novel contribution to the field. Our research opens up new possibilities for integrating more advanced LMs into CA-HAR tasks.
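The inference step described above, a similarity search over label embeddings in a shared space, can be illustrated with a minimal sketch. This is not the SEAL implementation: the three-dimensional vectors, the label set, and the helper names (cosine_similarity, rank_labels, predict_label) are hypothetical, and cosine similarity is assumed as the closeness measure. In practice the label vectors would come from a language model and the sensor vector from a trained sensor encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical label embeddings standing in for LM outputs.
# "walking" and "running" are deliberately placed near each other
# to mimic the semantic relatedness of the two activities.
LABEL_EMBEDDINGS = {
    "walking": [0.9, 0.1, 0.0],
    "running": [0.7, 0.7, 0.0],
    "sitting": [0.0, 0.1, 0.9],
}


def rank_labels(sensor_embedding, label_embeddings):
    """Return (label, score) pairs sorted from most to least similar."""
    scores = {
        label: cosine_similarity(sensor_embedding, vec)
        for label, vec in label_embeddings.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def predict_label(sensor_embedding, label_embeddings):
    """Return the label whose embedding is closest to the sensor embedding."""
    return rank_labels(sensor_embedding, label_embeddings)[0][0]


# Hypothetical output of a sensor encoder for one time-series window.
sensor_embedding = [0.8, 0.6, 0.1]

if __name__ == "__main__":
    for label, score in rank_labels(sensor_embedding, LABEL_EMBEDDINGS):
        print(f"{label}: {score:.3f}")
    print("predicted:", predict_label(sensor_embedding, LABEL_EMBEDDINGS))
```

With these toy vectors the closest label is "running", with "walking" ranked second and "sitting" last. The ranking shows how a shared embedding space lets related activities score close to one another, which independent binary labels cannot express.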
