Matching Skeleton-based Activity Representations with Heterogeneous Signals for HAR

Shuheng Li, Jiayun Zhang, Xiaohan Fu, Xiyuan Zhang, Jingbo Shang, Rajesh K. Gupta

TL;DR

SKELAR tackles cross-sensor HAR by anchoring activity representations to physical motion. It pretrains a skeleton encoder using a coarse angle reconstruction objective, then uses a self-attention label-matching module to align these representations with heterogeneous signals from IMU, WiFi, and Kinect data, achieving strong results in both full-shot and few-shot settings. Experiments on the MASD dataset and with synthetic skeleton data indicate that the approach applies across modalities and unseen labels, even when little skeleton data is available. Overall, SKELAR enables more generalizable, motion-centric HAR across diverse sensing environments and can be augmented at scale with synthetic data.

Abstract

In human activity recognition (HAR), activity labels have typically been encoded in one-hot format, with a recent shift towards textual representations that provide contextual knowledge. Here, we argue that HAR should be anchored to physical motion data, as motion forms the basis of activity and applies effectively across sensing systems, whereas text is inherently limited. We propose SKELAR, a novel HAR framework that pretrains activity representations from skeleton data and matches them with heterogeneous HAR signals. Our method addresses two major challenges. (1) Capturing core motion knowledge without context-specific details: we achieve this through a self-supervised coarse angle reconstruction task that recovers joint rotation angles, which are invariant to both users and deployments. (2) Adapting the representations to downstream tasks with varying modalities and focuses: we introduce a self-attention matching module that dynamically prioritizes relevant body parts in a data-driven manner. Given the lack of corresponding labels in existing skeleton data, we establish MASD, a new HAR dataset with IMU, WiFi, and skeleton data, collected from 20 subjects performing 27 activities. This is the first broadly applicable HAR dataset with time-synchronized data across three modalities. Experiments show that SKELAR achieves state-of-the-art performance in both full-shot and few-shot settings. We also demonstrate that SKELAR can effectively leverage synthetic skeleton data to extend its use to scenarios without skeleton collection.
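The claim that joint angles are invariant to users and deployments can be illustrated with a small sketch. This is not the paper's reconstruction target: it computes a single inter-bone angle rather than full 3D rotation angles, and the helper names (`joint_angle`, `coarse_bin`, `transform`) and the 12-bin quantization are assumptions made for illustration only.

```python
import numpy as np

def joint_angle(parent, joint, child):
    """Angle in radians at `joint` between the bones joint->parent and joint->child."""
    u = np.asarray(parent, dtype=float) - np.asarray(joint, dtype=float)
    v = np.asarray(child, dtype=float) - np.asarray(joint, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def coarse_bin(angle, num_bins=12):
    """Quantize an angle in [0, pi] into `num_bins` bins (hypothetical coarsening)."""
    return min(int(angle / np.pi * num_bins), num_bins - 1)

def transform(point, yaw, scale, shift):
    """Rotate about the z-axis, scale uniformly, then translate.

    Stands in for a different body size and a different camera placement.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return scale * rot @ np.asarray(point, dtype=float) + np.asarray(shift, dtype=float)

# Made-up shoulder, elbow, and wrist coordinates for one user.
shoulder, elbow, wrist = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.2, 0.0)
angle_a = joint_angle(shoulder, elbow, wrist)

# The same pose seen for a larger user from a different sensor position.
moved = [transform(p, yaw=0.7, scale=1.3, shift=(2.0, -1.0, 0.5))
         for p in (shoulder, elbow, wrist)]
angle_b = joint_angle(*moved)

print(angle_a, angle_b, coarse_bin(angle_a), coarse_bin(angle_b))
```

The raw coordinates of the two poses differ, but the elbow angle and its coarse bin do not, which is the property that makes angles a deployment-independent training target.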

Paper Structure

This paper contains 26 sections, 7 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Skeleton data of two users in a 'sitting' posture. The 3D view displays their original coordinates, while the front and top views show the angles of essential limb joints. Despite the misaligned coordinates, the angles of their upper body joints are highly similar.
  • Figure 2: The pipeline of pretraining activity representations. We adopt an auto-encoder framework with a novel coarse angle reconstruction loss. The temporal-spatial skeleton encoder takes the skeleton time series as input and learns a representation for each body joint. For training, the decoder picks an essential joint from the list and recovers its 3D rotation angles using the representations of the chosen joint and its adjacent joints.
  • Figure 3: HAR matching pipeline. SKELAR is compatible with various sensing modalities and backbone models. For each activity, we sample few-shot skeleton data and acquire joint representations using the pretrained encoder. Self-attention is applied to highlight key body parts, and predictions are made by dot-product feature matching.
  • Figure 4: Sensor deployment site. Activities are performed within the 8ft $\times$ 8ft area marked by the blue tape. Routers on the wall act as the transmitter and the receiver. The volunteer wears a watch and phone with IMU sensors.
  • Figure 5: t-SNE of skeleton activity representations, with colors denoting the activities. The results indicate that angle-based reconstruction methods produce better-separated clusters.
  • ...and 3 more figures
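
The matching step described in the Figure 3 caption (self-attention over joint representations, then dot-product matching against signal features) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the single attention head, mean pooling over joints, random weights, and all shapes and names (`attend_joints`, `match`, 5 joints, 8 dimensions) are hypothetical. In a real system the joint representations would come from the pretrained skeleton encoder and the signal features from an IMU or WiFi backbone.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_joints(joint_reps, w_q, w_k, w_v):
    """Single-head self-attention across joints, then mean pooling.

    joint_reps: (C, J, D) per-joint representations for C activities.
    Returns one (D,) label vector per activity, shape (C, D).
    """
    q, k, v = joint_reps @ w_q, joint_reps @ w_k, joint_reps @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (C, J, J)
    attended = softmax(scores, axis=-1) @ v                   # (C, J, D)
    return attended.mean(axis=1)                              # (C, D)

def match(signal_feats, label_vecs):
    """Dot-product matching: (B, D) signal features vs (C, D) label vectors."""
    return softmax(signal_feats @ label_vecs.T, axis=-1)      # (B, C)

rng = np.random.default_rng(0)
num_classes, num_joints, dim, batch = 27, 5, 8, 4  # 27 activities as in MASD; rest made up

joint_reps = rng.normal(size=(num_classes, num_joints, dim))  # placeholder encoder output
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))  # learned in practice
signal_feats = rng.normal(size=(batch, dim))                  # placeholder backbone output

label_vecs = attend_joints(joint_reps, w_q, w_k, w_v)
probs = match(signal_feats, label_vecs)
print(probs.shape, probs.argmax(axis=1))
```

Because the label vectors are computed from skeleton data rather than stored as a fixed classifier layer, adding an activity only requires encoding a few skeleton samples for it, which is what makes the design suited to few-shot settings.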