IMUGPT 2.0: Language-Based Cross Modality Transfer for Sensor-Based Human Activity Recognition

Zikang Leng; Amitrajit Bhattacharjee; Hrudhai Rajasekhar; Lizhe Zhang; Elizabeth Bruda; Hyeokhyen Kwon; Thomas Plötz

TL;DR

The paper tackles data scarcity in sensor-based HAR by extending language-based cross-modality transfer into IMUGPT 2.0, introducing a motion filter and diversity metrics to produce relevant, diverse virtual IMU data efficiently. It demonstrates that GPT-3.5 and T2M-GPT typically yield the best downstream HAR performance, while diversity metrics can cut data-generation effort by about half and a saturation-point algorithm provides a principled stopping rule. The work systematically evaluates across five public HAR datasets and multiple classifiers, analyzes correlations between diversity and performance, and shows notable gains for multi-sensor setups, albeit with limitations linked to motion-synthesis expressivity and subtle motion capture. Overall, IMUGPT 2.0 advances practical, scalable language-driven HAR data augmentation, reducing labeling burdens and enabling broader deployment.

Abstract

One of the primary challenges in the field of human activity recognition (HAR) is the lack of large labeled datasets. This hinders the development of robust and generalizable models. Recently, cross modality transfer approaches have been explored that can alleviate the problem of data scarcity. These approaches convert existing datasets from a source modality, such as video, to a target modality (IMU). With the emergence of generative AI models such as large language models (LLMs) and text-driven motion synthesis models, language has become a promising source data modality as well, as shown in proofs of concept such as IMUGPT. In this work, we conduct a large-scale evaluation of language-based cross modality transfer to determine its effectiveness for HAR. Based on this study, we introduce two new extensions for IMUGPT that enhance its use for practical HAR application scenarios: a motion filter that removes irrelevant motion sequences to ensure the relevance of the generated virtual IMU data, and a set of metrics that measure the diversity of the generated data and thereby help determine when to stop generating virtual IMU data, for both effective and efficient processing. We demonstrate that our diversity metrics can reduce the effort needed for the generation of virtual IMU data by at least 50%, which opens up IMUGPT for practical use cases beyond a mere proof of concept.
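The idea of using diversity to decide when to stop generating data can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the two metric functions are generic textbook notions of spread (mean per-dimension standard deviation, mean distance to the centroid), and `saturation_point`, `rel_tol`, and `patience` are hypothetical names for a simple plateau detector. None of this is taken from the paper's definitions or its saturation-point algorithm.

```python
import math
import statistics


def std_diversity(features):
    """Mean per-dimension population standard deviation (illustrative metric)."""
    dims = list(zip(*features))  # transpose: one tuple per feature dimension
    return sum(statistics.pstdev(d) for d in dims) / len(dims)


def centroid_diversity(features):
    """Mean Euclidean distance of each vector to the centroid (illustrative metric)."""
    centroid = [sum(d) / len(d) for d in zip(*features)]
    return sum(math.dist(f, centroid) for f in features) / len(features)


def saturation_point(diversity_curve, rel_tol=0.01, patience=3):
    """Hypothetical stopping rule, not the paper's algorithm.

    diversity_curve[i] is the diversity measured after generating batch i.
    Returns the index of the last batch that still added diversity, once the
    relative gain has stayed below rel_tol for `patience` consecutive batches,
    or None if no plateau is found.
    """
    flat = 0
    for i in range(1, len(diversity_curve)):
        prev, cur = diversity_curve[i - 1], diversity_curve[i]
        if prev > 0:
            gain = (cur - prev) / prev
        else:
            gain = math.inf if cur > 0 else 0.0
        flat = flat + 1 if gain < rel_tol else 0
        if flat >= patience:
            return i - patience
    return None


# Toy diversity curve: large gains at first, then a plateau.
curve = [1.0, 1.5, 1.8, 1.81, 1.812, 1.813, 1.813]
print(saturation_point(curve))  # prints 2
```

In this toy run, generation could have stopped after the third batch (index 2), since the four later batches add almost no diversity. The thresholds are arbitrary and would need tuning per dataset.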

Paper Structure (47 sections, 7 equations, 4 figures, 15 tables, 1 algorithm)

Introduction
Related Work
Challenges with HAR using Wearables
Cross Modality Transfer
Text-Driven Motion Generation
IMUGPT
IMUGPT 2.0: Towards Practical Applications of Language-Based Cross Modality Transfer for HAR
Diversity Metrics
Absolute Diversity Metrics
Standard Deviation Metric
Centroid Metric [Duda2000]
Comparative Diversity
Saturation Point Identification
Filtering out incorrectly generated motion sequences
Motion Captioning
...and 32 more sections

Figures (4)

Figure 1: Overview of Leng et al.'s IMUGPT [leng2023generating]. ChatGPT is used to generate diverse textual descriptions of the specified activities. Subsequently, a motion synthesis model, T2M-GPT [zhang2023generating], generates human motion sequences using the textual descriptions. Virtual IMU data can then be extracted from the generated motion sequences and used for training HAR models. (Figure adopted from Leng et al. [leng2023generating] and used with permission)
Figure 2: Overview of the proposed motion filter. Using MotionGPT [jiang2023motiongpt], we obtain motion captions, which are textual descriptions of the input motion sequence. To obtain the motion caption, a language model takes in encoded text and motion tokens and then generates output tokens. The output tokens are decoded to recover the motion caption. Then, we pass the motion caption into an LLM to determine if the motion sequence correctly portrays a specified activity.
Figure 3: The prompts passed to the LLM for it to determine whether the given motion captions accurately describe the specified activity.
Figure 4: Example visualized motion sequences for activities in the RealWorld dataset.
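
The two-stage motion filter described in the Figure 2 caption (caption the motion, then ask an LLM whether the caption matches the activity) can be sketched as a small pipeline. This is only a structural sketch under stated assumptions: `filter_motions`, `caption_motion`, and `judge_caption` are hypothetical names, and the toy captioner and keyword-matching judge below merely stand in for MotionGPT and the LLM prompt, neither of which is called here.

```python
from typing import Callable, Iterable, List, TypeVar

Motion = TypeVar("Motion")


def filter_motions(
    motions: Iterable[Motion],
    activity: str,
    caption_motion: Callable[[Motion], str],
    judge_caption: Callable[[str, str], bool],
) -> List[Motion]:
    """Keep only motions whose caption is judged to portray `activity`.

    caption_motion: hypothetical stand-in for a motion captioning model.
    judge_caption:  hypothetical stand-in for the LLM yes/no relevance check.
    """
    kept = []
    for motion in motions:
        caption = caption_motion(motion)      # stage 1: motion -> text
        if judge_caption(caption, activity):  # stage 2: text -> keep/drop
            kept.append(motion)
    return kept


# Toy stand-ins so the sketch runs without any model (assumptions, not real APIs).
def toy_captioner(motion: dict) -> str:
    return motion["caption"]


def toy_judge(caption: str, activity: str) -> bool:
    return activity.lower() in caption.lower()


toy_motions = [
    {"id": 0, "caption": "a person is walking forward"},
    {"id": 1, "caption": "a person sits down on a chair"},
    {"id": 2, "caption": "someone keeps walking in a circle"},
]
kept = filter_motions(toy_motions, "walking", toy_captioner, toy_judge)
print([m["id"] for m in kept])  # prints [0, 2]
```

Passing the captioner and judge in as callables keeps the filtering logic independent of any particular model, so either stage can be swapped without touching the loop.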
