Exploring the Capabilities of LLMs for IMU-based Fine-grained Human Activity Understanding

Lilin Xu, Kaiyuan Hou, Xiaofan Jiang

TL;DR

This work investigates whether large language models (LLMs) can perform fine-grained, IMU-based human activity recognition (HAR), focusing on mid-air letter recognition. It shows that zero-shot LLMs perform near random, and that fine-tuning them with LoRA on instruction-answer pairs improves 2D letter recognition by up to 129x. To handle 3D mid-air data, the authors introduce a dimensionality-reduction pipeline based on deep metric learning that maps 3D IMU trajectories to 2D equivalents; it reports a 93.08% mapping accuracy and lets the 2D-based LLM letter recognizer be reused. An end-to-end word recognition pipeline reaches about 78% accuracy on words of up to 5 letters, suggesting practical viability for fine-grained HAR in AR/VR contexts. Overall, the work highlights the potential of LLMs for fine-grained IMU-based HAR and provides concrete techniques for data generation, model adaptation, and 3D-to-2D representation, with implications for robust gesture and letter recognition in real-world deployments.

Abstract

Human activity recognition (HAR) using inertial measurement units (IMUs) increasingly leverages large language models (LLMs), yet existing approaches focus on coarse activities like walking or running. Our preliminary study indicates that pretrained LLMs fail catastrophically on fine-grained HAR tasks such as air-written letter recognition, achieving only near-random guessing accuracy. In this work, we first bridge this gap for flat-surface writing scenarios: by fine-tuning LLMs with a self-collected dataset and few-shot learning, we achieved up to a 129x improvement on 2D data. To extend this to 3D scenarios, we designed an encoder-based pipeline that maps 3D data into 2D equivalents, preserving the spatiotemporal information for robust letter prediction. Our end-to-end pipeline achieves 78% accuracy on word recognition with up to 5 letters in mid-air writing scenarios, establishing LLMs as viable tools for fine-grained HAR.
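To make the idea of fine-tuning on instruction-answer pairs more concrete, the following is a minimal sketch of how one 2D IMU recording might be serialized into such a pair. It is an illustration under stated assumptions, not the authors' pipeline: the prompt wording, the `(x, y)` sample layout, the two-decimal rounding, the record keys, and the 26 uppercase-letter label set are all hypothetical. The paper's actual prompt templates are the ones shown in its Figure 4.

```python
import string

# Hypothetical instruction text; the paper's real templates are not reproduced here.
INSTRUCTION = (
    "The following are 2D IMU readings (x, y) recorded while a single letter "
    "was written on a flat surface. Which uppercase letter was written?"
)

# Assumption: the label set is the 26 uppercase English letters.
LETTERS = string.ascii_uppercase

# Accuracy of uniform random guessing over the assumed label set (1/26).
CHANCE_ACCURACY = 1 / len(LETTERS)


def format_samples(samples, decimals=2):
    """Serialize a sequence of (x, y) readings as compact text for a prompt."""
    return " ".join(f"({x:.{decimals}f}, {y:.{decimals}f})" for x, y in samples)


def make_pair(samples, letter):
    """Build one instruction-answer record for supervised fine-tuning."""
    # Check the length first: a multi-character string such as "AB" would
    # otherwise pass a plain substring test against LETTERS.
    if len(letter) != 1 or letter not in LETTERS:
        raise ValueError(f"expected one uppercase letter, got {letter!r}")
    return {
        "instruction": INSTRUCTION,
        "input": format_samples(samples),
        "output": letter,
    }


if __name__ == "__main__":
    record = make_pair([(0.1, -0.25), (1.0, 2.0)], "A")
    print(record["input"])
    print(f"chance accuracy: {CHANCE_ACCURACY:.4f}")
```

Under the 26-letter assumption, uniform guessing gives an accuracy of 1/26, which is the kind of near-random level the abstract describes for pretrained LLMs. Records of this shape could then be passed to a LoRA fine-tuning setup of the reader's choice.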

Paper Structure

This paper contains 16 sections, 1 equation, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Potential application scenarios of LLM-based fine-grained human activity understanding. Compared with small models, LLMs can enhance various context-based applications with their strong contextual understanding and generalization capabilities.
  • Figure 2: The data collection setup and process.
  • Figure 3: Visualization of collected IMU data (2D).
  • Figure 4: Prompt templates. The additional context specific to the few-shot template is highlighted in gray.
  • Figure 5: Examples of LLaMA-3-8B models' answers.
  • ...and 3 more figures