SensorLLM: Aligning Large Language Models with Motion Sensors for Human Activity Recognition

Zechen Li, Shohreh Deldari, Linyao Chen, Hao Xue, Flora D. Salim

TL;DR

SensorLLM tackles the challenge of applying large language models to wearable sensor time-series with a two-stage framework: first, Sensor-Language Alignment, which converts multivariate sensor trends into human-readable text using a Chronos-based encoder and an alignment MLP with per-channel tokens; second, Task-Aware Tuning, which freezes the backbone and trains a lightweight classifier for HAR. This approach enables LLM-driven reasoning over sensor data and demonstrates strong generalization across five HAR datasets, achieving state-of-the-art results on most benchmarks. The work shows that aligning sensor data with intuitive text can unlock robust, scalable sensor-based reasoning, paving the way for Sensor-Text Multimodal LLMs with practical impact. Code and data-generation pipelines are released to support further research in time-series and text alignment for sensors.

Abstract

We introduce SensorLLM, a two-stage framework that enables Large Language Models (LLMs) to perform human activity recognition (HAR) from sensor time-series data. Despite their strong reasoning and generalization capabilities, LLMs remain underutilized for motion sensor data due to the lack of semantic context in time-series, computational constraints, and challenges in processing numerical inputs. SensorLLM addresses these limitations through a Sensor-Language Alignment stage, where the model aligns sensor inputs with trend descriptions. Special tokens are introduced to mark channel boundaries. This alignment enables LLMs to capture numerical variations, channel-specific features, and data of varying durations, without requiring human annotations. In the subsequent Task-Aware Tuning stage, we refine the model for HAR classification, achieving performance that matches or surpasses state-of-the-art methods. Our results demonstrate that SensorLLM evolves into an effective sensor learner, reasoner, and classifier through human-intuitive Sensor-Language Alignment, generalizing across diverse HAR datasets. We believe this work establishes a foundation for future research on time-series and text alignment, paving the way for foundation models in sensor data analysis. Our code is available at https://github.com/zechenli03/SensorLLM.
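The abstract says that sensor inputs are aligned with trend descriptions produced without human annotations. As a rough illustration of how such text could be derived from raw readings, the sketch below segments one channel into rising, falling, and steady runs and renders them as a sentence. The function names, the `eps` threshold, the sentence template, and the sample values are all assumptions made for this example and are not taken from the released pipeline.

```python
from typing import List, Tuple


def segment_trends(values: List[float], eps: float = 1e-6) -> List[Tuple[int, int, str]]:
    """Merge consecutive steps that share a direction into (start, end, label) segments."""
    segments: List[Tuple[int, int, str]] = []
    for i in range(1, len(values)):
        delta = values[i] - values[i - 1]
        # Classify the step; changes smaller than eps count as "steady".
        label = "rising" if delta > eps else "falling" if delta < -eps else "steady"
        if segments and segments[-1][2] == label:
            # Same direction as the previous step: extend the current segment.
            start, _, _ = segments[-1]
            segments[-1] = (start, i, label)
        else:
            segments.append((i - 1, i, label))
    return segments


def describe_channel(name: str, values: List[float], sample_rate_hz: float) -> str:
    """Render the trend segments of one channel as a plain-text description."""
    parts = []
    for start, end, label in segment_trends(values):
        # Convert sample indices to seconds using the sampling rate.
        t0, t1 = start / sample_rate_hz, end / sample_rate_hz
        parts.append(f"{t0:.2f}s to {t1:.2f}s: {label}")
    return f"{name}: " + "; ".join(parts)


# Hypothetical readings for a single accelerometer axis sampled at 100 Hz.
readings = [0.0, 0.2, 0.5, 0.5, 0.5, 0.1]
description = describe_channel("x-axis accelerometer", readings, sample_rate_hz=100.0)
print(description)
```

Because the text is computed directly from the numbers, pairs of (sensor window, description) can be produced for windows of any length, which is consistent with the abstract's claim that no human annotation is needed.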

Paper Structure

This paper contains 76 sections, 11 equations, 4 figures, and 12 tables.

Figures (4)

  • Figure 1: SensorLLM can analyze and summarize trends in captured sensor data, facilitating human activity recognition tasks.
  • Figure 2: Our proposed SensorLLM framework: (a) Sensor-Language Alignment Stage, where a generative model aligns sensor readings with automatically generated text; (b) Task-Aware Tuning Stage, where a classification model leverages the aligned modalities to perform HAR.
  • Figure 3: Effect of the number of alignment module layers.
  • Figure 4: Effect of Model Size.
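
For the Sensor-Language Alignment stage in Figure 2(a), the abstract notes that special tokens mark channel boundaries. The sketch below shows one way such a sequence might be laid out before it reaches the LLM: each channel's projected sensor embeddings (represented here by the placeholder string `[emb]`) are wrapped in a start and end token, followed by the text prompt. The token spellings, channel names, placeholder, and whitespace tokenization are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def build_input_sequence(channel_embeddings: Dict[str, List[str]], prompt: str) -> List[str]:
    """Wrap each channel's embedding placeholders in boundary tokens, then append the prompt."""
    sequence: List[str] = []
    for channel, placeholders in channel_embeddings.items():
        sequence.append(f"<{channel}_start>")  # hypothetical channel-start token
        sequence.extend(placeholders)          # stand-ins for projected sensor embeddings
        sequence.append(f"<{channel}_end>")    # hypothetical channel-end token
    # Naive whitespace split stands in for the LLM's real tokenizer.
    sequence.extend(prompt.split())
    return sequence


# Two channels of different lengths, to show that durations may vary per channel.
channels = {"x_acc": ["[emb]"] * 3, "y_acc": ["[emb]"] * 2}
sequence = build_input_sequence(channels, "Describe the trend.")
print(sequence)
```

Explicit boundaries let the model tell where one channel stops and the next begins even when the channels contribute different numbers of embeddings.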