Large Language Models Memorize Sensor Datasets! Implications on Human Activity Recognition Research

Harish Haresamudram; Hrudhai Rajasekhar; Nikhil Murlidhar Shanbhogue; Thomas Ploetz

TL;DR

This work addresses the risk that Large Language Models (LLMs) have memorized public wearable sensor datasets used for Human Activity Recognition (HAR), potentially invalidating standard benchmark evaluations. It applies a Row Completion memorization test to GPT-4 across five HAR datasets, using 25 trials per file and few-shot prompts to assess whether the model can reproduce next sensor rows, with reproduction quantified by Levenshtein similarity. The study finds evidence of memorization for at least the Daphnet FoG dataset, while data quality and repetitive sensor values in other datasets complicate attribution, underscoring the limitations of current evaluation protocols for LLM-based HAR. The findings highlight the need for non-public datasets, representation-based evaluation, or multi-modal training approaches to ensure reliable HAR performance when leveraging LLMs, and call for revised benchmarking practices to prevent data leakage from contaminating results.
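To make the Levenshtein-based scoring concrete, here is a minimal Python sketch of how a predicted row could be compared with the true row. The normalization by the longer string's length and the function names are assumptions made for illustration; the summary above does not specify the exact formula used in the paper.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(predicted: str, actual: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical.

    Assumption: distance is normalized by the longer string's length.
    """
    longest = max(len(predicted), len(actual))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(predicted, actual) / longest
```

Under this assumed normalization, a row reproduced verbatim scores 1.0, and a row sharing no characters with the original scores 0.0.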

Abstract

The astonishing success of Large Language Models (LLMs) in Natural Language Processing (NLP) has spurred their use in many application domains beyond text analysis, including wearable sensor-based Human Activity Recognition (HAR). In such scenarios, sensor data are often fed directly into an LLM along with text instructions for the model to perform activity classification. Seemingly remarkable results have been reported for such LLM-based HAR systems when they are evaluated on standard benchmarks from the field. Yet, we argue, care has to be taken when evaluating LLM-based HAR systems in such a traditional way. Most contemporary LLMs are trained on virtually the entire (accessible) internet -- potentially including standard HAR datasets. With that, it is not unlikely that LLMs actually had access to the test data used in such benchmark experiments. The resulting contamination of training data would render these experimental evaluations meaningless. In this paper we investigate whether LLMs indeed have had access to standard HAR datasets during training. We apply memorization tests to LLMs, which involve instructing the models to extend given snippets of data. When comparing the LLM-generated output to the original data, we found a non-negligible number of matches, which suggests that the LLM under investigation has indeed seen wearable sensor data from the benchmark datasets during training. For the Daphnet dataset in particular, GPT-4 is able to reproduce blocks of sensor readings. We report on our investigations and discuss potential implications for HAR research, especially with regard to reporting results on experimental evaluation.

Paper Structure (13 sections, 6 figures, 1 table)

Introduction
Background Work and Motivation
LLMs for Time-Series Analysis
LLMs for Human Activity Recognition
Probing LLMs for Memorization
Method
Row completion test
Experimental Settings
Results
Discussion
(Accidental) Finding: Poor Quality of Wearable Sensor Data
Implications for Wearables-Based HAR
Conclusion

Figures (6)

Figure 1: The row completion test: from the text files of sensor data, we randomly sample a few rows. GPT-4 is instructed to predict the next row. This process is repeated randomly 25 times, to get 25 predictions across the sensor file.
Figure 2: Visualizing the prompt used in the row completion test: first, a few examples of successful completion are provided as context. Subsequently, GPT-4 is fed the test prefix rows and instructed to complete the next row.
Figure 3: Row completion test for a file from the Daphnet FoG dataset: here, the values in green are correct, those in red are incorrect, and those in purple are extra predictions by the LLM. We randomly sample and predict 25 rows from the file. A large portion of the sensor readings are correctly reproduced, indicating that the LLM has potentially memorized them.
Figure 4: Row completion test for a file from the MHEALTH dataset: here, the values in green are correct, those in red are incorrect, and those in purple are extra predictions by the LLM. We randomly sample and predict 25 rows from the file. Interestingly, for specific gyroscopes the values reproduced are accurate to multiple decimal places.
Figure 5: Snippet of data from the Capture-24 dataset: many timesteps of sensor data are identical, so predicting the last row from the ten previous rows results in highly accurate text generation. This also means that GPT-4 can rely more on the available context to predict future rows, rather than reproducing memorized data.
...and 1 more figure
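The sampling and prompting steps described in the captions for Figures 1 and 2 can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names, the prompt wording, the fixed seed, and the default prefix length of ten rows (taken from the Figure 5 caption) are all assumptions, and the call to the LLM itself is deliberately left out.

```python
import random


def sample_trials(rows, num_trials=25, prefix_len=10, seed=0):
    """Pick random positions in a sensor file and return (prefix_rows, next_row) pairs.

    Assumption: each trial uses `prefix_len` consecutive rows as context
    and the row immediately after them as the target.
    """
    if len(rows) <= prefix_len:
        raise ValueError("file has too few rows for the requested prefix length")
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        target_index = rng.randrange(prefix_len, len(rows))
        prefix_rows = rows[target_index - prefix_len:target_index]
        trials.append((prefix_rows, rows[target_index]))
    return trials


def build_prompt(few_shot_examples, prefix_rows):
    """Assemble a few-shot prompt: solved examples first, then the test prefix.

    The "Rows:" / "Next row:" wording is hypothetical.
    """
    parts = []
    for example_prefix, example_next in few_shot_examples:
        parts.append(
            "Rows:\n" + "\n".join(example_prefix) + "\nNext row:\n" + example_next
        )
    parts.append("Rows:\n" + "\n".join(prefix_rows) + "\nNext row:")
    return "\n\n".join(parts)
```

Each prompt would then be sent to the model, and the returned text compared with the held-out target row. Note that when consecutive rows are nearly identical, as in the Capture-24 snippet, a high match rate from this procedure cannot be attributed to memorization alone.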
