SensorQA: A Question Answering Benchmark for Daily-Life Monitoring

Benjamin Reichman, Xiaofan Yu, Lanxiang Hu, Jack Truxal, Atishay Jain, Rushil Chandrupatla, Tajana Šimunić Rosing, Larry Heck

TL;DR

SensorQA tackles the challenge of making long-term wearable sensor data accessible via natural language QA. By building a human-created dataset from real-world ExtraSensory data and visualized multi-time-scale activity graphs, it captures diverse user interests and realistic QA scenarios. Benchmark results reveal substantial gaps between state-of-the-art models and practical QA performance, especially for long-duration sensor data and edge-device efficiency, underscoring the need for new sensor-text fusion and deployment-friendly approaches. The dataset and code are openly available to spur advances in real-world, user-centric QA over sensor streams for daily-life monitoring.

Abstract

With the rapid growth in sensor data, effectively interpreting and interfacing with these data in a human-understandable way has become crucial. While existing research primarily focuses on learning classification models, fewer studies have explored how end users can actively extract useful insights from sensor data, often hindered by the lack of a proper dataset. To address this gap, we introduce SensorQA, the first human-created question-answering (QA) dataset for long-term time-series sensor data for daily life monitoring. SensorQA is created by human workers and includes 5.6K diverse and practical queries that reflect genuine human interests, paired with accurate answers derived from sensor data. We further establish benchmarks for state-of-the-art AI models on this dataset and evaluate their performance on typical edge devices. Our results reveal a gap between current models and optimal QA performance and efficiency, highlighting the need for new contributions. The dataset and code are available at: https://github.com/benjamin-reichman/SensorQA.

Paper Structure

This paper contains 10 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Visualizations of existing QA datasets using time series IMU sensor data.
  • Figure 2: Example QA pairs in SensorQA. (a) and (b) are generated from a daily graph, while (c) and (d) are generated from a multi-day graph.
  • Figure 3: Exact-match accuracy of Llama (Touvron et al., 2023), displayed by question category (left) and answer category (right).
  • Figure 4: Model memory size (left) and average answer-generation latency (right) on the Jetson TX2. Footnote $Q$ denotes models after quantization.