SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing

Pengrui Quan, Xiaomin Ouyang, Jeya Vikranth Jeyakumar, Ziqi Wang, Yang Xing, Mani Srivastava

TL;DR

This paper introduces SensorBench, a structured benchmark to quantify LLMs' abilities in processing temporal sensor data using DSP tasks across multiple modalities. Through evaluations of four LLMs under API-based coding, non-API coding, and text-based interactions, it shows that LLMs handle simple DSP tasks well but struggle with complex compositional and parameterized problems, often lagging behind human experts. Prompt engineering, particularly self-verification, markedly improves performance in a substantial portion of tasks, while fine-tuning on DSP data yields limited gains, suggesting current models rely more on retrieval than robust planning. The work provides a standardized evaluation framework and actionable insights for developing future LLM-based sensor processing copilots.

Abstract

Effective processing, interpretation, and management of sensor data have emerged as a critical component of cyber-physical systems. Traditionally, processing sensor data requires profound theoretical knowledge and proficiency in signal-processing tools. However, recent works show that Large Language Models (LLMs) have promising capabilities in processing sensory data, suggesting their potential as copilots for developing sensing systems. To explore this potential, we construct a comprehensive benchmark, SensorBench, to establish a quantifiable objective. The benchmark incorporates diverse real-world sensor datasets for various tasks. The results show that while LLMs exhibit considerable proficiency in simpler tasks, they face inherent challenges in processing compositional tasks with parameter selections compared to engineering experts. Additionally, we investigate four prompting strategies for sensor processing and show that self-verification can outperform all other baselines in 48% of tasks. Our study provides a comprehensive benchmark and prompting analysis for future developments, paving the way toward an LLM-based sensor processing copilot.

Paper Structure

This paper contains 21 sections, 2 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Sensor processing copilot. We envision an intelligent assistant to support users, making advanced sensor data analysis accessible to broader audiences.
  • Figure 2: Benchmark organization by categories.
  • Figure 3: Coding with APIs. System prompts and user queries are fed to LLMs as instructions. The models can access and process signals through Python, with the help of defined APIs.
  • Figure 4: LLMs vs. experts. For tasks that use MSE as the metric, we report $1/\text{MSE}$ to ease the comparison. A higher value on the y-axis indicates better performance. On tasks (a)-(d), (f), and (h), human experts are substantially better than LLMs.
  • Figure 5: Model output example. The model makes an invalid assumption on stop-band frequency.
  • ...and 6 more figures
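
The Figure 4 caption reports $1/\text{MSE}$ so that higher values mean better performance. A minimal sketch of that transformation follows. The function names and the small `eps` guard against division by zero are assumptions made for illustration and are not taken from the paper.

```python
def mse(reference, estimate):
    """Mean squared error between two equal-length sequences of numbers."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    if not reference:
        raise ValueError("inputs must be non-empty")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def inverse_mse(reference, estimate, eps=1e-12):
    """Return 1/MSE so that a larger score means a closer match.

    eps is a hypothetical safeguard (not from the paper) that keeps the
    score finite when the estimate matches the reference exactly.
    """
    return 1.0 / (mse(reference, estimate) + eps)


# Toy signals: the "close" estimate deviates less from the reference.
reference = [0.0, 1.0, 0.0, -1.0]
close = [0.1, 0.9, 0.0, -1.1]
far = [1.0, 0.0, 1.0, 0.0]

print(inverse_mse(reference, close) > inverse_mse(reference, far))
```

Inverting the error keeps the ranking of methods unchanged while letting MSE-based panels share the "higher is better" reading of the other metrics.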