Flash-Fusion: Enabling Expressive, Low-Latency Queries on IoT Sensor Streams with LLMs

Kausar Patherya; Ashutosh Dhekne; Francisco Romero

TL;DR

The paper tackles the challenge of making large language models (LLMs) practical for IoT analytics by proposing Flash-Fusion, a three-tier edge-cloud framework that first summarizes high-frequency sensor data on-device, then clusters these summaries in the cloud to create a compact, behavior-based vocabulary, and finally builds context-rich prompts from which an LLM generates grounded insights. Key innovations include edge-based fixed-window statistics with features such as mean, variance, percentiles, and normalized acceleration magnitude; offline $k$-means clustering into five driving-behavior categories; and an LLM query engine with intent extraction, prompt construction, contextual grounding, and automated response validation. Quantitative evaluation on a university bus dataset shows a $73.5\%$ reduction in data transmission, a $95\%$ reduction in latency, and a $98\%$ reduction in token usage and API cost compared to an LLM-only baseline, while preserving factual and geographic grounding. The work demonstrates that summarizing and structuring IoT data before LLM prompting can enable expressive, low-latency, and cost-effective analytics for cross-disciplinary stakeholders, and it makes a dataset publicly available for broader smart-city transit research.

Abstract

Smart cities and pervasive IoT deployments have generated interest in IoT data analysis across transportation and urban planning. At the same time, Large Language Models offer a new interface for exploring IoT data - particularly through natural language. Users today face two key challenges when working with IoT data using LLMs: (1) data collection infrastructure is expensive, producing terabytes of low-level sensor readings that are too granular for direct use, and (2) data analysis is slow, requiring iterative effort and technical expertise. Directly feeding all IoT telemetry to LLMs is impractical due to finite context windows, prohibitive token costs at scale, and non-interactive latencies. What is missing is a system that first parses a user's query to identify the analytical task, then selects the relevant data slices, and finally chooses the right representation before invoking an LLM. We present Flash-Fusion, an end-to-end edge-cloud system that reduces the IoT data collection and analysis burden on users. Two principles guide its design: (1) edge-based statistical summarization (achieving 73.5% data reduction) to address data volume, and (2) cloud-based query planning that clusters behavioral data and assembles context-rich prompts to address data interpretation. We deploy Flash-Fusion on a university bus fleet and evaluate it against a baseline that feeds raw data to a state-of-the-art LLM. Flash-Fusion achieves a 95% latency reduction and 98% decrease in token usage and cost while maintaining high-quality responses. It enables personas across disciplines - safety officers, urban planners, fleet managers, and data scientists - to efficiently iterate over IoT data without the burden of manual query authoring or preprocessing.
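To make the edge-side summarization step concrete, the following is a minimal sketch of fixed-window statistics over accelerometer samples. It is an illustration under stated assumptions, not the authors' implementation: the function names (`summarize_windows`, `percentile`), the choice of the 50th and 95th percentiles, the use of population variance, and normalization by standard gravity (`G = 9.81`) are all hypothetical, since the text above names the feature families but not their exact definitions.

```python
import math
import statistics

# Assumed normalization constant (m/s^2); the source does not say how
# acceleration magnitude is normalized.
G = 9.81


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a pre-sorted list, q in [0, 100]."""
    if len(sorted_vals) == 1:
        return sorted_vals[0]
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def summarize_windows(samples, window_size):
    """Summarize (ax, ay, az) samples into one dict per full fixed window.

    A trailing partial window is dropped (an assumption of this sketch).
    """
    summaries = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        window = samples[start:start + window_size]
        # Normalized acceleration magnitude for each sample in the window.
        mags = sorted(
            math.sqrt(ax * ax + ay * ay + az * az) / G for ax, ay, az in window
        )
        summaries.append({
            "mean": statistics.fmean(mags),
            "variance": statistics.pvariance(mags),
            "p50": percentile(mags, 50),
            "p95": percentile(mags, 95),
        })
    return summaries
```

In this sketch each window of `window_size` raw samples collapses to four numbers, which is the kind of compaction that reduces transmission volume; the actual reduction depends on the window length and feature set, neither of which is specified here.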
