Flash-Fusion: Enabling Expressive, Low-Latency Queries on IoT Sensor Streams with LLMs

Kausar Patherya, Ashutosh Dhekne, Francisco Romero

TL;DR

The paper aims to make large language models (LLMs) practical for IoT analytics. It proposes Flash-Fusion, a three-tier edge-cloud framework that first summarizes high-frequency sensor data on-device, then clusters these summaries in the cloud to create a compact, behavior-based vocabulary, and finally crafts context-rich prompts for an LLM to generate grounded insights. Key innovations include edge-based fixed-window statistics with features such as mean, variance, percentiles, and normalized acceleration magnitude; offline $k$-means clustering into five driving-behavior categories; and an LLM query engine with intent extraction, prompt construction, contextual grounding, and automated response validation. Quantitative evaluation on a university bus dataset shows a $73.5\%$ reduction in data transmission, a $95\%$ reduction in latency, and a $98\%$ reduction in token usage and API cost compared to an LLM-only baseline, while preserving factual and geographic grounding. The work demonstrates that summarizing and structuring IoT data before LLM prompting can enable expressive, low-latency, and cost-effective analytics for cross-disciplinary stakeholders, and it releases a public dataset for broader smart-city transit research.
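To make the edge tier concrete, here is a minimal sketch of fixed-window summarization over accelerometer samples. It is an illustration under stated assumptions, not the authors' implementation: the sample layout `(ax, ay, az)`, the choice to normalize magnitude by standard gravity, the specific percentiles, and the names `summarize_window` and `summarize_stream` are all hypothetical.

```python
import math
import statistics
from typing import Dict, List, Tuple

# Hypothetical sample layout: (ax, ay, az) in m/s^2.
Sample = Tuple[float, float, float]

# Assumption: "normalized" means divided by standard gravity.
GRAVITY = 9.80665


def percentile(sorted_vals: List[float], q: float) -> float:
    """Linearly interpolated percentile of pre-sorted values, q in [0, 100]."""
    if not sorted_vals:
        raise ValueError("empty window")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def summarize_window(samples: List[Sample]) -> Dict[str, float]:
    """Reduce one window of raw samples to a handful of statistics."""
    mags = sorted(
        math.sqrt(ax * ax + ay * ay + az * az) / GRAVITY
        for ax, ay, az in samples
    )
    return {
        "n": float(len(mags)),
        "mean": statistics.fmean(mags),
        "variance": statistics.pvariance(mags),
        "p50": percentile(mags, 50),
        "p95": percentile(mags, 95),
        "max": mags[-1],
    }


def summarize_stream(samples: List[Sample], window_size: int) -> List[Dict[str, float]]:
    """Split a stream into consecutive fixed-size windows and summarize each.

    A shorter trailing window is summarized as well rather than dropped.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [
        summarize_window(samples[i:i + window_size])
        for i in range(0, len(samples), window_size)
    ]
```

Each window collapses to six numbers regardless of how many raw samples it holds, which is the mechanism by which on-device summarization can cut transmission volume. The reduction actually achieved depends on the sampling rate, window length, and feature set, none of which are specified in this summary.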

Abstract

Smart cities and pervasive IoT deployments have generated interest in IoT data analysis across transportation and urban planning. At the same time, Large Language Models offer a new interface for exploring IoT data - particularly through natural language. Users today face two key challenges when working with IoT data using LLMs: (1) data collection infrastructure is expensive, producing terabytes of low-level sensor readings that are too granular for direct use, and (2) data analysis is slow, requiring iterative effort and technical expertise. Directly feeding all IoT telemetry to LLMs is impractical due to finite context windows, prohibitive token costs at scale, and non-interactive latencies. What is missing is a system that first parses a user's query to identify the analytical task, then selects the relevant data slices, and finally chooses the right representation before invoking an LLM. We present Flash-Fusion, an end-to-end edge-cloud system that reduces the IoT data collection and analysis burden on users. Two principles guide its design: (1) edge-based statistical summarization (achieving 73.5% data reduction) to address data volume, and (2) cloud-based query planning that clusters behavioral data and assembles context-rich prompts to address data interpretation. We deploy Flash-Fusion on a university bus fleet and evaluate it against a baseline that feeds raw data to a state-of-the-art LLM. Flash-Fusion achieves a 95% latency reduction and 98% decrease in token usage and cost while maintaining high-quality responses. It enables personas across disciplines - safety officers, urban planners, fleet managers, and data scientists - to efficiently iterate over IoT data without the burden of manual query authoring or preprocessing.
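The three query-planning steps named in the abstract (identify the analytical task, select relevant data slices, choose a representation before invoking an LLM) can be sketched as below. This is a toy illustration only: keyword matching stands in for whatever intent extraction the system actually uses, and `WindowSummary`, its fields, the intent names, the route and cluster labels, and the prompt wording are all hypothetical. No LLM is called.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WindowSummary:
    """Hypothetical record for one summarized sensor window."""
    route: str        # placeholder route identifier
    cluster: str      # placeholder behavior label
    p95_accel: float  # 95th-percentile normalized acceleration


# Assumption: a keyword table is a stand-in for real intent extraction.
INTENT_KEYWORDS = {
    "safety": ("harsh", "aggressive", "unsafe", "braking"),
    "overview": ("summary", "overall", "typical"),
}


def extract_intent(query: str) -> str:
    """Step 1: map a natural-language query to an analytical task."""
    text = query.lower()
    for intent, words in INTENT_KEYWORDS.items():
        if any(word in text for word in words):
            return intent
    return "overview"


def select_slices(summaries: List[WindowSummary], intent: str, limit: int) -> List[WindowSummary]:
    """Step 2: keep only the windows relevant to the task."""
    if intent == "safety":
        ranked = sorted(summaries, key=lambda s: s.p95_accel, reverse=True)
        return ranked[:limit]
    return list(summaries[:limit])


def build_prompt(query: str, summaries: List[WindowSummary], limit: int = 3) -> str:
    """Step 3: render the selected slices as compact text for the LLM."""
    intent = extract_intent(query)
    rows = select_slices(summaries, intent, limit)
    lines = [
        f"- route={s.route} cluster={s.cluster} p95_accel={s.p95_accel:.2f}"
        for s in rows
    ]
    return "\n".join([
        f"Task: {intent}",
        "Context (summarized sensor windows):",
        *lines,
        f"Question: {query}",
        "Answer using only the context above.",
    ])
```

Because the prompt carries a few summary rows instead of raw telemetry, its token count is bounded by `limit` rather than by the size of the underlying stream.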

Paper Structure

This paper contains 25 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Flash-Fusion: Fast, Grounded Intelligence. Sensor data are aggregated on-device, minimizing transmission overhead. Compact summaries are clustered in the cloud by behavioral type. When a user asks a question, the system analyzes their intent and fuses relevant data into the LLM prompt, yielding responses that are fast, verifiable, and grounded in real data.
  • Figure 2: Flash-Fusion's Three-Tier Architecture. Data flows from the Edge (left), where on-device processing reduces data volume, through the Cloud (center) for scalable storage and machine learning, to the LLM (right) where data is transformed into a format suitable for interactive QA.
  • Figure 3: Flash-Fusion Latency versus LLM Only. Flash-Fusion cuts query latency by 95% compared to the LLM Only baseline across five queries. Its single condensed API call replaces six sequential calls for chunk processing and synthesis. Whiskers show the minimum and maximum latencies over three query runs.
  • Figure 4: Response Comparison between Flash-Fusion & LLM Only. Flash-Fusion produces geographically grounded and actionable responses, while LLM Only returns uninterpretable coordinates, data limitation disclaimers and technical jargon.
  • Figure 5: Separation of Behavioral Clusters. The clusters are clearly distinguished by instability and extreme-event magnitude, and a log scale highlights the progression from calm to aggressive behaviors. The Very Aggressive cluster stands out as a high-impact outlier.
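
The behavioral clustering that Figure 5 visualizes is described as offline $k$-means with five categories. The sketch below shows plain Lloyd-style $k$-means on small feature vectors. It is a generic textbook version, not the authors' pipeline: the function names, the optional `init` parameter, and the use of unscaled Euclidean distance are assumptions, and the cluster indices carry no behavior names.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(
    points: List[Point],
    k: int = 5,
    init: Optional[List[Point]] = None,
    iters: int = 100,
    seed: int = 0,
) -> Tuple[List[List[float]], List[int]]:
    """Lloyd's algorithm: alternate assignment and centroid update.

    Returns (centroids, labels). If `init` is omitted, k distinct input
    points are sampled as starting centroids using `seed`.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    if iters < 1:
        raise ValueError("iters must be at least 1")
    if init is None:
        init = random.Random(seed).sample(list(points), k)
    elif len(init) != k:
        raise ValueError("init must contain exactly k centroids")

    centroids = [list(c) for c in init]
    labels: Optional[List[int]] = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels
```

In practice the input rows would be per-window feature vectors, and features on very different scales would usually be standardized first so that no single feature dominates the distance.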