LLM-Driven Online Aggregation for Unstructured Text Analytics

Chao Hui, Weizheng Lu, Yanjie Gao, Lingfeng Xiong, Yunhai Wang, Yueguo Chen

TL;DR

This work proposes OLLA, an LLM-driven online aggregation framework that accelerates semantic processing within relational queries, and introduces a semantic stratified sampling approach that improves data selection and expedites convergence to the ground truth.

Abstract

Large Language Models (LLMs) exhibit strong capabilities in text processing, and recent research has augmented SQL and DataFrame APIs with LLM-powered semantic operators for data analysis. However, LLM-based data processing is hindered by token generation, which is slower than relational query execution. To improve real-time responsiveness, we propose OLLA, an LLM-driven online aggregation framework that accelerates semantic processing within relational queries. In contrast to batch-processing systems that yield results only after the entire dataset is processed, our approach incrementally transforms text into a structured data stream and applies online aggregation to provide progressive output. To improve the online aggregation process, we introduce a semantic stratified sampling approach that improves data selection and speeds up convergence to the ground truth. Evaluations show that OLLA reaches a 1% error bound relative to labeled ground truth in less than 4% of the full-data processing time. It achieves speedups of 1.6$\times$ to 38$\times$ across diverse domains, measured as the ratio of the full-data processing time to the time needed to reach a 5% error bound. We release our code at https://github.com/olla-project/llm-online-agg.git.
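
The progressive output described in the abstract can be illustrated with a minimal sketch of online aggregation for an AVG query. This is not OLLA's implementation: the class OnlineMean, the function progressive_average, and the stand-in extractor fake_llm_extract are hypothetical names introduced here, and the confidence interval uses a plain normal approximation over uniformly shuffled rows with no finite-population correction.

```python
import math
import random


class OnlineMean:
    """Running mean (Welford's method) with a normal-approximation interval."""

    def __init__(self, z=1.96):
        self.z = z        # z-score for a ~95% interval (assumed choice)
        self.n = 0        # rows processed so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def half_width(self):
        # The interval is undefined until at least two rows have been seen.
        if self.n < 2:
            return math.inf
        sample_var = self.m2 / (self.n - 1)
        return self.z * math.sqrt(sample_var / self.n)


def fake_llm_extract(text):
    # Hypothetical stand-in for an LLM call that turns text into a number.
    return float(len(text.split()))


def progressive_average(texts, seed=0):
    """Yield (rows_seen, estimate, half_width) after each processed row."""
    rng = random.Random(seed)
    order = list(texts)
    rng.shuffle(order)  # a random order makes each prefix a uniform sample
    agg = OnlineMean()
    for text in order:
        agg.update(fake_llm_extract(text))
        yield agg.n, agg.mean, agg.half_width()
```

A caller can stop consuming the generator as soon as the half-width falls below a target error bound, which is the point at which a batch system would still be processing the remaining rows.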

Paper Structure (23 sections, 11 equations, 11 figures, 2 tables, 1 algorithm)


Figures (11)

  • Figure 1: OLLA system architecture: unstructured data is processed by the LLM module to produce streaming structured data, which is then incrementally aggregated by the online aggregation engine.
  • Figure 2: Three types of user queries.
  • Figure 3: The adjustment process that enforces homogeneity within each stratum and heterogeneity across strata.
  • Figure 4: Convergence of accuracy over time for online aggregation. Accuracy is measured as the absolute error between the streaming aggregate result and the ground truth. Left: average like_count on BBC News. Right: average view_count on BBC News.
  • Figure 5: Convergence of confidence intervals for representative query types. The red percentage is the time our method takes to reach a 5% error bound, divided by the total batch execution time. From left to right: SELECT: average age in resumes on Chinese Resume (CR); SELECT: average total_price of documents of type Invoices on Company Documents (CD); WHERE: average length of positive Movie reviews; GROUP BY: proportion of Neutral Movie reviews.
  • ...and 6 more figures
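
The Figure 3 caption refers to strata that are homogeneous within and heterogeneous across. A minimal sketch of the textbook stratified mean estimator shows why that property matters: the interval width depends only on within-stratum variance. This is a generic illustration, not the paper's semantic stratification procedure. The function stratified_estimate, its sample_frac parameter, and the assumption that each stratum already holds extracted numeric values are all hypothetical.

```python
import math
import random
from statistics import mean, variance


def stratified_estimate(strata, sample_frac, seed=0, z=1.96):
    """Estimate a population mean from per-stratum samples.

    strata: dict mapping a stratum name to a list of numeric values
            (assumed here to be already extracted from text).
    sample_frac: fraction of each stratum to sample (proportional allocation).
    Returns (estimate, half_width) using a normal approximation.
    """
    rng = random.Random(seed)
    total = sum(len(values) for values in strata.values())
    estimate, var = 0.0, 0.0
    for values in strata.values():
        # Take at least two rows per stratum when possible, never more than exist.
        n_h = min(len(values), max(2, round(sample_frac * len(values))))
        sample = rng.sample(values, n_h)
        weight = len(values) / total  # the stratum's share of the population
        estimate += weight * mean(sample)
        s2 = variance(sample) if n_h > 1 else 0.0
        var += weight * weight * s2 / n_h
    return estimate, z * math.sqrt(var)
```

When every stratum is internally uniform, the variance term vanishes and a small sample per stratum already recovers the population mean, which is the intuition behind grouping semantically similar texts before sampling.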