Unstructured Data Analysis using LLMs: A Comprehensive Benchmark

Qiyan Deng, Jianhui Li, Chengliang Chai, Jinqi Liu, Junzhi She, Kaisen Jin, Zhaoze Sun, Yuhao Deng, Jia Yuan, Ye Yuan, Guoren Wang, Lei Cao

TL;DR

This work addresses the absence of a large-scale, diverse benchmark for LLM-powered unstructured data analysis (UDA) systems. It introduces UDA-Bench, featuring five extensive datasets, 147 ground-truth attributes, and 240 diverse queries designed to probe data processing, interface flexibility, and multi-layer optimization. The authors implement and evaluate seven representative UDA systems, analyzing accuracy, cost, and latency, and dissect the impact of logical and physical optimization strategies as well as ablation studies on embeddings, chunking, and LLM choices. The benchmark provides a rigorous, reproducible framework that reveals where current approaches excel or falter, and offers concrete directions for improving end-to-end UDA performance and efficiency.

Abstract

The explosion of unstructured data presents immense analytical value. Leveraging the remarkable capability of large language models (LLMs) to extract the attributes of structured tables from unstructured data, researchers are developing LLM-powered data systems that let users analyze unstructured documents as if working with a database. These unstructured data analysis (UDA) systems differ significantly in query interfaces, query optimization strategies, and operator implementations, making it unclear which performs best in which scenario. Unfortunately, no comprehensive benchmark exists that offers high-quality, large-volume, and diverse datasets together with a rich query workload to thoroughly evaluate such systems. To fill this gap, we present UDA-Bench, the first benchmark for unstructured data analysis that meets all of the above requirements. Specifically, we organized a team of 30 graduate students who spent more than 10,000 hours in total curating 5 datasets from various domains and constructing a relational database view of these datasets by manual annotation. These relational databases can be used as ground truth to evaluate any of these UDA systems despite their differences in programming interfaces. Moreover, we design diverse queries to analyze the attributes defined in the database schema, covering different types of analytical operators with varying selectivities and complexities. We conduct an in-depth analysis of the key building blocks of existing UDA systems: query interface, query optimization, operator design, and data processing. We run exhaustive experiments over the benchmark to fully evaluate these systems and different techniques with respect to these building blocks.
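As an illustration of how a manually annotated relational view can serve as ground truth regardless of a system's interface, the following is a minimal sketch, not the benchmark's actual evaluation code. It assumes each query result can be reduced to a set of tuples and scores it with a set-based F1; the function name `f1_against_ground_truth` and the sample rows are hypothetical.

```python
from typing import Iterable, Tuple

# A result row is modeled as a plain tuple of attribute values (an assumption).
Row = Tuple[object, ...]


def f1_against_ground_truth(predicted: Iterable[Row], truth: Iterable[Row]) -> float:
    """Set-based F1 between a system's result rows and the ground-truth rows."""
    pred_set = set(predicted)
    truth_set = set(truth)

    # Both empty: the system correctly returned nothing.
    if not pred_set and not truth_set:
        return 1.0

    true_positives = len(pred_set & truth_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(truth_set) if truth_set else 0.0

    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical rows: (document id, extracted attribute value).
truth_rows = [("doc1", "A"), ("doc2", "B"), ("doc3", "C")]
system_rows = [("doc1", "A"), ("doc2", "X")]

score = f1_against_ground_truth(system_rows, truth_rows)
print(f"F1 = {score:.2f}")
```

Because the comparison happens on result tuples rather than on system internals, the same scoring function applies whether a system exposes a SQL-like interface or a Python API.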

Paper Structure

This paper contains 21 sections, 14 figures, and 6 tables.

Figures (14)

  • Figure 1: Architecture of Unstructured Data Analysis System.
  • Figure 2: F1-score Comparison of Queries with only Extraction.
  • Figure 3: F1-score Comparison of Queries with Filter.
  • Figure 4: Evaluation of Filter Reordering.
  • Figure 5: Evaluation of Filter Pushdown.
  • ...and 9 more figures

Theorems & Definitions (1)

  • Example 1