Towards Accurate and Efficient Document Analytics with Large Language Models

Yiming Lin, Madelon Hulsebos, Ruiying Ma, Shreya Shankar, Sepanta Zeigham, Aditya G. Parameswaran, Eugene Wu

TL;DR

ZenDB supports ad-hoc analytics over unstructured but templatized documents by extracting their shared semantic structure as Semantic Hierarchical Trees (SHTs). It introduces a data model (SHT Table, DTables, System Tables) and a minimal SQL-like query language over document-derived entities, together with a cost-aware query engine that uses tree-based summaries to reduce LLM usage while preserving accuracy. Across three real-world datasets, ZenDB cuts cost by up to 30% relative to LLM-based baselines while maintaining or improving accuracy, and exceeds RAG-based baselines by up to 61% in precision and 80% in recall at a marginally higher cost. Overall, ZenDB offers a scalable framework for accurate, efficient SQL querying over large collections of templatized documents with minimal upfront labeling.

Abstract

Unstructured data formats account for over 80% of the data currently stored, and extracting value from such formats remains a considerable challenge. In particular, current approaches for managing unstructured documents do not support ad-hoc analytical queries on document collections. Moreover, Large Language Models (LLMs) directly applied to the documents themselves, or on portions of documents through a process of Retrieval-Augmented Generation (RAG), fail to provide high accuracy query results, and in the LLM-only case, additionally incur high costs. Since many unstructured documents in a collection often follow similar templates that impart a common semantic structure, we introduce ZenDB, a document analytics system that leverages this semantic structure, coupled with LLMs, to answer ad-hoc SQL queries on document collections. ZenDB efficiently extracts semantic hierarchical structures from such templatized documents, and introduces a novel query engine that leverages these structures for accurate and cost-effective query execution. Users can impose a schema on their documents, and query it, all via SQL. Extensive experiments on three real-world document collections demonstrate ZenDB's benefits, achieving up to 30% cost savings compared to LLM-based baselines, while maintaining or improving accuracy, and surpassing RAG-based baselines by up to 61% in precision and 80% in recall, at a marginally higher cost.
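
To make the idea of a semantic hierarchical structure more concrete, here is a minimal Python sketch of a tree of document sections in which each node carries a short summary, and a traversal that skips whole subtrees whose summary looks irrelevant before any expensive LLM call would be made. This is an illustration under stated assumptions, not ZenDB's implementation: the names `SHTNode`, `candidate_nodes`, and `keyword_filter` are hypothetical, the toy agenda content is made up, and the keyword check is only a stand-in for whatever relevance test a real system would use.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class SHTNode:
    """One node of a toy semantic hierarchical tree (hypothetical structure)."""
    title: str
    summary: str                  # short description of this node's subtree
    text: str = ""                # body text directly under this header, if any
    children: List["SHTNode"] = field(default_factory=list)


def candidate_nodes(
    node: SHTNode, is_relevant: Callable[[str], bool]
) -> Iterator[SHTNode]:
    """Yield nodes with body text, pruning subtrees whose summary is irrelevant."""
    if not is_relevant(node.summary):
        return  # skip this node and everything beneath it
    if node.text:
        yield node
    for child in node.children:
        yield from candidate_nodes(child, is_relevant)


def keyword_filter(keyword: str) -> Callable[[str], bool]:
    """Cheap stand-in (assumption) for a summary-level relevance check."""
    return lambda summary: keyword.lower() in summary.lower()


# Made-up toy document, loosely shaped like a civic agenda.
agenda = SHTNode(
    title="City Council Agenda",
    summary="Agenda covering capital projects and public comment",
    children=[
        SHTNode(
            title="Capital Projects",
            summary="Two capital projects with budgets",
            children=[
                SHTNode("Road Repaving", "Repaving project, budget and dates", "body text A"),
                SHTNode("Library Roof", "Roof replacement project", "body text B"),
            ],
        ),
        SHTNode("Public Comment", "Speaker list and time limits", "body text C"),
    ],
)

selected = [n.title for n in candidate_nodes(agenda, keyword_filter("project"))]
print(selected)
```

In this toy run only the two project sections would be passed on for further processing; the "Public Comment" subtree is pruned from its summary alone, which is the kind of saving that summary-based pruning is meant to provide.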

Paper Structure

This paper contains 24 sections, 1 theorem, 14 figures, 7 tables, and 3 algorithms.

Key Result

Theorem 1

If the true SHT for a document $D$ is well-formatted, and if an LLM can correctly identify non-headers, then oracle_gen($D$) outputs the true SHT.
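
This summary does not give the oracle_gen procedure or the definition of "well-formatted", so the following Python sketch does not reproduce either. It only illustrates, under simplifying assumptions, why correct non-header identification matters for recovering a hierarchy. It assumes each header candidate already comes with a nesting level, and that an `is_header` callback (a stand-in for the LLM check) rejects text that merely looks like a header. The names `Section` and `build_hierarchy` and the toy data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class Section:
    """A node in a toy document outline (hypothetical structure)."""
    title: str
    level: int
    children: List["Section"] = field(default_factory=list)


def build_hierarchy(
    candidates: Iterable[Tuple[int, str]],
    is_header: Callable[[str], bool],
) -> Section:
    """Build an outline from (level, title) candidates using a stack.

    Candidates rejected by is_header are dropped before they can
    distort the tree. Levels are assumed to be integers >= 1.
    """
    root = Section(title="<document>", level=0)
    stack = [root]
    for level, title in candidates:
        if level < 1:
            raise ValueError("levels must be >= 1")
        if not is_header(title):
            continue  # e.g. bold body text that only resembles a header
        # Pop until the top of the stack is a strict ancestor level.
        while stack[-1].level >= level:
            stack.pop()
        node = Section(title, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Made-up candidates: "TOTAL COST" is formatted like a header but is not one.
candidates = [
    (1, "Agenda"),
    (2, "Projects"),
    (2, "TOTAL COST"),
    (2, "Public Comment"),
]
outline = build_hierarchy(candidates, is_header=lambda t: t != "TOTAL COST")
top = outline.children[0]
print(top.title, [c.title for c in top.children])
```

If the filter wrongly accepted "TOTAL COST", it would appear as a sibling section and the recovered outline would no longer match the intended one, which is the failure mode the theorem's second condition rules out.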

Figures (14)

  • Figure 1: Civic Agenda Document and Semantic Structures.
  • Figure 2: Understanding the differences between ZenDB, LLMs and RAG.
  • Figure 3: Templatized Documents: Scientific Papers, Notice of Violations, Job Descriptions.
  • Figure 4: User Workflow with ZenDB.
  • Figure 5: Creating the Projects Table and Adding Attributes.
  • ...and 9 more figures

Theorems & Definitions (1)

  • Theorem 1