Towards Accurate and Efficient Document Analytics with Large Language Models

Yiming Lin; Madelon Hulsebos; Ruiying Ma; Shreya Shankar; Sepanta Zeigham; Aditya G. Parameswaran; Eugene Wu


TL;DR

ZenDB supports ad-hoc analytics over unstructured but templatized documents by exploiting their shared semantic structure, which it extracts as Semantic Hierarchical Trees (SHTs). It introduces a data model (SHT Table, user-defined DTables, and System Tables) and a minimal SQL-like query language over document-derived entities, together with a cost-aware query engine that uses tree-based summaries to reduce LLM usage while preserving accuracy. Across three real-world datasets, ZenDB achieves up to 30% cost savings relative to LLM-based baselines while maintaining or improving accuracy, and exceeds RAG-based baselines by up to 61% in precision and 80% in recall at a marginally higher cost. Overall, ZenDB offers a framework for accurate, efficient SQL querying over large collections of templatized documents with minimal upfront labeling.

Abstract

Unstructured data formats account for over 80% of the data currently stored, and extracting value from such formats remains a considerable challenge. In particular, current approaches for managing unstructured documents do not support ad-hoc analytical queries on document collections. Moreover, Large Language Models (LLMs) directly applied to the documents themselves, or on portions of documents through a process of Retrieval-Augmented Generation (RAG), fail to provide high accuracy query results, and in the LLM-only case, additionally incur high costs. Since many unstructured documents in a collection often follow similar templates that impart a common semantic structure, we introduce ZenDB, a document analytics system that leverages this semantic structure, coupled with LLMs, to answer ad-hoc SQL queries on document collections. ZenDB efficiently extracts semantic hierarchical structures from such templatized documents, and introduces a novel query engine that leverages these structures for accurate and cost-effective query execution. Users can impose a schema on their documents, and query it, all via SQL. Extensive experiments on three real-world document collections demonstrate ZenDB's benefits, achieving up to 30% cost savings compared to LLM-based baselines, while maintaining or improving accuracy, and surpassing RAG-based baselines by up to 61% in precision and 80% in recall, at a marginally higher cost.
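To make the idea of a semantic hierarchical structure concrete, the following is a minimal sketch, not ZenDB's actual implementation or API. It assumes a document can be represented as a tree of headed sections, and it uses plain substring matching on headers as a stand-in for the cheap filtering step that would narrow down which sections are passed to an LLM. The names `SHTNode` and `candidate_sections`, and the example section titles, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class SHTNode:
    """One section of a document: a header, its own text, and child sections.

    Hypothetical structure for illustration; not ZenDB's data model.
    """
    header: str
    text: str = ""
    children: List["SHTNode"] = field(default_factory=list)

    def walk(self) -> Iterator["SHTNode"]:
        """Yield this node and all descendants in pre-order (document order)."""
        yield self
        for child in self.children:
            yield from child.walk()


def candidate_sections(root: SHTNode, keyword: str) -> List[SHTNode]:
    """Return sections whose header mentions the keyword.

    Assumption: substring matching stands in for a cheap filter, so that
    only these sections (not the whole document) would be sent to an LLM.
    """
    needle = keyword.lower()
    return [node for node in root.walk() if needle in node.header.lower()]


# A made-up document tree; the section titles are invented examples.
doc = SHTNode("Civic Agenda", children=[
    SHTNode("Capital Improvement Projects", children=[
        SHTNode("Project A", "Begin construction in June."),
        SHTNode("Project B", "Design phase."),
    ]),
    SHTNode("Public Comments", "None."),
])

hits = candidate_sections(doc, "project")
print([node.header for node in hits])
```

In this sketch, a query about projects touches three of the five sections, which illustrates how a hierarchical structure could limit the text an LLM has to read.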


Paper Structure (24 sections, 1 theorem, 14 figures, 7 tables, 3 algorithms)


Introduction
User Workflow with ZenDB
Semantic Hierarchical Tree
Preliminaries
SHT Construction on a Single Document
SHT Construction across Documents
Data Model and Query Language
Data Model Definition
SHT Table
User-defined DTables
System-Defined Tables
Query Language
Table Population
Query Engine
Logical Query Plan
...and 9 more sections

Key Result

Theorem 1

If the true SHT for a document $D$ is well-formatted, and if an LLM can correctly identify non-headers, then oracle_gen($D$) outputs the true SHT.

Figures (14)

Figure 1: Civic Agenda Document and Semantic Structures.
Figure 2: Understanding the differences between ZenDB, LLMs and RAG.
Figure 3: Templatized Documents: Scientific Papers, Notice of Violations, Job Descriptions.
Figure 4: User Workflow with ZenDB.
Figure 5: Creating the Projects Table and Adding Attributes.
...and 9 more figures

