Towards Accurate and Efficient Document Analytics with Large Language Models
Yiming Lin, Madelon Hulsebos, Ruiying Ma, Shreya Shankar, Sepanta Zeighami, Aditya G. Parameswaran, Eugene Wu
TL;DR
ZenDB tackles ad-hoc analytics over unstructured, templatized documents by exploiting the semantic structure those templates impart, which it extracts as Semantic Hierarchical Trees (SHTs). It presents a data model (SHT Table, DTables, System Tables) and a minimal SQL-like query language that operates over document-derived entities, along with a cost-aware query engine that uses tree-based summaries to minimize LLM usage while preserving accuracy. Across three real-world datasets, the approach achieves up to 30% cost savings relative to LLM-only baselines while maintaining or improving accuracy, and surpasses RAG baselines by up to 61% in precision and 80% in recall at a marginally higher cost. Overall, ZenDB offers a scalable framework for accurate, efficient SQL querying over large collections of templatized documents with minimal upfront labeling.
Abstract
Unstructured data formats account for over 80% of the data currently stored, and extracting value from such formats remains a considerable challenge. In particular, current approaches for managing unstructured documents do not support ad-hoc analytical queries on document collections. Moreover, Large Language Models (LLMs) directly applied to the documents themselves, or on portions of documents through a process of Retrieval-Augmented Generation (RAG), fail to provide high accuracy query results, and in the LLM-only case, additionally incur high costs. Since many unstructured documents in a collection often follow similar templates that impart a common semantic structure, we introduce ZenDB, a document analytics system that leverages this semantic structure, coupled with LLMs, to answer ad-hoc SQL queries on document collections. ZenDB efficiently extracts semantic hierarchical structures from such templatized documents, and introduces a novel query engine that leverages these structures for accurate and cost-effective query execution. Users can impose a schema on their documents, and query it, all via SQL. Extensive experiments on three real-world document collections demonstrate ZenDB's benefits, achieving up to 30% cost savings compared to LLM-based baselines, while maintaining or improving accuracy, and surpassing RAG-based baselines by up to 61% in precision and 80% in recall, at a marginally higher cost.
