DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing

Shreya Shankar, Tristan Chambers, Tarak Shah, Aditya G. Parameswaran, Eugene Wu

TL;DR

DocETL presents an agent-driven, declarative system for optimizing complex document processing with LLMs. It introduces a YAML-based DSL, novel rewrite directives, and an optimizer with generation and validation agents that iteratively decompose and evaluate pipelines. Across legal, gaming, declassified, and biomedical tasks, DocETL achieves 21–80% improvements in task-specific accuracy over baselines, demonstrating practical gains even with the non-deterministic nature of LLMs. The work showcases how structured decomposition, contextual augmentation, and rigorous plan validation enable robust, scalable processing of long and heterogeneous documents.

Abstract

Analyzing unstructured data has been a persistent challenge in data processing. Large Language Models (LLMs) have shown promise in this regard, leading to recent proposals for declarative frameworks for LLM-powered processing of unstructured data. However, these frameworks focus on reducing cost when executing user-specified operations using LLMs, rather than improving accuracy, executing most operations as-is (in a single LLM call). This is problematic for complex tasks and data, where LLM outputs for user-defined operations are often inaccurate, even with optimized prompts. For example, an LLM may struggle to identify all instances of specific clauses, like force majeure or indemnification, in lengthy legal documents, requiring decomposition of the data, the task, or both. We present DocETL, a system that optimizes complex document processing pipelines, while accounting for LLM shortcomings. DocETL offers a declarative interface for users to define such pipelines and uses an agent-based approach to automatically optimize them, leveraging novel agent-based rewrites (that we call rewrite directives), as well as an optimization and evaluation framework. We introduce (i) logical rewriting of pipelines, tailored for LLM-based tasks, (ii) an agent-guided plan evaluation mechanism that synthesizes and orchestrates task-specific validation prompts, and (iii) an optimization algorithm that efficiently finds promising plans, considering the latencies of agent-based plan generation and evaluation. Our evaluation on four different unstructured document analysis tasks demonstrates that DocETL finds plans with outputs that are 25 to 80% more accurate than well-engineered baselines, addressing a critical gap in unstructured data analysis. DocETL is open-source at docetl.org, and as of March 2025, has amassed over 1.7k GitHub stars, with users spanning a variety of domains.

Paper Structure

This paper contains 54 sections, 8 equations, 6 figures, 7 tables, 3 algorithms.

Figures (6)

  • Figure 1: Optimization for a pipeline designed to accomplish the task in \ref{ex:journalist-task}. The diagram illustrates the system mid-optimization of the initial map operation. DocETL employs LLMs to synthesize new plans using novel rewrite directives. The process begins with an LLM verifier determining if an operation is sufficiently optimized. If not, rewriting continues. Notably, when a new operation is synthesized as part of a rewrite, it undergoes immediate opportunistic optimization, as shown by the nested "Apply Rewrites (Agent)" rectangles.
  • Figure 2: Reduce's iterative folding over 3 batches of documents. Each batch takes several documents and the current scratchpad as input (left), and updates the mention counts in the scratchpad and accumulated output of entities mentioned multiple times (right).
  • Figure 3: Split-Gather Pipeline: Illustration of processing a single long document. The split operation divides a long document into manageable chunks. The gather operation then augments each chunk with relevant context from peripheral chunks. The image demonstrates three different ways of rendering chunk 3 (i.e., three different gather configurations): (i) including fractional parts of surrounding chunks, (ii) including the full content of the first chunk, and (iii) including summaries of all previous chunks.
  • Figure 4: Gleaning process with $k=1$ round of refinement. An LLM initially extracts information from an input transcript, and Officer Y is missing from the output. A validation agent (LLM-powered) identifies this omission and provides feedback. The original LLM incorporates this feedback in a second pass (shown with purple arrows), resulting in a more complete final output that includes both Officer X and Officer Y.
  • Figure 5: Cost vs. metrics (precision, recall, and F1) for 30 different LLM-generated implementations of rewrite directives applied to the legal contract analysis task. Each point represents a distinct plan implementation, colored by directive type: isolated projections (\ref{eq:parallelproj}), chaining projections (\ref{eq:chainproj}), or gleaning (\ref{eq:gleaningrewrite}). The DocETL unoptimized baseline and optimized plan from \ref{sec:evaluation-legal} are shown with dashed lines for reference, though not generated in this experiment. Due to the optimizer's nondeterministic nature, some plans in this experiment achieved higher metrics than the original optimized plan.
  • ...and 1 more figure
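
The split-gather idea described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DocETL's implementation or API: the function names (split, summarize, gather_neighbors, gather_first_chunk, gather_summaries), the character-based chunking, the separator strings, and the truncation-based "summary" are all hypothetical stand-ins chosen so the example runs without an LLM.

```python
from typing import List


def split(document: str, chunk_size: int) -> List[str]:
    """Divide a long document into fixed-size character chunks (assumed scheme)."""
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]


def summarize(chunk: str, max_chars: int) -> str:
    """Hypothetical stand-in for an LLM summary: keep the first max_chars characters."""
    return chunk[:max_chars]


def gather_neighbors(chunks: List[str], idx: int, fraction: float = 0.5) -> str:
    """Configuration (i): add fractional parts of the surrounding chunks."""
    parts = []
    if idx > 0:
        prev = chunks[idx - 1]
        keep = int(len(prev) * fraction)
        parts.append(prev[len(prev) - keep:])  # tail of the previous chunk
    parts.append(chunks[idx])
    if idx + 1 < len(chunks):
        nxt = chunks[idx + 1]
        parts.append(nxt[:int(len(nxt) * fraction)])  # head of the next chunk
    return "".join(parts)


def gather_first_chunk(chunks: List[str], idx: int) -> str:
    """Configuration (ii): add the full content of the first chunk."""
    if idx == 0:
        return chunks[0]
    return chunks[0] + "\n---\n" + chunks[idx]


def gather_summaries(chunks: List[str], idx: int, max_chars: int = 2) -> str:
    """Configuration (iii): add summaries of all previous chunks."""
    summaries = [summarize(c, max_chars) for c in chunks[:idx]]
    return "\n".join(summaries + [chunks[idx]])


# Toy document with four 4-character chunks; "chunk 3" in the caption is index 2.
chunks = split("AAAABBBBCCCCDDDD", chunk_size=4)
rendered_neighbors = gather_neighbors(chunks, 2)
rendered_first = gather_first_chunk(chunks, 2)
rendered_summaries = gather_summaries(chunks, 2)
```

Each gather function returns the text that a downstream per-chunk operation would receive in place of the bare chunk. In a real pipeline the summaries would come from an LLM call rather than truncation, and the choice among the three configurations would depend on the task.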