DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems

Anni Zou; Wenhao Yu; Hongming Zhang; Kaixin Ma; Deng Cai; Zhuosheng Zhang; Hai Zhao; Dong Yu

TL;DR

DocBench is a standardized benchmark for evaluating LLM-based document reading systems that take a raw document file and a question as input, a real-world setting that earlier benchmarks do not cover. The dataset comprises 229 PDFs and 1,102 QA pairs across five domains and four question types, generated with a mix of GPT-4, GPT-4V, and human annotators and checked through multi-step quality control. Evaluations cover proprietary LLM-based systems and parse-then-read pipelines built on open-source models, and they reveal notable gaps relative to human performance, especially on multi-modal, metadata, and long-document questions. By providing a diverse, real-world testbed and an automatic, human-aligned evaluation protocol, DocBench aims to standardize comparisons and drive progress toward robust, faithful document reading systems.

Abstract

Recently, there has been growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions about the document contents, going beyond simple reading comprehension tasks. Consequently, these systems have been carefully designed to tackle challenges such as file parsing, metadata extraction, multi-modal information understanding and long-context reading. However, no current benchmark evaluates their performance in such scenarios, where a raw file and questions are provided as input and a corresponding response is expected as output. In this paper, we introduce DocBench, a new benchmark designed to evaluate LLM-based document reading systems. Our benchmark is built through a meticulously crafted process that includes the recruitment of human annotators and the generation of synthetic questions. It includes 229 real documents and 1,102 questions, spanning five domains and four major types of questions. We evaluate both proprietary LLM-based systems accessible via web interfaces or APIs, and a parse-then-read pipeline employing open-source LLMs. Our evaluations reveal noticeable gaps between existing LLM-based document reading systems and human performance, underscoring the challenges of developing proficient systems. To summarize, DocBench aims to establish a standardized benchmark for evaluating LLM-based document reading systems under diverse real-world scenarios, thereby guiding future advancements in this research area.
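The input/output setting described in the abstract (a raw file and a question in, a response out) and the parse-then-read pipeline can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the authors' implementation: the QAItem fields, the parse_then_read and accuracy_by_type helpers, the question-type labels, and the toy parser and reader. The exact_match scorer is only a stand-in for the benchmark's own evaluation protocol.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QAItem:
    """One benchmark entry (hypothetical field names)."""
    doc_path: str       # path to the raw document file
    question: str
    reference: str      # reference answer
    question_type: str  # illustrative label, e.g. "metadata"


def parse_then_read(
    item: QAItem,
    parse: Callable[[str], str],
    read: Callable[[str, str], str],
) -> str:
    """Parse the raw file into text, then let a reader answer from that text."""
    document_text = parse(item.doc_path)
    return read(document_text, item.question)


def accuracy_by_type(
    items: List[QAItem],
    answer: Callable[[QAItem], str],
    is_correct: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Run a system on every item and report accuracy per question type."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for item in items:
        totals[item.question_type] += 1
        if is_correct(answer(item), item.reference):
            hits[item.question_type] += 1
    return {qtype: hits[qtype] / totals[qtype] for qtype in totals}


def exact_match(prediction: str, reference: str) -> bool:
    """Stand-in scorer; not the evaluation protocol used by the benchmark."""
    return prediction.strip().lower() == reference.strip().lower()


# Toy stand-ins so the sketch runs without a PDF parser or a language model.
fake_files = {"report.pdf": "Title: Annual Report. Revenue was 10 in 2023."}


def toy_parse(path: str) -> str:
    return fake_files[path]


def toy_read(text: str, question: str) -> str:
    # Answers title questions only; everything else is "unknown".
    return "Annual Report" if "title" in question.lower() else "unknown"


items = [
    QAItem("report.pdf", "What is the title?", "Annual Report", "metadata"),
    QAItem("report.pdf", "What was revenue in 2023?", "10", "text"),
]

scores = accuracy_by_type(
    items,
    lambda item: parse_then_read(item, toy_parse, toy_read),
    exact_match,
)
print(scores)
```

In a real run, toy_parse would be replaced by an actual PDF parser and toy_read by a call to an LLM. Reporting accuracy per question type rather than as a single number is what makes type-specific weaknesses, such as those on metadata questions, visible.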

Paper Structure (24 sections, 7 figures, 7 tables)

Introduction
The DocBench
Dataset Construction
Document Collection
QA-pair Generation
Quality Check
Dataset Statistics
Dataset Analysis
Evaluation Setup
Experiments and Analysis
Experimental Setup
Results and Discussion
Interpreting Multi-modal and Metadata Information
Handling Lengthy Documents
Faithfulness to User-provided Documents
...and 9 more sections

Figures (7)

Figure 1: An example of OpenAI's GPT-4 based document reading system. Unlike standalone LLMs, recent proprietary LLM-based document reading systems employ a carefully designed approach (e.g., file parsing, code execution) to answer user questions related to document contents.
Figure 2: Construction pipeline of DocBench. (a) Document Collection: gathering PDF files from five different domains; (b) QA-pair Generation: creating diverse and comprehensive QA pairs through a combination of LLMs and human effort; (c) Quality Check: ensuring data quality through a multi-step process that includes auto filtering, manual review, and expert curation.
Figure 3: Overview of Questions and Documents: distribution of question token counts (left); distribution of QA pairs per document (middle); distribution of document token counts (right).
Figure 4: Data distribution of DocBench: (a) proportion(%) of various data groups based on four distinct classification criteria; (b) detailed data analysis based on question types.
Figure 5: To address multi-modal questions in DocBench, it is essential to: (i) identify the relevant figure/table (Location); (ii) extract specific data (Extraction); (iii) perform necessary calculations (Calculation). In the first case study, KimiChat fails to locate the figure, Claude-3 retrieves incorrect data, and GPT-4, despite succeeding in the first two steps, struggles with the calculation.
...and 2 more figures
