DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting

Md Mofijul Islam, Md Sirajus Salekin, Nivedha Balakrishnan, Vincil C. Bishop, Niharika Jain, Spencer Romo, Bob Strahan, Boyi Xie, Diego A. Socolinsky

TL;DR

DocSplit is the first comprehensive benchmark dataset for document packet splitting. It comes with novel evaluation metrics for assessing how well large language models split document packets, and together the datasets and metrics provide a systematic framework for advancing the document understanding capabilities essential for legal, financial, healthcare, and other document-intensive domains.

Abstract

Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into individual units, remains largely unaddressed. We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and maintain correct page ordering within a document packet. The benchmark addresses real-world challenges, including out-of-order pages, interleaved documents, and documents lacking clear demarcations. We conduct extensive experiments evaluating multimodal LLMs on our datasets, revealing significant performance gaps in current models' ability to handle complex document splitting tasks. The DocSplit benchmark datasets and proposed novel evaluation metrics provide a systematic framework for advancing document understanding capabilities essential for legal, financial, healthcare, and other document-intensive domains. We release the datasets to facilitate future research in document packet processing.

Paper Structure (28 sections, 5 figures, 14 tables)

Figures (5)

  • Figure 1: The document packet splitting task, where an unordered sequence of pages from multiple documents (shown at top) must be correctly grouped and reordered into their constituent documents (shown at bottom). The colors represent individual documents, while the numbers indicate the original page indices from the top input sequence, demonstrating how pages from the same document can be shuffled and interleaved in the input packet.
  • Figure 2: Test Studio interface with pre-populated benchmark datasets for document processing evaluation.
  • Figure 3: Test Studio evaluation dashboard displaying benchmark results with performance metrics and score distribution.
  • Figure 4: Test Studio cost estimation interface displaying itemized costs across pipeline stages (Classification, Extraction, OCR, Granular Assessment, Summarization) with per-API token and invocation tracking.
  • Figure 5: Custom dataset import interface displaying the required structure with input/ folder for packet PDFs and baseline/ folder containing ground truth annotations in result.json files.
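
The grouping and reordering step described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes that a model has already assigned each input page a document label and a position within that document. The names `PagePrediction`, `doc_id`, `page_in_doc`, and `split_packet` are hypothetical and chosen for illustration only; they are not taken from the DocSplit benchmark or its annotation format.

```python
from collections import defaultdict
from typing import NamedTuple


class PagePrediction(NamedTuple):
    """One page of the input packet with hypothetical model predictions."""
    packet_index: int  # position of the page in the input packet (0-based)
    doc_id: str        # assumed label of the document the page belongs to
    page_in_doc: int   # assumed predicted position inside that document


def split_packet(predictions):
    """Group pages by document and restore each document's page order.

    Returns a dict mapping each document label to the list of input
    packet indices, ordered by the predicted in-document position.
    """
    groups = defaultdict(list)
    for page in predictions:
        groups[page.doc_id].append(page)
    return {
        doc_id: [p.packet_index for p in sorted(pages, key=lambda p: p.page_in_doc)]
        for doc_id, pages in groups.items()
    }


# A four-page packet in which two documents are interleaved and shuffled.
packet = [
    PagePrediction(0, "invoice", 2),
    PagePrediction(1, "contract", 1),
    PagePrediction(2, "invoice", 1),
    PagePrediction(3, "contract", 2),
]
result = split_packet(packet)
print(result)
```

In this toy packet, the invoice pages sit at input positions 0 and 2 in reversed order, so the sketch returns them as `[2, 0]`, while the contract pages come back as `[1, 3]`. The sketch covers only the final regrouping; producing the per-page labels is the hard part that the benchmark evaluates.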