Large Language Models for Page Stream Segmentation

Hunter Heidenreich, Ratish Dalvi, Rohith Mukku, Nikhil Verma, Neven Pičuljan

TL;DR

The paper tackles Page Stream Segmentation (PSS) by introducing TABME++, a public benchmark with commercial OCR annotations to enable realistic evaluation. It demonstrates that decoder-based large language models (LLMs) fine-tuned with parameter-efficient methods outperform smaller multimodal encoders and traditional baselines, while underscoring the critical impact of OCR quality on segmentation accuracy. By analyzing dataset characteristics and sampling biases, it shows TABME++ aligns more closely with internal data distributions, enhancing realism for PSS research. The work suggests that decoder-focused approaches hold strong practical potential for scalable document processing, while also highlighting avenues for leveraging multimodality and improving OCR robustness in future studies.

Abstract

Page Stream Segmentation (PSS) is an essential prerequisite for automated document processing at scale. However, research progress has been limited by the absence of realistic public benchmarks. This paper works towards addressing this gap by introducing TABME++, an enhanced benchmark featuring commercial Optical Character Recognition (OCR) annotations. We evaluate the performance of large language models (LLMs) on PSS, focusing on decoder-based models fine-tuned with parameter-efficient methods. Our results show that decoder-based LLMs outperform smaller multimodal encoders. Through a review of existing PSS research and datasets, we identify key challenges and advancements in the field. Our findings highlight the key importance of robust OCR, providing valuable insights for the development of more effective document processing systems.

Paper Structure (28 sections, 3 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Example page (document ID: ffyw0199, page ID: 7) illustrating the importance of high-quality OCR. The left side shows the page image, while the right side compares the original OCR (top) with the improved OCR (bottom). Text is projected in 2D to preserve layout, which benefits LLM processing (wang_layout_2023, li_large_2024, bayani_testing_2024).
  • Figure 2: Sample efficiency of decoder-based LLMs, demonstrating rapid convergence within the first 1,000 updates. Validation metrics for weights after 5,000 updates are presented, indicating robust performance early in training. The best in each column is highlighted in bold.
  • Figure 3: Kernel density estimate (KDE) of the number of pages per document across datasets. The shift from Tobacco800 to TABME highlights a more pronounced bimodal distribution, aligning with our internal dataset but with generally shorter documents compared to TABME, as reflected in the distribution's right-hand tail.
  • Figure 4: Kernel density estimate (KDE) of the number of pages per stream, showing that TABME features longer streams than those found in our internal dataset.
  • Figure 5: Kernel density estimate (KDE) of the number of documents per stream. TABME tends to have more documents per stream compared to our internal dataset. This suggests a potential need for greater variability in document length when synthetically generating streams or adjusting the Poisson distribution parameters.
  • ...and 2 more figures
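
The caption of Figure 5 mentions adjusting Poisson distribution parameters when generating streams synthetically. The following is a minimal sketch of that idea, not the authors' pipeline. It assumes a stream is built by drawing a document count from a Poisson distribution, sampling that many documents, and concatenating their pages. It also assumes PSS labels are binary, with 1 marking the first page of each document. The names `sample_poisson`, `build_stream`, and the rate parameter `lam` are hypothetical and do not come from the paper.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's method (suitable for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def build_stream(documents, lam, rng):
    """Concatenate randomly chosen documents into one page stream.

    documents: list of non-empty lists of page identifiers (assumed layout).
    Returns (pages, labels), where labels[i] == 1 if page i starts a new document.
    """
    # Clamp the draw so a stream holds at least one document and no more
    # documents than are available (an assumption of this sketch).
    n_docs = max(1, min(len(documents), sample_poisson(lam, rng)))
    chosen = rng.sample(documents, n_docs)

    pages, labels = [], []
    for doc in chosen:
        for i, page in enumerate(doc):
            pages.append(page)
            labels.append(1 if i == 0 else 0)
    return pages, labels


if __name__ == "__main__":
    rng = random.Random(0)
    docs = [["a1", "a2"], ["b1"], ["c1", "c2", "c3"]]
    pages, labels = build_stream(docs, lam=2.0, rng=rng)
    print(pages)
    print(labels)
```

In this sketch, raising or lowering `lam` shifts the documents-per-stream distribution, which is the quantity Figure 5 compares between TABME and the internal dataset.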