Large Language Models for Page Stream Segmentation
Hunter Heidenreich, Ratish Dalvi, Rohith Mukku, Nikhil Verma, Neven Pičuljan
TL;DR
The paper tackles Page Stream Segmentation (PSS) by introducing TABME++, a public benchmark with commercial OCR annotations that enables more realistic evaluation. It shows that decoder-based large language models (LLMs) fine-tuned with parameter-efficient methods outperform smaller multimodal encoders and traditional baselines, and that OCR quality has a critical impact on segmentation accuracy. An analysis of dataset characteristics and sampling biases indicates that TABME++ aligns more closely with internal data distributions, making it a more realistic testbed for PSS research. The work suggests that decoder-based approaches have strong practical potential for scalable document processing, and it points to leveraging multimodality and improving OCR robustness as directions for future study.
Abstract
Page Stream Segmentation (PSS) is an essential prerequisite for automated document processing at scale. However, research progress has been limited by the absence of realistic public benchmarks. This paper works towards addressing this gap by introducing TABME++, an enhanced benchmark featuring commercial Optical Character Recognition (OCR) annotations. We evaluate the performance of large language models (LLMs) on PSS, focusing on decoder-based models fine-tuned with parameter-efficient methods. Our results show that decoder-based LLMs outperform smaller multimodal encoders. Through a review of existing PSS research and datasets, we identify key challenges and advancements in the field. Our findings highlight the importance of robust OCR, providing valuable insights for the development of more effective document processing systems.
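To make the task concrete, the following is a minimal sketch of what segmenting a page stream produces. It assumes the common framing of PSS as a per-page binary decision (does this page start a new document?), which the text above does not spell out. The function name `segment_page_stream` and the example flags are hypothetical and are not taken from the paper.

```python
from typing import List


def segment_page_stream(starts_new_doc: List[bool]) -> List[List[int]]:
    """Group page indices into documents from per-page boundary flags.

    Assumption: starts_new_doc[i] is True when page i is predicted to
    begin a new document (for example, by a fine-tuned classifier).
    """
    documents: List[List[int]] = []
    for page_idx, is_start in enumerate(starts_new_doc):
        # The first page always opens a document, whatever its flag says.
        if is_start or not documents:
            documents.append([page_idx])
        else:
            documents[-1].append(page_idx)
    return documents


# Hypothetical predictions for a six-page stream.
flags = [True, False, False, True, True, False]
print(segment_page_stream(flags))  # [[0, 1, 2], [3], [4, 5]]
```

Under this framing, a single wrong boundary flag either merges two documents or splits one, which is one reason errors in the input text, such as poor OCR, can carry through to the final segmentation.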
