Financial Table Extraction in Image Documents
William Watson, Bo Liu
TL;DR
This work tackles the problem of extracting structured tabular data from financial documents rendered as images. It presents an end-to-end pipeline that splits the task into detection (semantic segmentation, with a classifier for empty pages and a 3-class table/separator model), extraction (Tesseract OCR with orientation and noise-reduction preprocessing), and alignment (union-find-based column assembly informed by OCR metadata). The approach demonstrates strong detection performance, uses separators to handle closely spaced tables effectively, and is evaluated across multiple alignment models (including LSTM and Transformer variants), with the Local Table LSTM emerging as the top performer. The results indicate practical applicability for converting image-based financial tables into CSV/Excel/LaTeX outputs, enabling indexing, auditing, and integration into downstream financial workflows.
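To make the alignment step more concrete, the following is a minimal sketch of union-find-based column assembly. It is not the authors' implementation: the `Box` layout, the `overlaps_horizontally` rule, and the `group_columns` helper are all assumptions chosen for illustration, standing in for whatever OCR metadata and merge criteria the actual pipeline uses. Here two word boxes are merged into the same column whenever their horizontal extents overlap.

```python
from typing import Dict, List, Tuple

# Hypothetical word box: (text, left, top, width, height).
# This mimics the kind of bounding-box metadata an OCR engine reports;
# the exact fields used by the paper's pipeline are an assumption.
Box = Tuple[str, int, int, int, int]


class UnionFind:
    """Disjoint-set structure over the indices 0..n-1."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, i: int) -> int:
        # Walk to the root, shortening the path along the way (path halving).
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def union(self, a: int, b: int) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def overlaps_horizontally(a: Box, b: Box) -> bool:
    """Assumed merge rule: the x-ranges of the two boxes intersect."""
    a_left, a_right = a[1], a[1] + a[3]
    b_left, b_right = b[1], b[1] + b[3]
    return a_left < b_right and b_left < a_right


def group_columns(boxes: List[Box]) -> List[List[str]]:
    """Group word boxes into columns, ordered left to right, top to bottom."""
    uf = UnionFind(len(boxes))
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps_horizontally(boxes[i], boxes[j]):
                uf.union(i, j)

    # Collect boxes by the root of their set; each set is one column.
    columns: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        columns.setdefault(uf.find(i), []).append(box)

    # Sort columns by their leftmost edge, and cells by their top edge.
    ordered = sorted(columns.values(), key=lambda col: min(b[1] for b in col))
    return [[b[0] for b in sorted(col, key=lambda b: b[2])] for col in ordered]
```

A simple overlap rule like this would merge neighbouring columns whenever a wide header spans both of them, so a practical system would likely need a stricter or learned merge criterion.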
Abstract
Table extraction has long been a pervasive problem in financial services. It is more challenging in the image domain, where content is locked behind a cumbersome pixel format. Luckily, advances in deep learning for image segmentation, OCR, and sequence modeling provide the necessary heavy lifting to achieve impressive results. This paper presents an end-to-end pipeline for identifying, extracting, and transcribing tabular content in image documents while retaining the original spatial relations with high fidelity.
