Financial Table Extraction in Image Documents

William Watson, Bo Liu

TL;DR

This work tackles the problem of extracting structured tabular data from financial documents rendered as images. It presents an end-to-end pipeline that splits the task into detection (via semantic segmentation with a classifier for empty pages and a 3-class table/separator model), extraction (Tesseract OCR with orientation and noise reduction preprocessing), and alignment (union-find-based column assembly informed by OCR metadata). The approach demonstrates strong detection performance, effective handling of closely spaced tables using separators, and a comprehensive evaluation across multiple alignment models (including LSTM and Transformer variants) with the Local Table LSTM emerging as the top performer. The results indicate practical applicability for converting image-based financial tables into CSV/Excel/LaTeX outputs, enabling indexing, auditing, and integration into downstream financial workflows.
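As an illustration of the alignment step, here is a minimal sketch of how union-find could group OCR words into columns. It is not the authors' implementation. The `Word` fields, the horizontal-overlap merge rule, and the `assign_columns` helper are all assumptions made for this example; the summary only states that column assembly is union-find-based and informed by OCR metadata.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical OCR word record (field names are assumptions)."""
    text: str
    left: int   # x of the left edge, in pixels
    right: int  # x of the right edge, in pixels
    line: int   # index of the text line the word sits on


class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def assign_columns(words):
    """Return a column index per word, columns ordered left to right.

    Assumed merge rule: two words on different lines belong to the
    same column when their horizontal extents overlap.
    """
    uf = UnionFind(len(words))
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            a, b = words[i], words[j]
            if a.line != b.line and a.left < b.right and b.left < a.right:
                uf.union(i, j)

    # Collect connected components, then order them by leftmost edge.
    groups = {}
    for i in range(len(words)):
        groups.setdefault(uf.find(i), []).append(i)
    ordered = sorted(groups.values(),
                     key=lambda idx: min(words[i].left for i in idx))

    column_of = {}
    for col, idx in enumerate(ordered):
        for i in idx:
            column_of[i] = col
    return [column_of[i] for i in range(len(words))]
```

The pairwise comparison is quadratic in the number of words, which is acceptable for a single table. A real pipeline would likely need a tolerance for ragged cell edges and special handling for headers that span several columns.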

Abstract

Table extraction has long been a pervasive problem in financial services. It is more challenging in the image domain, where content is locked behind a cumbersome pixel format. Luckily, advances in deep learning for image segmentation, OCR, and sequence modeling provide the necessary heavy lifting to achieve impressive results. This paper presents an end-to-end pipeline for identifying, extracting, and transcribing tabular content in image documents while retaining the original spatial relations with high fidelity.

Paper Structure

This paper contains 33 sections, 1 equation, 16 figures, and 2 tables.

Figures (16)

  • Figure 1: The three steps of tabular extraction from image docs. Top: table identification. Middle: content recovery. Bottom: alignment into tabular format.
  • Figure 2: The separator trick: when tables are close to each other, generate a third class, separator (light green), from annotated table boxes (red).
  • Figure 3: Full Success: Border Lines
  • Figure 4: Full Success: No Border Lines
  • Figure 5: Full Success: Inline Table
  • ...and 11 more figures
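
The separator trick from Figure 2 can be illustrated with a small sketch that builds a 3-class training mask from annotated table boxes. This is only one plausible reading of the caption, not the paper's procedure. The `build_mask` function, the `max_gap` threshold, the restriction to vertically stacked tables, and the class ids are all assumptions.

```python
import numpy as np

# Class ids are assumptions chosen for this example.
BACKGROUND, TABLE, SEPARATOR = 0, 1, 2


def build_mask(height, width, boxes, max_gap=20):
    """Build a 3-class mask from table boxes.

    boxes: list of (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    max_gap: hypothetical threshold; vertically stacked tables closer
             than this get a separator band painted between them.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = TABLE

    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            top, bottom = sorted((a, b), key=lambda box: box[1])
            gap = bottom[1] - top[3]
            overlap = min(top[2], bottom[2]) - max(top[0], bottom[0])
            if 0 < gap <= max_gap and overlap > 0:
                x_lo = min(top[0], bottom[0])
                x_hi = max(top[2], bottom[2])
                # Basic slicing returns a view, so this edits mask in place.
                band = mask[top[3]:bottom[1], x_lo:x_hi]
                band[band == BACKGROUND] = SEPARATOR
    return mask
```

Giving the gap its own class lets a segmentation model learn an explicit boundary between neighboring tables, so that their predicted regions are less likely to merge into one blob.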