Financial Table Extraction in Image Documents

William Watson; Bo Liu

TL;DR

This work tackles the problem of extracting structured tabular data from financial documents rendered as images. It presents an end-to-end pipeline that splits the task into detection (via semantic segmentation with a classifier for empty pages and a 3-class table/separator model), extraction (Tesseract OCR with orientation and noise reduction preprocessing), and alignment (union-find-based column assembly informed by OCR metadata). The approach demonstrates strong detection performance, effective handling of closely spaced tables using separators, and a comprehensive evaluation across multiple alignment models (including LSTM and Transformer variants) with the Local Table LSTM emerging as the top performer. The results indicate practical applicability for converting image-based financial tables into CSV/Excel/LaTeX outputs, enabling indexing, auditing, and integration into downstream financial workflows.
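To make the alignment step more concrete, the following is a minimal sketch of how union-find could group OCR word boxes into columns. It is not the authors' implementation: the `(text, left, right)` tuple layout, the `group_into_columns` helper, and the overlap rule (two words share a column when their horizontal extents overlap) are all assumptions chosen for illustration.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure over the indices 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        # Walk up to the root, shortening the path as we go (path halving).
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def group_into_columns(words):
    """Group words into columns.

    `words` is a list of (text, left, right) tuples, a hypothetical
    stand-in for the bounding-box metadata an OCR engine reports.
    Assumed rule: words whose horizontal extents overlap belong to
    the same column.
    """
    uf = UnionFind(len(words))
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            _, l1, r1 = words[i]
            _, l2, r2 = words[j]
            if min(r1, r2) > max(l1, l2):  # extents overlap horizontally
                uf.union(i, j)

    columns = defaultdict(list)
    for idx, (text, left, _) in enumerate(words):
        columns[uf.find(idx)].append((left, text))

    # Order columns left to right by their smallest left edge;
    # within a column, words keep their input (reading) order.
    ordered = sorted(columns.values(), key=lambda col: min(l for l, _ in col))
    return [[text for _, text in col] for col in ordered]


words = [
    ("Revenue", 10, 80),
    ("2019", 120, 160),
    ("Costs", 12, 60),
    ("1,250", 115, 165),
    ("980", 125, 158),
]
result = group_into_columns(words)
```

With the sample input, the row labels end up in one column and the year header with its two values in another. A real pipeline would also need row assignment and tolerance for OCR jitter, which this sketch leaves out.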

Abstract

Table extraction has long been a pervasive problem in financial services. It is more challenging in the image domain, where content is locked behind a cumbersome pixel format. Luckily, advances in deep learning for image segmentation, OCR, and sequence modeling provide the necessary heavy lifting to achieve impressive results. This paper presents an end-to-end pipeline for identifying, extracting, and transcribing tabular content in image documents while retaining the original spatial relations with high fidelity.

Paper Structure (33 sections, 1 equation, 16 figures, 2 tables)

Introduction
Relevant Background
Table Detection
Models
Classification Model
Segmentation Model
Post-processing
Dataset and Labeling
Results
Table Extraction
Tesseract OCR
Preprocessing Pipeline
Automatic Orientation Correction
Morphological Kernel Filtering
Extracted Metadata
...and 18 more sections

Figures (16)

Figure 1: The three steps of tabular extraction from image docs. Top: table identification. Middle: content recovery. Bottom: alignment into tabular format.
Figure 2: The separator trick: when tables are close to each other, generate a third class, separator (light green), from annotated table boxes (red).
Figure 3: Full Success: Border Lines
Figure 4: Full Success: No Border Lines
Figure 5: Full Success: Inline Table
...and 11 more figures
