SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction
Ethan Bradley, Muhammad Roman, Karen Rafferty, Barry Devereux
TL;DR
SynFinTabs addresses the shortage of labeled financial-table data by generating 100k synthetic financial tables with word- and cell-level annotations and multiple representation formats. The authors fine-tune LayoutLM on an extractive table question-answering task to produce FinTabQA, showing that synthetic, layout-aware data improves information extraction from table images and compares favorably with strong baselines under certain conditions. The work also highlights how OCR quality affects end-to-end systems and provides a scalable pipeline for generating domain-specific table data, with practical implications for financial document processing and transfer to other domains. The contributions are a synthetic data generator, an annotated financial-table corpus, and a QA-oriented model, which together enable more robust table information extraction without exposing private documents.
Abstract
Table extraction from document images is a challenging AI problem, and labelled data for many content domains is difficult to come by. Existing table extraction datasets often focus on scientific tables because of the vast number of academic articles that are readily available, along with their source code. However, there are significant layout and typographical differences between tables found across scientific, financial, and other domains. Current datasets often lack the words contained within the tables and their positions, relying instead on unreliable OCR to extract these features for training modern machine learning models on natural language processing tasks. Therefore, there is a need for a more general method of obtaining labelled data. We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our hope is that our method of generating these synthetic tables is transferable to other domains. To demonstrate the effectiveness of our dataset in training models to extract information from table images, we create FinTabQA, a layout large language model trained on an extractive question-answering task. We test our model on real-world financial tables, compare it to a state-of-the-art generative model, and discuss the results. We make the dataset, model, and dataset generation code publicly available.
