LongFin: A Multimodal Document Understanding Model for Long Financial Domain Documents

Ahmed Masry; Amir Hajian

TL;DR

LongFin addresses the gap between industrially relevant long financial documents and existing document AI models, which are typically restricted to short contexts. The paper introduces LongFin, a multimodal model with text and layout encoders connected by a BiACM bridge and augmented with sliding-window local attention and interval-based global attention to process up to 4096 tokens, alongside LongForms, a dataset for long-context NER on SEC Form 10-Qs. Pretraining on large OCR-annotated corpora and evaluation against public baselines show that LongFin outperforms existing public models on LongForms while remaining competitive on single-page benchmarks. This work advances practical finance-domain document understanding and paves the way for multi-language extension and deployment in real-world enterprise settings.
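
The combination of sliding-window local attention and interval-based global attention can be illustrated with a small mask-building sketch. It is not the authors' implementation: the function names, the window size, the global interval, and the choice to make global attention symmetric (global tokens attend to every token and are attended to by every token) are all assumptions made for illustration.

```python
def build_attention_mask(seq_len, window, global_interval):
    """Return a matrix where mask[i][j] is True if token i may attend to token j.

    Hypothetical sketch: `window` and `global_interval` are illustrative
    parameters, not values taken from the paper.
    """
    half = window // 2
    # Assumption: every `global_interval`-th token acts as a global token.
    global_positions = set(range(0, seq_len, global_interval))
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Local attention: neighbours within half a window on either side.
            is_local = abs(i - j) <= half
            # Global attention: any pair that involves a global token.
            is_global = i in global_positions or j in global_positions
            row.append(is_local or is_global)
        mask.append(row)
    return mask


def count_allowed_pairs(mask):
    """Count the (query, key) pairs that the mask lets through."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    mask = build_attention_mask(seq_len=12, window=4, global_interval=5)
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
    print(count_allowed_pairs(mask), "of", 12 * 12, "pairs are attended")
```

The printed pattern shows a diagonal band (local attention) crossed by full rows and columns at the global positions. With a fixed window and interval, the number of attended pairs grows far more slowly than the full quadratic count, which is what makes a 4096-token input practical.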

Abstract

Document AI is a growing research field that focuses on the comprehension and extraction of information from scanned and digital documents to make everyday business operations more efficient. Numerous downstream tasks and datasets have been introduced to facilitate the training of AI models capable of parsing and extracting information from various document types such as receipts and scanned forms. Despite these advancements, both existing datasets and models fail to address critical challenges that arise in industrial contexts. Existing datasets primarily comprise short documents consisting of a single page, while existing models are constrained by a limited maximum length, often set at 512 tokens. Consequently, the practical application of these methods in financial services, where documents can span multiple pages, is severely impeded. To overcome these challenges, we introduce LongFin, a multimodal document AI model capable of encoding up to 4K tokens. We also propose the LongForms dataset, a comprehensive financial dataset that encapsulates several industrial challenges in financial documents. Through an extensive evaluation, we demonstrate the effectiveness of the LongFin model on the LongForms dataset, surpassing the performance of existing public models while maintaining comparable results on existing single-page benchmarks.

Paper Structure (22 sections, 4 figures, 5 tables)

Introduction
Related Work
Document Datasets
Document AI Models
LongForms Dataset
Dataset Collection & Preparation
Dataset Description & Setup
LongFin Model
Architecture
Text Encoder
Layout Encoder
BiACM
Pretraining
Experiments & Evaluation
Tasks & Datasets
...and 7 more sections

Figures (4)

Figure 1: First page from a 4-page example financial form in the LongForms dataset. The information in these documents is spread over a mix of tables and text spanning multiple pages which makes it challenging for short-context models.
Figure 2: (a) The architecture of the LongFin model. It mainly consists of two encoders: text encoder and layout encoder which are connected through the BiACM layer. (b) A visualization of the employed local (sliding window) and global attention mechanisms to process long sequences.
Figure 3: LongFin pretraining loss curve. The loss starts at 2.84 and oscillates between 1.97 and 1.94 near convergence.
Figure 4: Page 6 of an example document from the LongForms test set. Because the original 6-page document cannot fit in a single forward pass of 512 tokens, it is split into several chunks, which loses important context. For example, in this table from the sixth page, the context from the top is crucial for deciding whether to extract the net change in cash entity, since only the quarterly ("Three months") periods are of interest.
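
The chunking problem described in the Figure 4 caption can be sketched with a few lines of Python. This is a simplified illustration under stated assumptions: the helper names are hypothetical, the chunks are naive and non-overlapping, and the document length and token positions are made-up numbers rather than values from LongForms.

```python
def split_into_chunks(tokens, max_len):
    """Split a token list into consecutive, non-overlapping chunks of max_len."""
    return [tokens[start:start + max_len] for start in range(0, len(tokens), max_len)]


def share_context(pos_a, pos_b, max_len):
    """True if two token positions fall into the same chunk."""
    return pos_a // max_len == pos_b // max_len


if __name__ == "__main__":
    # Hypothetical document: 3000 tokens, with a period header
    # (e.g. "Three months") at position 2500 and the table value
    # it qualifies at position 2600.
    tokens = [f"tok{i}" for i in range(3000)]
    header_pos, value_pos = 2500, 2600
    for max_len in (512, 4096):
        chunks = split_into_chunks(tokens, max_len)
        together = share_context(header_pos, value_pos, max_len)
        print(f"max_len={max_len}: {len(chunks)} chunk(s), "
              f"header and value in same chunk: {together}")
```

In this toy setup the 512-token limit separates the header from the value it qualifies, while a 4096-token limit keeps both in a single forward pass. Overlapping chunks would reduce but not remove the problem, since related tokens can be farther apart than any fixed overlap.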
