Agentar-Fin-OCR

Siyi Qian; Xiongfei Bai; Bingtao Fu; Yichen Lu; Gaoyang Zhang; Xudong Yang; Peng Zhang

TL;DR

Agentar-Fin-OCR is a document parsing system for financial documents, and FinDocBench is its accompanying benchmark; together they provide a practical foundation for reliable downstream financial document applications. The authors also evaluate a wide range of state-of-the-art models on FinDocBench to assess their capabilities and remaining limitations on financial documents.

Abstract

In this paper, we propose Agentar-Fin-OCR, a document parsing system tailored to financial-domain documents that transforms ultra-long financial PDFs into semantically consistent, highly accurate, structured outputs with auditing-grade provenance. To address finance-specific challenges such as complex layouts, cross-page structural discontinuities, and the need for cell-level referencing, Agentar-Fin-OCR combines (1) a Cross-page Contents Consolidation algorithm that restores continuity across pages, together with a Document-level Heading Hierarchy Reconstruction (DHR) module that builds a globally consistent Table of Contents (TOC) tree for structure-aware retrieval, and (2) a difficulty-adaptive curriculum learning strategy for table parsing, together with a CellBBoxRegressor module that uses structural anchor tokens to localize table cells from decoder hidden states without external detectors. In experiments, our model performs strongly on the table parsing metrics of OmniDocBench. To enable realistic evaluation in the financial vertical, we further introduce FinDocBench, a benchmark that covers six financial document categories with expert-verified annotations and whose evaluation metrics include Table of Contents edit-distance-based similarity (TocEDS), cross-page concatenated TEDS, and Table Cell Intersection over Union (C-IoU). We evaluate a wide range of state-of-the-art models on FinDocBench to assess their capabilities and remaining limitations on financial documents. Overall, Agentar-Fin-OCR and FinDocBench provide a practical foundation for reliable downstream financial document applications.
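
To make the two geometric and sequence-level ideas behind the metrics named in the abstract more concrete, the following is a minimal Python sketch. It is not the benchmark's actual implementation: the function names (box_iou, edit_distance, edit_similarity), the (x_min, y_min, x_max, y_max) box convention, and the treatment of a TOC as a flat list of heading strings are all assumptions made for illustration. The abstract does not give the exact definitions of C-IoU or TocEDS, which may aggregate over cells or operate on the TOC tree rather than on a flat sequence.

```python
from typing import Sequence, Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of items."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(pred: Sequence[str], ref: Sequence[str]) -> float:
    """Edit distance normalized to a similarity in [0, 1]."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - edit_distance(pred, ref) / longest


if __name__ == "__main__":
    # Hypothetical predicted vs. reference cell boxes.
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
    # Hypothetical predicted vs. reference heading lists.
    pred_toc = ["1 Overview", "1.1 Scope", "2 Risk Factors"]
    ref_toc = ["1 Overview", "2 Risk Factors"]
    print(edit_similarity(pred_toc, ref_toc))
```

In this simplified form, a spurious predicted heading costs one edit, so the score degrades in proportion to the longer of the two heading lists. A tree-based variant would additionally penalize headings placed at the wrong level.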

Paper Structure (37 sections, 10 equations, 13 figures, 8 tables)

Introduction
Related Work
Traditional OCR Pipelines
General Vision Language Models
OCR Specialized Vision Language Models
Method
Overall Framework
Cross-page Contents Consolidation
Document-Level Heading Hierarchy Reconstruction
Notations
Reconstruction Pipeline
Pseudo-TOC Aggregation
Document-Level Heading Hierarchy Reconstruction
Curriculum Learning and Reinforcement Optimization for Table Parsing
Table Parsing with Cell-Level Visual Reference
...and 22 more sections

Figures (13)

Figure 1: Agentar-Fin-OCR overall architecture.
Figure 3: Architecture of CellBBoxRegressor. CellBBoxRegressor grounds each table cell by regressing its bounding box from decoder hidden states anchored at cell-start tokens.
Figure 4: Overview of FinDocBench. It contains six financial document categories and provides a comprehensive evaluation focusing on ultra-long document parsing, hierarchical heading reconstruction, and advanced table recognition.
Figure 5: Typical cases of each financial document sub-category.
Figure 6: Page count distribution in the heading hierarchy reconstruction part.
...and 8 more figures