FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation
Yueru He, Xueqing Peng, Yupeng Cao, Yan Wang, Lingfei Qian, Haohang Li, Yi Han, Ruoyu Xiang, Mingquan Lin, Prayag Tiwari, Jimin Huang, Guojun Xiong, Sophia Ananiadou
TL;DR
FinCriticalED introduces the first fact-level visual benchmark for financial OCR, addressing the critical need to preserve numerical and temporal facts in dense financial documents. It combines a 500-document, 739-fact annotated dataset with ground-truth HTML and an LLM-as-Judge evaluation pipeline to quantify factual correctness, and it benchmarks state-of-the-art OCR systems, open-source vision–language models, and proprietary systems. The study reveals that traditional lexical metrics fail to capture financial fidelity, with temporal facts generally more robust than numerical ones, and proprietary models achieving the highest factual accuracy while open models rapidly close the gap. Together, these contributions establish a rigorous, domain-focused foundation for assessing and advancing factual reliability in financial OCR and related precision-critical domains.
Abstract
We introduce FinCriticalED (Financial Critical Error Detection), a visual benchmark for evaluating OCR and vision-language models on financial documents at the fact level. Financial documents contain visually dense, table-heavy layouts in which numerical and temporal information is tightly coupled with structure. In high-stakes settings, small OCR mistakes such as sign inversion or shifted dates can lead to materially different interpretations, while traditional OCR metrics like ROUGE and edit distance capture only surface-level text similarity. FinCriticalED provides 500 image-HTML pairs annotated by experts with over seven hundred numerical and temporal facts. It makes three key contributions. First, it establishes the first fact-level evaluation benchmark for financial document understanding, shifting evaluation from lexical overlap to domain-critical factual correctness. Second, all annotations are created and verified by financial experts with strict quality control over signs, magnitudes, and temporal expressions. Third, we develop an LLM-as-Judge evaluation pipeline that performs structured fact extraction and contextual verification for visually complex financial documents. We benchmark OCR systems, open-source vision-language models, and proprietary models on FinCriticalED. Results show that although the strongest proprietary models achieve the highest factual accuracy, substantial errors remain in visually intricate numerical and temporal contexts. Through quantitative evaluation and expert case studies, FinCriticalED provides a rigorous foundation for advancing visual factual precision in financial and other precision-critical domains.
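The gap between surface-level similarity and fact-level correctness can be illustrated with a minimal sketch. This is not the benchmark's evaluation pipeline: the sentence, the fact list, and the function names below are invented for illustration, and a plain substring check stands in for the LLM-as-Judge verification step, which is an assumption made only to keep the example self-contained.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fact_accuracy(facts: list[str], ocr_text: str) -> float:
    """Fraction of annotated facts found verbatim in the OCR output.

    Hypothetical stand-in for a judge model: exact substring match only.
    """
    if not facts:
        return 0.0
    return sum(1 for fact in facts if fact in ocr_text) / len(facts)


# Invented example: parentheses mark a negative amount in the reference,
# and the OCR output drops them, inverting the sign of the figure.
reference = "Net loss was (1,234) thousand for the quarter ended March 31, 2023."
ocr_output = "Net loss was 1,234 thousand for the quarter ended March 31, 2023."
facts = ["(1,234)", "March 31, 2023"]  # one numerical fact, one temporal fact

similarity = edit_similarity(reference, ocr_output)
accuracy = fact_accuracy(facts, ocr_output)
print(f"edit similarity: {similarity:.3f}")
print(f"fact accuracy:   {accuracy:.2f}")
```

In this toy case the two strings differ by only two deleted characters, so the edit similarity stays above 0.95, while the fact-level score falls to 0.5 because the signed amount is no longer preserved. A substring check would also penalize harmless formatting differences, which is one reason contextual verification by a judge model is a more plausible design for real documents.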
