When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents

Virginie Mouilleron, Théo Lasnier, Djamé Seddah

TL;DR

The paper introduces Multimodal Finance Eval, a French-language benchmark for evaluating vision-language models on long, multimodal financial documents. It collects 1,204 questions covering text extraction, table reasoning, chart interpretation, and multi-turn dialogue anchored to document excerpts, and assesses six open-weight VLMs via an LLM-as-judge protocol. Results reveal strong extraction performance on text and tables but persistent weaknesses in chart interpretation and a pronounced error-propagation effect in multi-turn conversations, regardless of model size. The work highlights a gap between single-turn successes and robust multi-step reasoning in high-stakes finance, and provides a framework and dataset to drive progress toward more reliable, interactive document understanding in finance.

Abstract

Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85-90% accuracy), they struggle with chart interpretation (34-62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Multimodal Finance Eval offers a challenging benchmark to measure and drive progress in this high-stakes setting.
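
The abstract refers to an LLM-as-judge protocol, and the Figure 1 caption describes it as majority-vote. As a rough illustration only, the aggregation step might look like the sketch below. The label strings, the tie-breaking rule, and the function names `majority_vote` and `accuracy` are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter


def majority_vote(verdicts):
    """Aggregate several judge verdicts for one answer into a single label.

    Assumption: each verdict is the string "correct" or "incorrect".
    Ties are resolved conservatively to "incorrect" (hypothetical rule).
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    ranked = Counter(verdicts).most_common()
    # A tie exists when the two most frequent labels have equal counts.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "incorrect"
    return ranked[0][0]


def accuracy(per_question_verdicts):
    """Fraction of questions whose aggregated verdict is "correct"."""
    finals = [majority_vote(v) for v in per_question_verdicts]
    return sum(label == "correct" for label in finals) / len(finals)
```

Using an odd number of judges avoids ties altogether, which is one reason majority-vote setups often use three.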

Paper Structure (30 sections, 17 figures, 2 tables)

Figures (17)

  • Figure 1: Overview of the Multimodal Finance Eval benchmark construction and evaluation pipeline. French financial documents (prospectuses, KIDs, PRIIPs) are collected from asset management companies, then processed to generate question-answer pairs spanning text, tables, and charts. Six vision-language models are evaluated on these tasks, with responses assessed using a majority-vote LLM-as-judge protocol.
  • Figure 2: Model accuracy on image-based question subcategories. Performance remains strong on table comprehension tasks (70-86%) but degrades substantially on chart interpretation (34-62%). Qwen3-VL-32B consistently outperforms other models across all visual modalities.
  • Figure 3: Model accuracy on text-based question subcategories by context length. All models achieve high performance (85-95%) on short and medium text contexts, with moderate degradation on larger contexts. Performance on tabular text (rightmost) remains competitive, indicating that text-based table comprehension is less challenging than image-based table interpretation.
  • Figure 4: Example table from a financial document.
  • Figure 5: Table comprehension example based on Figure 4.
  • ...and 12 more figures