When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents

Virginie Mouilleron; Théo Lasnier; Djamé Seddah

TL;DR

The paper introduces Multimodal Finance Eval, a French-language benchmark for evaluating vision-language models on long, multimodal financial documents. It collects 1,204 questions covering text extraction, table reasoning, chart interpretation, and multi-turn dialogue anchored to document excerpts, and assesses six open-weight VLMs via an LLM-as-judge protocol. Results reveal strong extraction performance on text and tables but persistent weaknesses in chart interpretation and a pronounced error-propagation effect in multi-turn conversations, regardless of model size. The work highlights a gap between single-turn successes and robust multi-step reasoning in high-stakes finance, and provides a framework and dataset to drive progress toward more reliable, interactive document understanding in finance.

Abstract

Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85-90% accuracy), they struggle with chart interpretation (34-62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Multimodal Finance Eval offers a challenging benchmark to measure and drive progress in this high-stakes setting.

Paper Structure (30 sections, 17 figures, 2 tables)

Introduction
Related Works
French Evaluation Resources
NLP Work in the Finance Domain
Resource Construction
Dataset Collection
Task Overview and Dataset Composition
Dataset Statistics
Experimental Setup
Models.
Answer Generation.
Evaluation Protocol.
Results and Analysis
Discussion
Model Performance and Limitations
...and 15 more sections

Figures (17)

Figure 1: Overview of the Multimodal Finance Eval benchmark construction and evaluation pipeline. French financial documents (prospectuses, KIDs, PRIIPs) are collected from asset management companies, then processed to generate question-answer pairs spanning text, tables, and charts. Six vision-language models are evaluated on these tasks, with responses assessed using a majority-vote LLM-as-judge protocol.
Figure 2: Model accuracy on image-based question subcategories. Performance remains strong on table comprehension tasks (70-86%) but degrades substantially on chart interpretation (34-62%). Qwen3-VL-32B consistently outperforms other models across all visual modalities.
Figure 3: Model accuracy on text-based question subcategories by context length. All models achieve high performance (85-95%) on short and medium text contexts, with moderate degradation on larger contexts. Performance on tabular text (rightmost) remains competitive, indicating that text-based table comprehension is less challenging than image-based table interpretation.
Figure 4: Example table from a financial document.
Figure 5: Table comprehension example based on Figure 4.
...and 12 more figures
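
The Figure 1 caption mentions a majority-vote LLM-as-judge protocol. The following is a minimal sketch of how such votes might be aggregated into an accuracy score, not the authors' implementation. The function names, the boolean vote format, and the rule that ties count as incorrect are all assumptions made for illustration; the paper's actual judges, prompts, and tie-breaking are not described in this excerpt.

```python
from typing import Sequence


def majority_verdict(votes: Sequence[bool]) -> bool:
    """Return True when strictly more than half of the judge votes are True.

    Assumption: each judge emits a boolean "answer is correct" vote,
    and a tie is treated as incorrect.
    """
    if not votes:
        raise ValueError("at least one judge vote is required")
    return sum(votes) * 2 > len(votes)


def accuracy(per_question_votes: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions whose majority verdict is 'correct'."""
    if not per_question_votes:
        raise ValueError("at least one question is required")
    verdicts = [majority_verdict(votes) for votes in per_question_votes]
    return sum(verdicts) / len(verdicts)


if __name__ == "__main__":
    # Hypothetical votes from three judges on four questions.
    example_votes = [
        [True, True, False],
        [False, False, True],
        [True, True, True],
        [False, True, False],
    ]
    print(f"accuracy = {accuracy(example_votes):.2f}")
```

Using an odd number of judges avoids ties altogether, which keeps the tie-breaking assumption from affecting the reported score.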
