Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion
Bruno Rigal, Victor Dupriez, Alexis Mignon, Ronan Le Hy, Nicolas Mery
TL;DR
This work targets the challenge of turning French PDF pages into Markdown for Retrieval-Augmented Generation pipelines by introducing a French-focused benchmark assembled via adversarial sampling across handwriting, forms, and complex layouts. It pairs a unit-test–based evaluation framework with normalization steps and a unified processing pipeline to compare 15 vision-language models, highlighting where errors arise and how layout and reading order affect downstream tasks. The study finds that proprietary models excel on difficult content such as handwriting and forms, while open-weight models perform relatively well on standard printed layouts; the main bottlenecks are reading-order and content preservation in layout-heavy pages. The benchmark is designed as a living resource to guide model development and evaluation in real-world document parsing for downstream AI systems.
Abstract
This report evaluates PDF-to-Markdown conversion using recent Vision-Language Models (VLMs) on challenging French documents. Document parsing is a critical step for Retrieval-Augmented Generation (RAG) pipelines, where transcription and layout errors propagate to downstream retrieval and grounding. Existing benchmarks often emphasize English or Chinese and can over-penalize benign formatting and linearization choices (e.g., line breaks, list segmentation, alternative table renderings) that are largely irrelevant for downstream use. We introduce a French-focused benchmark of difficult pages selected via model-disagreement sampling from a corpus of 60,000 documents, covering handwritten forms, complex layouts, dense tables, and graphics-rich pages. Evaluation is performed with unit-test-style checks that target concrete failure modes (text presence, reading order, and local table constraints), combined with category-specific normalization designed to discount presentation-only variance. Across 15 models, we observe substantially higher robustness for the strongest proprietary models on handwriting and forms, while several open-weight systems remain competitive on standard printed layouts.
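To make the idea of unit-test-style checks concrete, the following is a minimal sketch of what text-presence, reading-order, and local table checks might look like once presentation-only variance is normalized away. It is an illustration under stated assumptions, not the benchmark's actual implementation: the function names (`normalize`, `text_present`, `in_reading_order`, `same_table_row`), the normalization rules, and the sample page are all hypothetical.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization: Unicode form, Markdown markup, case, whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[*_#`|>]", " ", text)  # drop common Markdown markup characters
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def text_present(markdown: str, snippet: str) -> bool:
    """Text-presence check: the snippet must appear somewhere in the output."""
    return normalize(snippet) in normalize(markdown)


def in_reading_order(markdown: str, first: str, second: str) -> bool:
    """Reading-order check: `first` must occur before `second`."""
    doc = normalize(markdown)
    i = doc.find(normalize(first))
    j = doc.find(normalize(second))
    return i != -1 and j != -1 and i < j


def same_table_row(markdown: str, cell_a: str, cell_b: str) -> bool:
    """Local table check: both cells must sit in the same pipe-table row."""
    a, b = normalize(cell_a), normalize(cell_b)
    for line in markdown.splitlines():
        if "|" not in line:
            continue  # not a pipe-table row
        cells = [normalize(c) for c in line.strip().strip("|").split("|")]
        if a in cells and b in cells:
            return True
    return False


# Hypothetical model output for a French invoice page.
page = (
    "# Facture n° 42\n\n"
    "| Désignation | Montant |\n"
    "|---|---|\n"
    "| **Total TTC** | 1 200,00 € |\n\n"
    "Merci de votre\nconfiance."
)

print(text_present(page, "Merci de votre confiance."))    # line break is ignored
print(in_reading_order(page, "Facture", "Total TTC"))     # title precedes the total
print(same_table_row(page, "Total TTC", "1 200,00 €"))    # label and amount share a row
```

Because each check asserts only one local property, a model is not penalized for benign choices such as a different line-break position or bold markup around a cell, which is the motivation given above for preferring such checks over whole-page string similarity.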
