MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing

Wenjie Wang, Wei Wu, Ying Liu, Yuan Zhao, Xiaole Lv, Liang Diao, Zengjian Fan, Wenfeng Xie, Ziling Lin, De Shi, Lin Huang, Kaihe Xu, Hong Li

TL;DR

MeDocVL tackles high-precision medical document parsing under noisy supervision by converting a general vision–language model into a domain specialist. It introduces Training-driven Label Refinement (TLR), which produces Refined Data from noisy OCR/MLLM outputs in three stages: pseudo-label construction, correction distillation, and large-scale refinement. This is followed by Noise-aware Hybrid Post-training (NHP), which blends token-level reinforcement learning with supervised fine-tuning (SFT). The reinforcement learning objective emphasizes token-level alignment via token-wise GRPO and stabilizes updates with KL regularization; SFT on clean data with dynamic prompt augmentation then consolidates precision. Experiments on a medical invoice benchmark show state-of-the-art field-level extraction performance, robustness to annotation noise, and clear gains from the full pipeline, suggesting the approach is practical for scalable domain adaptation in high-stakes document parsing.
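
As a rough illustration of the GRPO component mentioned above, the sketch below computes group-relative advantages: each sampled response's reward is normalized against the mean and standard deviation of its own group. This covers only the group normalization step that GRPO is generally built on. How MeDocVL assigns token-wise credit and weights the KL term is not specified in this summary, so the function name, the `eps` parameter, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Assumption: `rewards` holds one scalar reward per sampled response
    for the same query. This is not the paper's token-wise formulation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ here
    # eps guards against division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled parses: 1.0 = all fields matched, 0.0 = not
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group mean receive positive advantages and those below receive negative ones, so the update does not need a separately learned value function.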

Abstract

Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to parse such documents reliably. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement, which constructs high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning for robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
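
To make "strict field-level exact matching" concrete, here is a minimal sketch of such a metric: a field counts as correct only if the predicted string equals the reference string. The function name, the field names, and the whitespace stripping are illustrative assumptions, not the paper's evaluation protocol.

```python
def field_exact_match(pred: dict, gold: dict) -> float:
    """Fraction of reference fields whose predicted value matches exactly.

    Assumption: values are compared as strings after stripping
    surrounding whitespace; no other normalization is applied.
    """
    if not gold:
        return 0.0
    hits = sum(
        1
        for key, value in gold.items()
        if str(pred.get(key, "")).strip() == str(value).strip()
    )
    return hits / len(gold)


# Hypothetical invoice fields; "128.5" vs "128.50" counts as a miss
gold = {"patient_name": "Li Hua", "invoice_no": "0012345", "total_amount": "128.50"}
pred = {"patient_name": "Li Hua", "invoice_no": "0012345", "total_amount": "128.5"}
score = field_exact_match(pred, gold)
```

Under this kind of metric a single dropped digit or trailing zero fails the whole field, which is why small label errors in the training data matter so much for this task.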

Paper Structure

This paper contains 48 sections, 15 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Comparison of query-driven document parsing performance. The Public model is trained exclusively on publicly available datasets, while the Extended model is trained with additional non-public data to study the effect of increased training data scale.
  • Figure 2: Overall pipeline of the proposed MeDocVL framework. Large-scale raw industrial data are highly relevant but unsuitable for direct training due to noisy or missing annotations. MeDocVL employs Training-driven Label Refinement to convert raw data into reliable domain supervision, followed by Noise-aware Hybrid Post-training to adapt a general-purpose VLM into a high-precision, domain-specialized model.
  • Figure 3: Stage 1 — Pseudo-label construction and prompt synthesis. Expert-annotated documents are processed by OCR systems and MLLMs to produce structured but imperfect key-value predictions. These predictions are converted into instruction-style prompts that explicitly expose annotation errors, enabling the refinement model to learn correction behaviors rather than generating labels from scratch.
  • Figure 4: Stage 2 — Correction distillation training. The Annotation Refinement Model is trained to revise pseudo labels by comparing its predictions against expert annotations. The training objective emphasizes learning systematic correction behaviors while preserving the general multimodal representations of the base model.
  • Figure 5: Stage 3 — Large-scale refinement for noise-aware post-training. The trained Annotation Refinement Model is applied to large-scale raw documents to transform noisy annotations into refined supervision. Rather than eliminating all errors, the resulting Refined Data reduces systematic bias while preserving realistic annotation variability, serving as a stable supervision source for noise-aware post-training.
  • ...and 3 more figures