MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing
Wenjie Wang, Wei Wu, Ying Liu, Yuan Zhao, Xiaole Lv, Liang Diao, Zengjian Fan, Wenfeng Xie, Ziling Lin, De Shi, Lin Huang, Kaihe Xu, Hong Li
TL;DR
MeDocVL addresses high-precision medical document parsing under noisy supervision by adapting a general vision-language model into a domain specialist. It introduces Training-driven Label Refinement (TLR), which produces Refined Data from noisy OCR/MLLM outputs through staged pseudo-label construction, correction distillation, and large-scale refinement. Noise-aware Hybrid Post-training (NHP) then blends token-level reinforcement learning with supervised fine-tuning (SFT): token-wise GRPO (Group Relative Policy Optimization) emphasizes token-level alignment, KL regularization stabilizes the updates, and SFT on clean data with dynamic prompt augmentation consolidates precision. Experiments on a medical invoice benchmark show state-of-the-art field-level extraction performance, robustness to annotation noise, and clear gains from the full pipeline, which suggests the approach is practical for scalable domain adaptation in high-stakes document parsing.
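To make the reinforcement-learning step more concrete, here is a minimal sketch of the two generic ingredients GRPO-style training usually relies on: group-normalized advantages and a clipped, KL-penalized surrogate. It is an illustration of standard GRPO, not the authors' implementation; the function names, the clipping range, and the `beta` weight are assumptions, and how MeDocVL assigns rewards at the token level is not specified in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt (an assumed setup; token-level rewards would be handled per token).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_kl_surrogate(ratio, advantage, kl, clip_eps=0.2, beta=0.05):
    """Per-token objective: clipped policy-ratio term minus a KL penalty.

    `ratio` is pi_new(token) / pi_old(token); `kl` is an estimate of the
    divergence from a reference policy. `clip_eps` and `beta` are
    illustrative values, not values reported for MeDocVL.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    policy_term = min(ratio * advantage, clipped_ratio * advantage)
    return policy_term - beta * kl


if __name__ == "__main__":
    advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
    print(advantages)
    print(clipped_kl_surrogate(ratio=1.5, advantage=advantages[0], kl=0.1))
```

Responses that score above their group's mean receive a positive advantage and are reinforced, while the KL term discourages the policy from drifting far from the reference model.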
Abstract
Medical document OCR is challenging because of complex layouts, domain-specific terminology, and noisy annotations, and it requires strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to parse such documents reliably. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement, which constructs high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
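The strict field-level exact matching mentioned above can be illustrated with a small scoring sketch. This is one plausible reading of such a metric, not the paper's evaluation code: the field names, the whitespace normalization, and the function names are assumptions made for the example.

```python
def normalize(value):
    """Collapse runs of whitespace; an assumed, deliberately mild normalization."""
    return " ".join(value.split())


def field_exact_match(predicted, gold):
    """Fraction of gold fields whose predicted value matches exactly.

    A field counts as correct only if the whole normalized string is
    identical, so "128.5" does not match "128.50".
    """
    if not gold:
        return 0.0
    hits = sum(
        1
        for name, value in gold.items()
        if normalize(predicted.get(name, "")) == normalize(value)
    )
    return hits / len(gold)


if __name__ == "__main__":
    # Hypothetical invoice fields, invented for illustration only.
    gold = {"patient_name": "Zhang San", "total_amount": "128.50", "invoice_no": "A123"}
    predicted = {"patient_name": "Zhang  San", "total_amount": "128.5", "invoice_no": "A123"}
    print(field_exact_match(predicted, gold))
```

Under this kind of scoring a single wrong character fails the whole field, which is why noisy labels are costly for both training and evaluation.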
