PM4Bench: Benchmarking Large Vision-Language Models with Parallel Multilingual Multi-Modal Multi-task Corpus

Junyuan Gao; Jiahe Song; Jiang Wu; Runchuan Zhu; Guanlin Shen; Shasha Wang; Xingjian Wei; Haote Yang; Songyang Zhang; Weijia Li; Bin Wang; Dahua Lin; Lijun Wu; Conghui He

TL;DR

PM4Bench tackles fair evaluation of multilingual LVLMs by introducing a strictly parallel multilingual multimodal benchmark across 10 languages and three tasks (MDUR, MIQA, MSOCR) with both traditional and vision input settings. It reveals that the vision setting substantially worsens performance and amplifies cross-lingual disparities, with OCR robustness emerging as a key bottleneck. Through a comprehensive evaluation of 10 LVLMs, PM4Bench shows that model scaling mitigates some cross-lingual gaps in vision, but OCR and script diversity remain central challenges. The work provides a practical, open benchmark for driving advances in multilingual OCR, cross-lingual alignment, and robust multimodal reasoning in LVLMs, with direct implications for real-world, language-diverse AI agents.

Abstract

While Large Vision-Language Models (LVLMs) demonstrate promising multilingual capabilities, their evaluation is currently hindered by two critical limitations: (1) the use of non-parallel corpora, which conflates inherent language capability gaps with dataset artifacts, precluding a fair assessment of cross-lingual alignment; and (2) disjointed multimodal inputs, which deviate from real-world scenarios where most texts are embedded within visual contexts. To address these challenges, we propose PM4Bench, the first Multilingual Multi-Modal Multi-task Benchmark constructed on a strictly parallel corpus across 10 languages. By eliminating content divergence, our benchmark enables a fair comparison of model capabilities across different languages. We also introduce a vision setting where textual queries are visually fused into images, compelling models to jointly "see," "read," and "think." Extensive evaluation of 10 LVLMs uncovers a substantial performance drop in the vision setting compared to standard inputs. Further analysis reveals that OCR capability is not only a general bottleneck but also contributes to cross-lingual performance disparities, suggesting that improving multilingual OCR is essential for advancing LVLM performance. We will release PM4Bench at https://github.com/opendatalab/PM4Bench.
