
EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams

JaeSeong Kim, Chaehwan Lim, Sang Hyun Gil, Suan Lee

Abstract

We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representative Eurasian regions: South Korea, Japan, Taiwan, India, and the European Union. Designed to reflect the authentic complexity of public-sector assessments, the dataset contains over 8,000 high-resolution scanned multiple-choice questions covering 17 diverse academic and administrative domains. Unlike existing benchmarks, EuraGovExam embeds all question content, including problem statements, answer choices, and visual elements, within a single image, providing only a minimal standardized instruction for answer formatting. This design demands that models perform layout-aware, cross-lingual reasoning directly from visual input. All items are drawn from real exam documents, preserving rich visual structures such as tables, multilingual typography, and form-like layouts. Evaluation results show that even state-of-the-art vision-language models (VLMs) achieve only 86% accuracy, underscoring the benchmark's difficulty and its power to diagnose the limitations of current models. By emphasizing cultural realism, visual complexity, and linguistic diversity, EuraGovExam establishes a new standard for evaluating VLMs in high-stakes, multilingual, image-grounded settings. It also supports practical applications in e-governance, public-sector document analysis, and equitable exam preparation.

Paper Structure

This paper contains 74 sections, 10 equations, 16 figures, and 5 tables.

Figures (16)

  • Figure 1: EuraGovExam Dataset Construction Pipeline
  • Figure 2: Distribution of the EuraGovExam dataset.
  • Figure 3: Cross-regional performance analysis. (a) Model$\times$Region interaction heatmap: each cell shows $\Delta = \text{Region} - \text{Overall}$ accuracy; magenta = underperformance, blue = overperformance (values shown for $|\Delta| \geq 14$%p). Model names are color-coded by source type (blue: closed, orange: open). (b) Average accuracy by region with 95% Wilson CIs. (c) Ten most difficult domains.
  • Figure 4: Scaling analysis for open-source models. We analyze the relationship between parameter scale (in billions, $S$) and overall benchmark accuracy ($A$) for open-source VLMs using a linear regression in log scale. The dashed line shows the fitted trend, and the shaded region indicates a $\pm 1$ standard-deviation band of the regression residuals (an empirical $\pm 1$ SD band). We obtain $R^2=0.41$ with $p<0.01$, and observe an empirical trend of the form $A \approx 16.8\cdot \log(S)$.
  • Figure 5: Task difficulty by model tier (mean accuracy). We partition models into Top, Middle, and Bottom tiers by overall accuracy, and visualize per-task mean accuracy as a heatmap. This reveals whether the relative ordering of task difficulty is preserved as models improve, and which tasks remain persistent bottlenecks.
  • ...and 11 more figures
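The 95% Wilson confidence intervals used for the per-region accuracy bars in Figure 3(b) can be computed directly from each region's correct-answer count. The sketch below implements the standard Wilson score interval; the example counts are illustrative, not figures from the paper.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 -> ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical region: 860 correct out of 1,000 questions.
lo, hi = wilson_ci(860, 1000)
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and remains well-behaved for accuracies near 0% or 100%, which matters when some region/model cells contain few items.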
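The scaling fit in Figure 4 regresses overall accuracy $A$ on log-scaled parameter count $S$. A minimal ordinary-least-squares sketch of that fit is below; the model sizes and accuracies are synthetic stand-ins, not the paper's measurements.

```python
import math

# Synthetic (size in billions, accuracy) points lying exactly on
# A = 16.8 * log(S) + 5 -- illustrative only, not the paper's data.
sizes = [2.0, 7.0, 13.0, 34.0, 72.0]
accs = [16.8 * math.log(s) + 5.0 for s in sizes]

# Ordinary least squares on x = log(S): A ≈ slope * log(S) + intercept.
xs = [math.log(s) for s in sizes]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(accs) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accs)) \
    / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
```

On real benchmark scores the residuals would not vanish, and the $\pm 1$ SD band in Figure 4 is exactly the spread of those residuals around the fitted line.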