EXAMS: A Multi-Subject High School Examinations Dataset for Cross-Lingual and Multilingual Question Answering

Momchil Hardalov, Todor Mihaylov, Dimitrina Zlatkova, Yoan Dinkov, Ivan Koychev, Preslav Nakov

TL;DR

The paper introduces EXAMS, a benchmark for multilingual and cross-lingual question answering over high school exams. It contains over 24,000 questions in 16 languages and 24 subjects, including nearly 10,000 parallel questions. The benchmark provides a fine-grained evaluation framework with multilingual and cross-lingual splits plus subject-level analysis. Experiments with state-of-the-art multilingual models such as mBERT and XLM-R expose limitations in reasoning and knowledge transfer: multilingual fine-tuning yields gains on EXAMS, but substantial gaps remain in cross-language transfer and in leveraging external knowledge, which motivates domain-adaptive training and richer multilingual knowledge sources. The authors release the dataset, code, and pre-trained models to support further research on robust multilingual, domain-specific QA systems.

Abstract

We propose EXAMS, a new benchmark dataset for cross-lingual and multilingual question answering for high school examinations. We collected more than 24,000 high-quality high school exam questions in 16 languages, covering 8 language families and 24 school subjects from Natural Sciences and Social Sciences, among others. EXAMS offers a fine-grained evaluation framework across multiple languages and subjects, which allows precise analysis and comparison of various models. We perform various experiments with existing top-performing multilingual pre-trained models, and we show that EXAMS offers multiple challenges that require multilingual knowledge and reasoning in multiple domains. We hope that EXAMS will enable researchers to explore challenging reasoning and knowledge transfer methods and pre-trained models for school question answering in various languages, which was not possible before. The data, code, pre-trained models, and evaluation are available at https://github.com/mhardalov/exams-qa.
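
The fine-grained evaluation described above amounts to reporting accuracy separately per language and per subject. The following is a minimal sketch of that kind of breakdown, not the authors' evaluation code: the record fields (`language`, `subject`, `gold`, `predicted`) and the helper `accuracy_by` are hypothetical names chosen for illustration, and the four records are made-up toy data rather than EXAMS questions.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Group records by `key` and return the accuracy within each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = r[key]
        total[group] += 1
        if r["predicted"] == r["gold"]:
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}


# Toy, made-up predictions purely for illustration (not real EXAMS data).
# Field names are assumptions, not the schema of the released dataset.
records = [
    {"language": "bg", "subject": "Biology", "gold": "A", "predicted": "A"},
    {"language": "bg", "subject": "Physics", "gold": "C", "predicted": "B"},
    {"language": "hr", "subject": "Biology", "gold": "D", "predicted": "D"},
    {"language": "hr", "subject": "Biology", "gold": "B", "predicted": "A"},
]

by_language = accuracy_by(records, "language")
by_subject = accuracy_by(records, "subject")
print(by_language)
print(by_subject)
```

Grouping on a single key keeps the sketch small; passing a different key (or grouping on a language and subject pair) would give the finer breakdowns that a per-language, per-subject analysis needs.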

Paper Structure

This paper contains 50 sections, 1 equation, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Properties and examples from $E{\chi}\alpha{\mu}s$.
  • Figure 2: Relative sizes of the subjects. Those that cover less than 1.5% of the examples are in Other.
  • Figure 3: Relative sizes of reasoning types in $E{\chi}\alpha{\mu}s$.
  • Figure 4: Relative size of the $E{\chi}\alpha{\mu}s$ knowledge types.
  • Figure 5: Fine-grained evaluation by language and school subjects.
  • ...and 1 more figure