VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models

Xuan-Quy Dao; Ngoc-Bich Le; The-Duy Vo; Xuan-Dung Phan; Bac-Bien Ngo; Van-Tien Nguyen; Thi-My-Thanh Nguyen; Hong-Phuoc Nguyen

TL;DR

VNHSGE is an exam-grounded Vietnamese benchmark for evaluating large language models across nine school subjects, combining over 19,000 multiple-choice questions with 300 literary essays and supporting image-based tasks. The dataset is derived from Vietnamese Ministry of Education and Training materials, translated into English, and provided in Word and JSON formats, with mathematical content encoded in LaTeX to support reasoning tasks. Empirical results with ChatGPT and BingChat show human-level performance in literature, English, history, geography, and civics in several cases, but persistent gaps in mathematics and the natural sciences, where LLMs need stronger reasoning and calculation capabilities. As an annually updated, multi-format resource, VNHSGE aims to advance LLM development and robust evaluation in education, particularly for non-English, math-intensive domains.

Abstract

This article introduces the VNHSGE (VietNamese High School Graduation Examination) dataset, developed specifically for evaluating large language models (LLMs). The dataset covers nine subjects and was built from the Vietnamese National High School Graduation Examination and comparable tests. It contains more than 19,000 multiple-choice questions on a range of topics and 300 literary essays. Because it includes both text and accompanying images, the dataset assesses LLMs on multiple tasks, including question answering, text generation, reading comprehension, and visual question answering. We evaluated ChatGPT and BingChat on the VNHSGE dataset and compared their performance with that of Vietnamese students. The results show that both models perform at a human level in several subjects, including literature, English, history, geography, and civics education. They still have room to improve, however, especially in mathematics, physics, chemistry, and biology. With its broad coverage and variety of tasks, the VNHSGE dataset seeks to provide an adequate benchmark for assessing the abilities of LLMs. By making this dataset available to the scientific community, we intend to promote future LLM development, especially in addressing LLMs' limitations in mathematics and the natural sciences.
