Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education

Duc-Vu Nguyen; Quoc-Nam Nguyen

TL;DR

The paper evaluates multiple choice symbol binding (MCSB) in Vietnamese multiple choice question answering (MCQA). It introduces ViGEText_17to23, a dataset typed under structured LaTeX guidelines that spans the 2017–2023 VNHSGE, and uses it alongside the existing ViMMRC 1.0 and ViMMRC 2.0 datasets. Six LLMs, including GPT-4, LLaMA-2, and BLOOMZ, are prompted in zero-shot, one-shot, and five-shot settings, with MCQA treated as selecting the most probable token among $A$, $B$, $C$, and $D$. The main contributions are a high-quality, LaTeX-compliant evaluation corpus and a cross-dataset evaluation in which GPT-4 shows the strongest MCSB performance, LLaMA-2-70B often ranks second, and BLOOMZ-7.1B-MT gives mixed results depending on the prompting setup. The resulting Vietnamese MCSB benchmark is relevant to educational AI and NLP research, and the results show that prompt strategy and context length both shape performance. These findings can guide future data collection and model tuning for Vietnamese.

Abstract

In this paper, we evaluate the ability of large language models (LLMs) to perform multiple choice symbol binding (MCSB) for multiple choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. We focus on Vietnamese, which has fewer challenging MCQA datasets than English. The two existing datasets, ViMMRC 1.0 and ViMMRC 2.0, focus on literature. Recent research in Vietnamese natural language processing (NLP) has used the Vietnamese National High School Graduation Examination (VNHSGE) from 2019 to 2023 to evaluate ChatGPT. However, these studies have mainly examined how ChatGPT solves the VNHSGE step by step. We aim to create a novel, high-quality dataset by providing structured guidelines for typing LaTeX formulas for mathematics, physics, chemistry, and biology. Because it is typed in a strict LaTeX style, this dataset can be used to evaluate the MCSB ability of both LLMs and smaller language models (LMs). We focus on predicting the character (A, B, C, or D) that is the most likely answer to a question, given the context of the question. Our evaluation of six well-known LLMs, namely BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.0, on the ViMMRC 1.0 and ViMMRC 2.0 benchmarks and on our proposed dataset shows promising results for the MCSB ability of LLMs in Vietnamese. The dataset is available for research purposes only.
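
The answer-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt template (the "Question:" and "Answer:" labels), the function names `format_item`, `build_prompt`, and `pick_answer`, and the assumption that a model exposes a log-probability for each option symbol as the next token are all hypothetical choices made for illustration.

```python
import math

# The four option symbols the model must bind an answer to.
OPTIONS = ("A", "B", "C", "D")


def format_item(question, choices, answer=None):
    """Render one question; leave the answer slot open when answer is None."""
    lines = [f"Question: {question}"]
    for symbol, text in zip(OPTIONS, choices):
        lines.append(f"{symbol}. {text}")
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(exemplars, question, choices):
    """Assemble a k-shot prompt: k solved exemplars, then one unsolved question.

    An empty exemplar list gives the zero-shot case.
    """
    blocks = [
        format_item(ex["question"], ex["choices"], ex["answer"])
        for ex in exemplars
    ]
    blocks.append(format_item(question, choices))
    return "\n\n".join(blocks)


def pick_answer(option_logprobs):
    """Pick the most probable option symbol.

    option_logprobs maps each of A, B, C, D to the (assumed available)
    log-probability of that symbol as the next token after the prompt.
    Returns the best symbol and probabilities renormalized over the options.
    """
    restricted = {s: option_logprobs[s] for s in OPTIONS}
    peak = max(restricted.values())  # subtract the max for numerical stability
    total = sum(math.exp(v - peak) for v in restricted.values())
    probs = {s: math.exp(v - peak) / total for s, v in restricted.items()}
    best = max(OPTIONS, key=lambda s: restricted[s])
    return best, probs


if __name__ == "__main__":
    shots = [{"question": "1 + 1 = ?", "choices": ["1", "2", "3", "4"], "answer": "B"}]
    print(build_prompt(shots, "2 + 2 = ?", ["3", "4", "5", "6"]))
    # Made-up log-probabilities standing in for a real model's output.
    print(pick_answer({"A": -2.0, "B": -0.5, "C": -3.0, "D": -4.0})[0])
```

Restricting the comparison to the four symbols means the prediction is always a valid option, even when the model would otherwise assign most of its probability mass to some other token.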

Paper Structure (15 sections, 1 equation, 8 figures, 3 tables)

Introduction
Related Work
Datasets
ViMMRC 1.0
ViMMRC 2.0
Our Proposed Dataset: ViGEText_17to23
Experiments
Baseline models
Setup
Results
Experiments results on ViMMRC
Experiments results on our proposed dataset
Discussion
Conclusion and Future Work
Zero-Shot Prompts for Our Dataset

Figures (8)

Figure 1: A mathematics example of one-shot learning of our proposed dataset. In this one-shot learning example, there is one instruction example and one initially incomplete example.
Figure 2: Distribution of sequence lengths, measured in Vietnamese word units using VnCoreNLP (Vu et al., 2018), for both raw data and preprocessed data, to ensure they do not exceed the maximum sequence length allowed by GPT-4.
Figure 3: Performance scores on ViMMRC 2.0 for LLMs with varying maximum sequence lengths and numbers of exemplars.
Figure 4: Average performance scores of LLMs on our proposed dataset from 2017 to 2023. The lighter-shaded bottom portion of each column denotes the second half of each test, which, according to the Vietnamese Ministry of Education, is consistently more challenging than the first half.
Figure 5: A Mathematics example.
...and 3 more figures
