Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education
Duc-Vu Nguyen, Quoc-Nam Nguyen
TL;DR
The paper evaluates multiple choice symbol binding (MCSB) in Vietnamese multiple choice question answering (MCQA). It introduces ViGEText_17to23, a dataset of Vietnamese National High School Graduation Examination (VNHSGE) questions from 2017 to 2023 typed under strict LaTeX guidelines, and uses it alongside the existing ViMMRC 1.0 and ViMMRC 2.0 datasets. Six LLMs, including GPT-4, LLaMA-2, and BLOOMZ, are prompted in zero-shot, one-shot, and five-shot settings, with MCQA treated as selecting the most probable token among A, B, C, and D. The main contributions are a high-quality, LaTeX-compliant evaluation corpus and a cross-dataset evaluation in which GPT-4 shows the strongest MCSB performance, LLaMA-2-70B often ranks second, and BLOOMZ-7.1B-MT gives mixed results depending on the prompting setup. The result is a standardized Vietnamese MCSB benchmark for educational AI and NLP research that highlights how prompt strategy and context length shape performance, and that can guide future data collection and model tuning.
Abstract
In this paper, we evaluate the ability of large language models (LLMs) to perform multiple choice symbol binding (MCSB) for multiple choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. We focus on Vietnamese, which has fewer challenging MCQA datasets than English. The two existing datasets, ViMMRC 1.0 and ViMMRC 2.0, focus on literature. Recent research in Vietnamese natural language processing (NLP) has used the Vietnamese National High School Graduation Examination (VNHSGE) from 2019 to 2023 to evaluate ChatGPT. However, these studies have mainly examined how ChatGPT solves the VNHSGE step by step. We aim to create a novel, high-quality dataset by providing structured guidelines for typing LaTeX formulas for mathematics, physics, chemistry, and biology. Because it is typed in a strict LaTeX style, this dataset can be used to evaluate the MCSB ability of both LLMs and smaller language models (LMs). We focus on predicting the character (A, B, C, or D) that is the most likely answer to a question, given the context of the question. Our evaluation of six well-known LLMs, namely BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4, on the ViMMRC 1.0 and ViMMRC 2.0 benchmarks and on our proposed dataset shows promising results for the MCSB ability of LLMs in Vietnamese. The dataset is available for research purposes only.
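The answer-selection step described above, choosing the most likely character among A, B, C, and D, can be illustrated with a minimal Python sketch. It assumes that a model has already returned log-probabilities for candidate next tokens after the prompt; the function name `bind_symbol`, the input dictionary, and the numbers in the usage example are hypothetical and are not taken from the paper's implementation.

```python
import math

# The four answer symbols the model is expected to bind to.
OPTIONS = ("A", "B", "C", "D")


def bind_symbol(token_logprobs):
    """Pick the most probable option letter from next-token log-probabilities.

    token_logprobs: dict mapping candidate next tokens to log-probabilities
    (hypothetical input format, assumed for this sketch).
    Returns (best_letter, probabilities renormalized over A-D).
    """
    # Keep only the four option letters; a missing letter gets probability 0.
    scores = {opt: token_logprobs.get(opt, float("-inf")) for opt in OPTIONS}
    best = max(OPTIONS, key=lambda opt: scores[opt])
    peak = scores[best]
    if peak == float("-inf"):
        raise ValueError("no option letter found among the candidate tokens")
    # Softmax restricted to the four options, shifted by the peak for stability.
    total = sum(math.exp(s - peak) for s in scores.values())
    probs = {opt: math.exp(s - peak) / total for opt, s in scores.items()}
    return best, probs


if __name__ == "__main__":
    # Illustrative values only, not measured model output.
    example = {"A": -2.3, "B": -0.4, "C": -1.9, "D": -3.1, "The": -5.0}
    letter, distribution = bind_symbol(example)
    print(letter, distribution)
```

Restricting the comparison to the four option letters means tokens outside the answer set, such as "The" in the example, cannot be selected, which is one simple way to read a prediction off a model that was not tuned for MCQA.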
