On the robustness of ChatGPT in teaching Korean Mathematics

Phuong-Nam Nguyen; Quang Nguyen-The; An Vu-Minh; Diep-Anh Nguyen; Xuan-Lam Pham

TL;DR

This study evaluates the robustness of ChatGPT in solving Korean-language mathematics problems drawn from the CSAT, using a dataset of $586$ questions and assessing both solving accuracy ($66.72\%$) and the model's ability to rate and categorize questions. It analyzes eleven rating criteria via Likert scales, conducts topic analysis, and examines correlations between predicted and actual difficulty, cognitive demand, time, and engagement. Key findings show strong alignment between ChatGPT's ratings and educational theory, but weaknesses in sequential reasoning and diagram-based tasks; bilingual prompt strategies and multi-step prompting can modestly improve performance. The work provides actionable insights for applying LLMs in multilingual math education, including adaptive-learning opportunities and design considerations for diagram-rich content, while outlining directions for future research and dataset development. The results underscore the potential and limitations of deploying AI-assisted tools in non-English, STEM-rich educational settings and highlight avenues to enhance accuracy and inclusivity through language-aware fine-tuning and domain-specific data.

Abstract

ChatGPT, an Artificial Intelligence model, has the potential to revolutionize education. However, its effectiveness in solving non-English questions remains uncertain. This study evaluates ChatGPT's robustness using 586 Korean mathematics questions. ChatGPT achieves 66.72% accuracy, correctly answering 391 out of 586 questions. We also assess its ability to rate mathematics questions based on eleven criteria and perform a topic analysis. Our findings show that ChatGPT's ratings align with educational theory and test-taker perspectives. While ChatGPT performs well in question classification, it struggles with non-English contexts, highlighting areas for improvement. Future research should address linguistic biases and enhance accuracy across diverse languages. Domain-specific optimizations and multilingual training could improve ChatGPT's role in personalized education.
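The reported accuracy follows directly from the counts stated above (391 correct out of 586). A minimal sketch of that arithmetic is shown here; the function name `accuracy_percent` is hypothetical and is not taken from the paper.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Return accuracy as a percentage, rounded to two decimal places."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    return round(100.0 * correct / total, 2)


# Counts as stated in the abstract.
CORRECT = 391
TOTAL = 586

reported = accuracy_percent(CORRECT, TOTAL)  # 66.72
incorrect = TOTAL - CORRECT                  # 195 questions answered incorrectly
print(f"accuracy = {reported}%, incorrect = {incorrect}")
```

The same helper can be applied per topic or per difficulty band, provided the corresponding counts are available.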
